ARM64: toHalf() intrinsic for ARMv8

This CL implements an intrinsic for the toHalf() method using
ARMv8.2 FP16 instructions.

This intrinsic implementation is bit-for-bit compatible with the
original Java implementation, android.util.Half.toHalf().
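
To illustrate what a bit-level result means here, the following is a
minimal software sketch of a float-to-FP16 conversion with
round-to-nearest-even. It is not the android.util.Half source; the class
name HalfSketch, the method name toHalfBits, and the choice of 0x7E00 as
the NaN pattern are assumptions made for illustration only.

```java
public final class HalfSketch {
    /**
     * Converts a float to IEEE 754 binary16 bits, rounding to nearest even.
     * Hypothetical reference code, not the platform implementation.
     */
    static short toHalfBits(float f) {
        int bits = Float.floatToRawIntBits(f);
        int sign = (bits >>> 16) & 0x8000;   // sign moved to bit 15
        int exp = (bits >>> 23) & 0xFF;      // biased float exponent
        int man = bits & 0x7FFFFF;           // 23-bit float mantissa

        if (exp == 0xFF) {
            // Infinity or NaN. Assumption: every NaN maps to a quiet NaN.
            return (short) (sign | 0x7C00 | (man != 0 ? 0x200 : 0));
        }

        int e = exp - 127 + 15;              // rebias for FP16
        if (e >= 0x1F) {
            return (short) (sign | 0x7C00);  // too large: infinity
        }
        if (e <= 0) {
            if (e < -10) {
                return (short) sign;         // below half of 2^-24: zero
            }
            // FP16 subnormal: shift the full 24-bit significand into place.
            man |= 0x800000;
            int shift = 14 - e;
            int half = man >>> shift;
            int rem = man & ((1 << shift) - 1);
            int halfway = 1 << (shift - 1);
            if (rem > halfway || (rem == halfway && (half & 1) != 0)) {
                half++;                      // may carry into the smallest normal
            }
            return (short) (sign | half);
        }

        // FP16 normal: keep the top 10 mantissa bits, round on the lower 13.
        int half = (e << 10) | (man >>> 13);
        int rem = man & 0x1FFF;
        if (rem > 0x1000 || (rem == 0x1000 && (half & 1) != 0)) {
            half++;                          // a carry into the exponent yields infinity
        }
        return (short) (sign | half);
    }

    public static void main(String[] args) {
        float[] samples = {1.0f, 65504.0f, 65535.0f, Float.intBitsToFloat(0x33800032)};
        for (float s : samples) {
            System.out.printf("%s -> 0x%04X%n", s, toHalfBits(s) & 0xFFFF);
        }
    }
}
```

With this sketch, toHalfBits(1.0f) yields 0x3C00 and toHalfBits(65504.0f)
yields 0x7BFF. A bit-level check would compare such reference outputs
against the intrinsic's outputs for every input in the tested range.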

The time required to execute the benchmarkToHalf() code below on a Pixel 3:
- Java implementation android.util.Half.toHalf():
    - big cluster only: 2136ms
    - little cluster only: 6442ms
- arm64 intrinsic implementation:
    - big cluster only: 1347ms (~37% faster)
    - little cluster only: 4937ms (~23% faster)

int benchmarkToHalf() {
    int result = 0;
    // 5.9605E-8 is the smallest positive subnormal number that can be
    // represented by FP16. This is 0x33800032 in float bits.
    int raw_input = 0x33800032;
    long before = 0;
    long after = 0;
    before = System.currentTimeMillis();
    do {
        float input = Float.intBitsToFloat(raw_input);
        short output = FP16.toHalf(input);
        result += output;
    } while (++raw_input != 0x477fff00);
    // 0x477fff00 is 65535.0f in float bits. This is above 65504, the largest
    // finite FP16 value, so the range also covers inputs that overflow.
    after = System.currentTimeMillis();
    System.out.println("Time of FP16.toHalf (ms): " + (after - before));
    return result;
}
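
The percentages quoted for the intrinsic can be read as the reduction in
run time relative to the Java implementation. A small sketch of that
arithmetic, using only the timings reported in this message (the class
name Speedup and the method name percentFaster are hypothetical):

```java
public final class Speedup {
    /** Percentage reduction in run time relative to the baseline. */
    static double percentFaster(long baselineMs, long intrinsicMs) {
        return (baselineMs - intrinsicMs) * 100.0 / baselineMs;
    }

    public static void main(String[] args) {
        // Timings are the Pixel 3 measurements reported in this message.
        System.out.printf("big cluster: %.1f%%%n", percentFaster(2136, 1347));
        System.out.printf("little cluster: %.1f%%%n", percentFaster(6442, 4937));
    }
}
```

This gives about 36.9% for the big cluster and 23.4% for the little
cluster, which round to the ~37% and ~23% figures.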

Test: 580-fp16
Test: art/test/testrunner/run_build_test_target.py -j80 art-test-javac
Test: test-art-host, test-art-target

Change-Id: I69b152682390e5ffa5b3fdca60b496261191655d