math: new log

from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc

Assume __FP_FAST_FMA implies __builtin_fma is inlined as a single
instruction.
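
The assumption above can be sketched as follows. This is only an
illustration, not the code from this commit: the helper name mul_add is
hypothetical, and the fallback branch is an assumed way to handle
targets without fast fma.

```c
/* Minimal sketch: mul_add is a hypothetical helper, not part of the commit. */
static inline double mul_add(double a, double b, double c)
{
#ifdef __FP_FAST_FMA
	/* Compiler-defined macro: fma is assumed to be a single instruction here. */
	return __builtin_fma(a, b, c);
#else
	/* Assumed fallback: separate multiply and add, rounded twice. */
	return a * b + c;
#endif
}
```

The two branches can differ in the last bit because fma rounds only
once, so code using such a helper has to tolerate either result.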

code size change: +4588 bytes (+2540 bytes with fma).
benchmark on x86_64 (before, after, speedup):

-Os:
   log rthruput:  12.61 ns/call  7.95 ns/call 1.59x
    log latency:  41.64 ns/call 23.38 ns/call 1.78x
-O3:
   log rthruput:  12.51 ns/call  7.75 ns/call 1.61x
    log latency:  41.82 ns/call 23.55 ns/call 1.78x
3 files changed