math: new log

from https://github.com/ARM-software/optimized-routines,
commit 04884bd04eac4b251da4026900010ea7d8850edc
Assume __FP_FAST_FMA implies __builtin_fma is inlined as a single
instruction.
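
A minimal sketch of that assumption, for illustration only. The helper
name mul_add is hypothetical and is not taken from the imported code; it
only shows how a predefined __FP_FAST_FMA could select between a fused
and an unfused multiply-add:

```c
#include <math.h>

/* Hypothetical helper (not from the commit): computes a*b + c. */
static inline double mul_add(double a, double b, double c)
{
#ifdef __FP_FAST_FMA
	/* The compiler advertises a fast fused multiply-add, so the
	   builtin is assumed to compile to one instruction rather than
	   a call into libm. The result is rounded once. */
	return __builtin_fma(a, b, c);
#else
	/* No fast fma advertised: plain multiply followed by add. */
	return a * b + c;
#endif
}
```

Both paths give the same result whenever a*b is exactly representable,
e.g. mul_add(2.0, 3.0, 4.0) is 10.0; they can differ in the last bit
otherwise, since the fused form skips the intermediate rounding.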
code size change: +4588 bytes (+2540 bytes with fma).
benchmark on x86_64 before, after, speedup:

-Os:
  log rthruput:  12.61 ns/call  7.95 ns/call 1.59x
   log latency:  41.64 ns/call 23.38 ns/call 1.78x
-O3:
  log rthruput:  12.51 ns/call  7.75 ns/call 1.61x
   log latency:  41.82 ns/call 23.55 ns/call 1.78x
3 files changed