math: add scalar erff.

In round-to-nearest mode the maximum error is 1.09 ULP.

Compared to glibc-2.28 erff: throughput is about 2.2x better,
latency is about 1.5x better on some AArch64 cores (on random
input in [-4,4]).

There are further optimization and quality improvement opportunities.
8 files changed