Move reduction loop into intrinsics.

Reductions are currently interpreted as a sequential algorithm, but
some of them could be made parallel in the future.

This also makes reductions more similar to other instructions.

Test: berberis_all

Change-Id: I8d0847b78a7b723399ee1176026428b674387ea8
2 files changed