Fix performance when reading or writing large buffers.

The Blur intrinsic, which uses ~25 MB of data, would spill the
L2 cache; a smarter walking pattern reduces this hit.  We now
vary the chunk size based on both the processor count and the
data size.
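
As a rough illustration of the idea (not the code in this change), a
chunk-size heuristic of this kind might look like the sketch below.
The function name, the cache budget parameter and the chunks-per-core
constant are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper, not the actual driver code: choose how many rows
// each worker thread takes per chunk.
std::size_t chooseChunkRows(std::size_t totalRows,
                            std::size_t bytesPerRow,
                            std::size_t cpuCount,
                            std::size_t cacheBudgetBytes) {
    assert(cpuCount > 0 && bytesPerRow > 0);

    // Processor count: aim for several chunks per core so that one slow
    // core does not leave the others idle at the end of the launch.
    const std::size_t chunksPerCpu = 4;  // assumed tuning constant
    std::size_t rows = totalRows / (cpuCount * chunksPerCpu);

    // Data size: cap the chunk so that the chunks in flight on all cores
    // together stay within the cache budget.
    const std::size_t maxRows = cacheBudgetBytes / (cpuCount * bytesPerRow);
    rows = std::min(rows, maxRows);

    // Always make progress, even for tiny inputs or very wide rows.
    return std::max<std::size_t>(rows, 1);
}
```

With made-up inputs of 4096 rows of 6144 bytes each (about 25 MB),
4 cores and a 1 MiB budget, the cache cap wins and each chunk is 42
rows; for small buffers the per-core split is the limiting term.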

N7: execution time drops from 1959 ms to 930 ms
Mako: 470 ms to 385 ms
Manta: no change

