Improve performance of row-major-dense-matrix * vector products for recent CPUs.
This revised version no longer bothers with aligned loads/stores,
and instead processes 8 rows at once for better instruction pipelining.
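
For illustration only, the idea of processing 8 rows at once can be sketched in plain C++ as below. This is not the kernel from this change: the function name `matvec_rowmajor_8`, the `std::vector<double>` storage, and the scalar accumulators are all assumptions made for the sketch, and the real code presumably uses packet (SIMD) loads. Keeping 8 independent accumulators gives the CPU 8 separate dependency chains to overlap, and each `x[j]` is loaded once and reused for all 8 rows.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch (not the actual kernel): y = A * x for a
// row-major rows x cols matrix A stored contiguously in a vector.
std::vector<double> matvec_rowmajor_8(const std::vector<double>& A,
                                      const std::vector<double>& x,
                                      std::size_t rows, std::size_t cols) {
    std::vector<double> y(rows, 0.0);
    std::size_t i = 0;

    // Main loop: 8 rows per iteration, one independent accumulator per row.
    for (; i + 8 <= rows; i += 8) {
        double acc[8] = {0, 0, 0, 0, 0, 0, 0, 0};
        const double* r = A.data() + i * cols;  // start of row i
        for (std::size_t j = 0; j < cols; ++j) {
            const double xj = x[j];             // loaded once, reused 8 times
            for (int k = 0; k < 8; ++k)
                acc[k] += r[k * cols + j] * xj; // plain (unaligned) access
        }
        for (int k = 0; k < 8; ++k)
            y[i + k] = acc[k];
    }

    // Tail: remaining rows (fewer than 8), one at a time.
    for (; i < rows; ++i) {
        double acc = 0.0;
        const double* r = A.data() + i * cols;
        for (std::size_t j = 0; j < cols; ++j)
            acc += r[j] * x[j];
        y[i] = acc;
    }
    return y;
}
```

The sketch makes no alignment assumptions about `A` or `x`, which mirrors the decision above to stop special-casing aligned loads/stores.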
1 file changed