Optimization: added super efficient rowmajor * vector product (and vector * colmajor).
It basically performs 4 dot products at once reducing loads of the vector and improving
instructions scheduling. With 3 cache friendly algorithms, we now handle all product
configurations with outstanding perf for large matrices.
4 files changed