Implement multi-thread CPU GEMM for BLAS Intrinsics
- Multi-thread GEMM utilizes existing RS thread pool on top of
Eigen.
- Large matrix-matrix multiplication is decomposed into multiple
tiled matrix-matrix multiplications. Each thread iterates on
the unfinished works.
- The tiling applies to ONLY ONE dimension of each input matrix,
and whether to tile X or Y depends on the transpose of the matrix.
- The performance increase is proportional to the number of
available CPU cores, for sufficiently large matrices.
Test: CTS test (rsblas) pass on Angler, Fugu and new devices.
Performance test with RsBlasBenchmark and RsNeuralNet demo
on Anger, Ryu, Seed, Shamu, Volantis, Fugu and new devices,
showing roughly 70%(Volantix 2 core) ~ 400+%(Angler 8 core) perf gain.
Change-Id: If96f4119fd34d5d9d98a2542801495e7ffe577ae
(cherry picked from commit 41ab8faaf0d90238d42d8e2bbb7177467c10b4f6)
3 files changed