67b5f524ef188efb77f0dcff6e6bf483827e2f82 - platform/frameworks/rs

commit	67b5f524ef188efb77f0dcff6e6bf483827e2f82
author	Miao Wang <miaowang@google.com>	Thu Sep 08 11:53:32 2016 -0700
committer	Miao Wang <miaowang@google.com>	Thu Jan 26 08:43:06 2017 -0800
tree	e389fd2f60c8e4b943758d86c58002cbff9d97ec
parent	4c41224c8c28f35f48f1d80394be715f9ad30e06

Implement multi-thread CPU GEMM for BLAS Intrinsics

- Multi-thread GEMM utilizes the existing RS thread pool on top of Eigen.
- Large matrix-matrix multiplication is decomposed into multiple tiled matrix-matrix multiplications. Each thread iterates over the unfinished work items.
- The tiling applies to ONLY ONE dimension of each input matrix; whether X or Y is tiled depends on whether the matrix is transposed.
- For sufficiently large matrices, the performance increase is proportional to the number of available CPU cores.

Test: CTS test (rsblas) passes on Angler, Fugu and new devices. Performance test with RsBlasBenchmark and the RsNeuralNet demo on Angler, Ryu, Seed, Shamu, Volantis, Fugu and new devices, showing roughly 70% (Volantis, 2 cores) to 400+% (Angler, 8 cores) perf gain.

Change-Id: If96f4119fd34d5d9d98a2542801495e7ffe577ae
(cherry picked from commit 41ab8faaf0d90238d42d8e2bbb7177467c10b4f6)
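
The decomposition described in the commit message can be illustrated with a minimal standalone sketch. This is not the code from this change: `tiledGemm`, `tileRows`, and `numThreads` are hypothetical names, plain `std::thread` workers stand in for the RS thread pool, and a naive triple loop stands in for the per-tile Eigen GEMM. It assumes row-major, non-transposed inputs and tiles only the row dimension of A (and therefore of C). Workers claim unfinished tiles from a shared atomic counter.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not the RenderScript API): computes C = A * B for
// row-major float matrices, where A is m x k, B is k x n, and C is m x n.
// Only one dimension (the rows of A and C) is tiled; each tile is an
// independent, smaller matrix-matrix multiplication.
void tiledGemm(const std::vector<float>& a, const std::vector<float>& b,
               std::vector<float>& c, std::size_t m, std::size_t n,
               std::size_t k, std::size_t tileRows, unsigned numThreads) {
    assert(a.size() == m * k && b.size() == k * n && c.size() == m * n);
    assert(tileRows > 0 && numThreads > 0);

    const std::size_t numTiles = (m + tileRows - 1) / tileRows;
    std::atomic<std::size_t> nextTile{0};  // index of the next unclaimed tile

    auto worker = [&]() {
        // Each thread keeps claiming unfinished tiles until none remain.
        for (;;) {
            const std::size_t tile = nextTile.fetch_add(1);
            if (tile >= numTiles) break;
            const std::size_t rowBegin = tile * tileRows;
            const std::size_t rowEnd = std::min(rowBegin + tileRows, m);
            // Naive per-tile GEMM; a stand-in for the Eigen kernel.
            for (std::size_t i = rowBegin; i < rowEnd; ++i) {
                for (std::size_t j = 0; j < n; ++j) {
                    float sum = 0.0f;
                    for (std::size_t p = 0; p < k; ++p) {
                        sum += a[i * k + p] * b[p * n + j];
                    }
                    c[i * n + j] = sum;
                }
            }
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
}
```

Because tiles cover disjoint rows of C, the workers never write to the same element and need no locking beyond the atomic counter. Pulling tiles from a shared counter, rather than assigning a fixed range per thread, lets faster cores take on more tiles.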