Implement multi-thread CPU GEMM for BLAS Intrinsics

 - Multi-thread GEMM utilizes existing RS thread pool on top of
 - Large matrix-matrix multiplication is decomposed into multiple
   tiled matrix-matrix multiplications. Each thread iterates over
   the tiles that have not yet been processed.
 - The tiling applies to ONLY ONE dimension of each input matrix,
   and whether to tile X or Y depends on the transpose of the matrix.
 - For sufficiently large matrices, the performance gain scales
   roughly with the number of available CPU cores.
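
The tiling scheme described above can be sketched roughly as follows. This is
an illustration only, not the code in this change: it uses std::thread in place
of the RS thread pool, handles only the non-transposed row-major case (tiling
the rows of A), and the names gemmTile, tiledGemm and tileRows are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: computes rows [rowBegin, rowEnd) of C = A * B.
// All matrices are row-major; A is m x k, B is k x n, C is m x n.
static void gemmTile(const float* A, const float* B, float* C,
                     std::size_t rowBegin, std::size_t rowEnd,
                     std::size_t k, std::size_t n) {
    for (std::size_t i = rowBegin; i < rowEnd; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}

// Hypothetical driver: splits A along ONE dimension (its rows) into tiles
// and lets each worker thread claim the next unfinished tile.
std::vector<float> tiledGemm(const std::vector<float>& A,
                             const std::vector<float>& B,
                             std::size_t m, std::size_t k, std::size_t n,
                             std::size_t tileRows, unsigned numThreads) {
    assert(A.size() == m * k && B.size() == k * n);
    assert(tileRows > 0 && numThreads > 0);

    std::vector<float> C(m * n, 0.0f);
    const std::size_t numTiles = (m + tileRows - 1) / tileRows;
    std::atomic<std::size_t> nextTile{0};  // index of the next unclaimed tile

    auto worker = [&]() {
        for (;;) {
            // Atomically claim a tile; stop once all tiles are taken.
            const std::size_t t = nextTile.fetch_add(1);
            if (t >= numTiles) break;
            const std::size_t rowBegin = t * tileRows;
            const std::size_t rowEnd = std::min(m, rowBegin + tileRows);
            gemmTile(A.data(), B.data(), C.data(), rowBegin, rowEnd, k, n);
        }
    };

    // Assumption: plain threads stand in for the existing thread pool.
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < numThreads; ++i) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
    return C;
}
```

Because each tile writes a disjoint set of rows of C, the workers need no
locking beyond the atomic tile counter. A transposed A would presumably be
tiled along its other dimension so that each tile still maps to whole rows
of the result.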

Test: CTS test (rsblas) passes on Angler, Fugu and new devices.
      Performance test with RsBlasBenchmark and RsNeuralNet demo
      on Angler, Ryu, Seed, Shamu, Volantis, Fugu and new devices,
      showing roughly 70% (Volantis, 2 cores) to 400+% (Angler, 8 cores) perf gain.

Change-Id: If96f4119fd34d5d9d98a2542801495e7ffe577ae
(cherry picked from commit 41ab8faaf0d90238d42d8e2bbb7177467c10b4f6)
3 files changed
tree: e389fd2f60c8e4b943758d86c58002cbff9d97ec