| Kernels in gemmlowp |
| ******************* |
| |
| |
| Kernels provide an inner-loop implementation, and a format |
| ========================================================== |
| |
| Here we assume familiarity with the concepts of kernels and of packing |
| as explained in doc/design.txt. |
| |
| gemmlowp is designed to be easily extensible to different architectures and |
| other low-level details, while achieving high performance. Thus a line had to |
| be drawn between the generic GEMM code and the specific parts that need to |
| be manually designed for each architecture, etc. The design choice made in |
| gemmlowp is to have easily swappable GEMM kernels. |
| |
| In itself, a GEMM kernel is just an implementation of the inner-most loop |
| in a GEMM. (That inner-most loop has to be over the 'depth' dimension so as |
| to be able to accumulate into a small enough number of accumulators to fit |
| in registers.) |
| |
| Thus, by itself, a GEMM kernel should be just a function computing a block |
| of GEMM. |
| |
| However, GEMM kernels may need to differ not just in how they implement this |
| computation, but also in the format of data that they operate on. Indeed, |
| in order to maximize the ratio of arithmetic instructions to memory access |
| instructions, GEMM kernels want to handle blocks as wide as possible given |
| the number of registers of the CPU architecture. |
| |
| Thus, in order to allow efficient specialization to diverse architectures, |
| gemmlowp allows each GEMM kernel to dictate the format of data that it expects, |
| in addition to providing its inner-loop implementation. |
| |
| The former is given by a 'Format' typedef, and the latter by a 'Run' |
| method. |
| |
| A good example is in internal/kernel_neon.h, specifically the |
| NEONKernel12x4Depth2 kernel, which specifies its format as: |
| |
| typedef KernelFormat<KernelSideFormat<CellFormat<4, 2>, 3>, |
| KernelSideFormat<CellFormat<4, 2>, 1> > Format; |
| |
| The meaning of these terms is explained in the lengthy comment at the |
| top of internal/kernel.h. Here, they mean that this kernel handles at |
| each iteration (along the depth dimension): |
| - 3 'cells' of size 4x2 each of the lhs, so a total lhs block |
| of size 12x2 |
| - 1 'cell' of size 2x4 of the rhs. |
| In other words, this kernel handles 12 rows of the lhs and 4 columns of the |
| rhs, and handles two levels of depth at once. The 'cells' and 'CellFormat' |
| detail the layout of these 12x2 and 2x4 blocks. |
| |
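The arithmetic above can be checked at compile time. The following is a
minimal sketch, not the actual definitions from internal/kernel.h: the three
templates below are simplified stand-ins that reuse the names CellFormat,
KernelSideFormat and KernelFormat, and the members kWidth, kCells, kRows,
kCols, kDepth and kSize are assumptions made for this illustration.

```cpp
// Simplified stand-ins for the format templates (assumption: the real
// definitions in internal/kernel.h may carry more detail than shown here).
template <int tWidth, int tDepth>
struct CellFormat {
  static const int kWidth = tWidth;          // rows (lhs) or columns (rhs) per cell
  static const int kDepth = tDepth;          // levels of depth per cell
  static const int kSize = kWidth * kDepth;  // entries per cell
};

template <typename tCellFormat, int tCells>
struct KernelSideFormat {
  typedef tCellFormat Cell;
  static const int kCells = tCells;
  static const int kWidth = kCells * Cell::kWidth;  // total rows or columns
  static const int kDepth = Cell::kDepth;
};

template <typename tLhs, typename tRhs>
struct KernelFormat {
  typedef tLhs Lhs;
  typedef tRhs Rhs;
  static_assert(Lhs::kDepth == Rhs::kDepth, "lhs and rhs must agree on depth");
  static const int kRows = Lhs::kWidth;
  static const int kCols = Rhs::kWidth;
  static const int kDepth = Lhs::kDepth;
};

// The format quoted above for NEONKernel12x4Depth2.
typedef KernelFormat<KernelSideFormat<CellFormat<4, 2>, 3>,
                     KernelSideFormat<CellFormat<4, 2>, 1> > Format;

static_assert(Format::kRows == 12, "3 lhs cells of 4 rows each");
static_assert(Format::kCols == 4, "1 rhs cell of 4 columns");
static_assert(Format::kDepth == 2, "two levels of depth per iteration");
```

Because every quantity is a compile-time constant, generic code can be
templated on the format and still unroll its loops for each kernel.
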
| This kernel then loads these 12x2 and 2x4 blocks and computes the corresponding |
| 12x4 GEMM; for ease of reference let us paste the critical comment and code here: |
| |
| "loop_NEONKernel12x4Depth2_%=:\n" |
| |
| // Overview of register layout: |
| // |
| // A 2x4 cell of Rhs is stored in 16bit in d0--d1 (q0). |
| // A 12x2 block of 3 4x2 cells Lhs is stored in 16bit in d2--d7 |
| // (q1--q3). |
| // A 12x4 block of accumulators is stored in 32bit in q4--q15. |
| // |
| //                   +-----+-----+-----+-----+ |
| //                   |d0[0]|d0[1]|d0[2]|d0[3]| |
| //              Rhs  +-----+-----+-----+-----+ |
| //                   |d1[0]|d1[1]|d1[2]|d1[3]| |
| //                   +-----+-----+-----+-----+ |
| // |
| //                   |     |     |     |     | |
| // |
| //    Lhs            |     |     |     |     | |
| // |
| //  +--+--+ - - - -  +-----+-----+-----+-----+ |
| //  |d2|d3|          | q4  | q5  | q6  | q7  | |
| //  |d2|d3|          | q4  | q5  | q6  | q7  | |
| //  |d2|d3|          | q4  | q5  | q6  | q7  | |
| //  |d2|d3|          | q4  | q5  | q6  | q7  | |
| //  +--+--+ - - - -  +-----+-----+-----+-----+ |
| //  |d4|d5|          | q8  | q9  | q10 | q11 | |
| //  |d4|d5|          | q8  | q9  | q10 | q11 | |
| //  |d4|d5|          | q8  | q9  | q10 | q11 | |
| //  |d4|d5|          | q8  | q9  | q10 | q11 | |
| //  +--+--+ - - - -  +-----+-----+-----+-----+ |
| //  |d6|d7|          | q12 | q13 | q14 | q15 | |
| //  |d6|d7|          | q12 | q13 | q14 | q15 | |
| //  |d6|d7|          | q12 | q13 | q14 | q15 | |
| //  |d6|d7|          | q12 | q13 | q14 | q15 | |
| //  +--+--+ - - - -  +-----+-----+-----+-----+ |
| // |
| //                            Accumulator |
| |
| // Load 1 Rhs cell of size 2x4 |
| "vld1.8 {d0}, [%[rhs_ptr]:64]!\n" |
| |
| // Load 3 Lhs cells of size 4x2 each |
| "vld1.8 {d2}, [%[lhs_ptr]:64]!\n" |
| "vld1.8 {d4}, [%[lhs_ptr]:64]!\n" |
| "vld1.8 {d6}, [%[lhs_ptr]:64]!\n" |
| |
| // Expand Lhs/Rhs cells to 16 bit. |
| "vmovl.u8 q0, d0\n" |
| "vmovl.u8 q1, d2\n" |
| "vmovl.u8 q2, d4\n" |
| "vmovl.u8 q3, d6\n" |
| |
| // Multiply-accumulate, level of depth 0 |
| "vmlal.u16 q4, d2, d0[0]\n" |
| "vmlal.u16 q5, d2, d0[1]\n" |
| "vmlal.u16 q6, d2, d0[2]\n" |
| "vmlal.u16 q7, d2, d0[3]\n" |
| "vmlal.u16 q8, d4, d0[0]\n" |
| "vmlal.u16 q9, d4, d0[1]\n" |
| "vmlal.u16 q10, d4, d0[2]\n" |
| "vmlal.u16 q11, d4, d0[3]\n" |
| "vmlal.u16 q12, d6, d0[0]\n" |
| "vmlal.u16 q13, d6, d0[1]\n" |
| "vmlal.u16 q14, d6, d0[2]\n" |
| "vmlal.u16 q15, d6, d0[3]\n" |
| |
| // Multiply-accumulate, level of depth 1 |
| "vmlal.u16 q4, d3, d1[0]\n" |
| "vmlal.u16 q5, d3, d1[1]\n" |
| "vmlal.u16 q6, d3, d1[2]\n" |
| "vmlal.u16 q7, d3, d1[3]\n" |
| "vmlal.u16 q8, d5, d1[0]\n" |
| "vmlal.u16 q9, d5, d1[1]\n" |
| "vmlal.u16 q10, d5, d1[2]\n" |
| "vmlal.u16 q11, d5, d1[3]\n" |
| "vmlal.u16 q12, d7, d1[0]\n" |
| "vmlal.u16 q13, d7, d1[1]\n" |
| "vmlal.u16 q14, d7, d1[2]\n" |
| "vmlal.u16 q15, d7, d1[3]\n" |
| |
| // Loop. Decrement loop index (depth) by 2, since we just handled 2 |
| // levels of depth (Kernel::kDepth=2). |
| "subs %[run_depth], #2\n" |
| "bne loop_NEONKernel12x4Depth2_%=\n" |
| |
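For readers less familiar with NEON assembly, the following portable C++
sketch computes the same accumulation as the loop above. It is an
illustration, not code from gemmlowp: the function name
ReferenceKernel12x4Depth2 and the acc array are hypothetical, and the packed
layout (each cell stores its depth-0 entries first, then its depth-1 entries)
is inferred from the loads and the register diagram above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference for the 12x4, depth-2 inner loop.
// lhs: packed lhs, 24 bytes per iteration (3 cells of 4 rows x 2 depth levels).
// rhs: packed rhs, 8 bytes per iteration (1 cell of 2 depth levels x 4 columns).
// acc: 12x4 block of 32-bit accumulators, playing the role of q4--q15.
void ReferenceKernel12x4Depth2(const std::uint8_t* lhs, const std::uint8_t* rhs,
                               int run_depth, std::int32_t acc[12][4]) {
  // The loop consumes two levels of depth per iteration and runs at least once.
  assert(run_depth > 0 && run_depth % 2 == 0);
  for (int d = 0; d < run_depth; d += 2) {
    for (int level = 0; level < 2; ++level) {  // depth level within this iteration
      for (int cell = 0; cell < 3; ++cell) {   // lhs cells: d2/d3, d4/d5, d6/d7
        for (int r = 0; r < 4; ++r) {
          // Widen before multiplying, in the spirit of vmovl followed by vmlal.
          const std::int32_t l = lhs[cell * 8 + level * 4 + r];
          for (int c = 0; c < 4; ++c) {
            acc[cell * 4 + r][c] +=
                l * static_cast<std::int32_t>(rhs[level * 4 + c]);
          }
        }
      }
    }
    lhs += 24;  // three 8-byte lhs cells were consumed
    rhs += 8;   // one 8-byte rhs cell was consumed
  }
}
```

A scalar reference of this kind is mostly useful as a test oracle when
writing or modifying an optimized kernel.
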
| |
| |
| Packing code adapts to the format chosen by the kernel |
| ====================================================== |
| |
| As explained in doc/design.txt, gemmlowp starts by packing blocks of the |
| lhs and rhs matrices for optimally efficient traversal by the kernel. This |
| depends on fine details of the kernel format, in ways that can only be |
| efficiently handled by knowing these kernel format details at compile-time. |
| |
| This is the reason why all the code in internal/pack.h is templated on |
| the corresponding kernel format. |
| |
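As a rough illustration of why the format has to be known at compile time,
here is a minimal sketch of a packing routine templated on one side of a
kernel format. It is not the code from internal/pack.h, which handles more
than this sketch does: SideFormat and PackSide are hypothetical names, and
the cell layout (all entries of one depth level stored contiguously within a
cell) is an assumption chosen to match the 12x4 kernel shown earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical compile-time description of one side of a kernel format.
template <int tCellWidth, int tCellDepth, int tCells>
struct SideFormat {
  static const int kCellWidth = tCellWidth;
  static const int kCellDepth = tCellDepth;
  static const int kCells = tCells;
  static const int kWidth = tCellWidth * tCells;
};

// Packs a (kWidth x depth) source block into the order the kernel reads it.
// src is indexed as src[w * src_stride + d]: w along the width, d along depth.
template <typename Format>
std::vector<std::uint8_t> PackSide(const std::uint8_t* src, int src_stride,
                                   int depth) {
  assert(depth % Format::kCellDepth == 0);
  std::vector<std::uint8_t> packed;
  packed.reserve(static_cast<std::size_t>(Format::kWidth) * depth);
  // One outer step per kernel iteration along the depth dimension.
  for (int d0 = 0; d0 < depth; d0 += Format::kCellDepth) {
    for (int cell = 0; cell < Format::kCells; ++cell) {
      for (int d = 0; d < Format::kCellDepth; ++d) {
        for (int w = 0; w < Format::kCellWidth; ++w) {
          const int row = cell * Format::kCellWidth + w;
          packed.push_back(src[row * src_stride + d0 + d]);
        }
      }
    }
  }
  return packed;
}
```

With SideFormat<4, 2, 3>, each kernel iteration finds its 24 lhs bytes
contiguous in memory, so the kernel can read them with plain sequential loads.
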
| The code in internal/pack.h isn't tightly optimized by itself, but it is |
| structured in such a way that the critical code is in a template, |
| PackingRegisterBlock, that can easily be specialized to override the slow |
| generic code with fast specific packing code for specific formats, on |
| specific platforms. |
| |
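The override mechanism is ordinary C++ template specialization. The sketch
below shows the pattern only: ExampleFormat and ExamplePacker are hypothetical
names, not the actual PackingRegisterBlock interface, and the "fast" version
is unrolled by hand where a real specialization would more likely use SIMD
intrinsics or assembly.

```cpp
#include <cstdint>

// Hypothetical format tag: one cell of kWidth entries by kDepth levels.
template <int tWidth, int tDepth>
struct ExampleFormat {
  static const int kWidth = tWidth;
  static const int kDepth = tDepth;
};

// Generic version: works for any format, one entry at a time.
// Reads src[w * kDepth + d] and writes dst[d * kWidth + w].
template <typename Format>
struct ExamplePacker {
  static const bool kIsSpecialized = false;
  static void Pack(const std::uint8_t* src, std::uint8_t* dst) {
    for (int d = 0; d < Format::kDepth; ++d) {
      for (int w = 0; w < Format::kWidth; ++w) {
        dst[d * Format::kWidth + w] = src[w * Format::kDepth + d];
      }
    }
  }
};

// Specialization for the 4x2 cell: same result as the generic code, but
// written for this one format only, so it can be tuned freely.
template <>
struct ExamplePacker<ExampleFormat<4, 2> > {
  static const bool kIsSpecialized = true;
  static void Pack(const std::uint8_t* src, std::uint8_t* dst) {
    dst[0] = src[0]; dst[1] = src[2]; dst[2] = src[4]; dst[3] = src[6];
    dst[4] = src[1]; dst[5] = src[3]; dst[6] = src[5]; dst[7] = src[7];
  }
};
```

Callers write ExamplePacker<SomeFormat>::Pack(src, dst) in both cases; the
compiler picks the specialization when one exists for that format.
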
| See internal/pack_neon.h which provides NEON specializations of the |
| packing code for the particular kernel formats that are used by the NEON |
| kernels in internal/kernel_neon.h. |
| |
| |
| Wrapping up: how to optimize gemmlowp for a CPU architecture |
| ============================================================ |
| |
| In conclusion, the key feature of gemmlowp when it comes to efficiently |
| supporting a specific CPU architecture is that it lets you freely replace |
| the inner loop of the GEMM by providing your own GEMM kernel, which is |
| also free to dictate its required data layout; each data layout then also |
| needs optimized packing code. The steps are thus: |
| 1) Design a GEMM kernel, with a data layout of your choosing. |
| 2) Implement the GEMM kernel, similar to internal/kernel_neon.h. |
| 3) Implement the optimized packing code, similar to internal/pack_neon.h. |