Rebase gemmlowp to 36ffd29
am: a9fd919a00

Change-Id: Ie6a6983666b6968b845b64759185224ef9bfa388
diff --git a/.gitignore b/.gitignore
index 4ff62a0..28277a8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,4 +4,7 @@
 **/.DS_Store
 ?
 ??
-???
+*binary*
+/.idea/
+CMakeLists.txt
+/bazel-*
diff --git a/doc/design.md b/doc/design.md
new file mode 100644
index 0000000..c924680
--- /dev/null
+++ b/doc/design.md
@@ -0,0 +1,165 @@
+# Overview of gemmlowp design
+
+## Primer on GEMM, kernels, and cache friendliness
+
+gemmlowp, like most GEMMs, implements the straightforward matrix multiplication
+algorithm, which takes n^3 multiply-accumulate instructions for n*n sized
+matrices. Because the arithmetic complexity grows quicker than the memory
+complexity (n^3 vs. n^2), memory accesses are redundant (each matrix entry is
+accessed n times). A large part of a GEMM's performance and design goes toward
+minimizing the inefficiency resulting from these redundant memory accesses.
+
+Ultimately, once values are loaded into CPU registers, they cost nothing to
+access, so as long as we can work within registers, this problem doesn't exist.
+Thus, in order to be efficient, a GEMM's inner loops must wisely use the
+available registers to do as much arithmetic work as possible before loading
+more data from memory into registers. This means that a GEMM implementation
+needs to have architecture-specific inner loops tailored for architecture
+details such as the number of registers, and typically written in assembly. This
+'inner loops' architecture-specific component is referred to as the GEMM kernel.
+(More details about kernels are in [kernel.md](kernel.md)).
+
+However, only small blocks can fit at a given time in registers, so at larger
+scales one needs to repeatedly load blocks of matrices from memory, and these
+accesses are redundant for the reason outlined above. The way that one minimizes
+the resulting inefficiency is by organizing for cache locality, so that most of
+these accesses hit the L1 cache, and most of the remaining ones hit the L2
+cache, etc.
+
+This is achieved by subdividing the matrices into blocks sized to fit in L2
+cache, and subdividing these blocks into sub-blocks sized to fit in L1 cache,
+and performing the matrix multiplication one such block at a time.
+
+In practice, it tends to pay off to "pack" input blocks for optimally efficient
+traversal by the kernel, since they will be traversed multiple times. "packing"
+means at least reordering the data layout for 1) simple access patterns that fit
+the CPU's cache behavior (in particular, the cache line size), and 2) simple
+loading into SIMD vector registers by the kernel.
+
+So a typical GEMM, in pseudo-code, tends to look like this:
+
+```
+allocate(some_lhs_L2_block);
+allocate(some_rhs_L2_block);
+for (some_lhs_L2_block) {
+  pack(some_lhs_L2_block);
+  for (some_rhs_L2_block) {
+    pack(some_rhs_L2_block);
+    for (some_lhs_sub_block in some_lhs_L2_block) {
+      for (some_rhs_sub_block in some_rhs_L2_block) {
+        kernel(some_lhs_sub_block, some_rhs_sub_block);
+      }
+    }
+  }
+}
+```
+
+## Impact of low-precision computation on gemmlowp design
+
+Refer to [low-precision.md](low-precision.md) for specifics of the
+low-precision-computation paradigm and how it's implemented in gemmlowp.
+
+Inputs and outputs are matrices of uint8 values, but internally we are
+accumulating int32 values, only converting them back to uint8 at the end. This
+means that we need to store a block of int32 accumulators at a time. We compute
+a block of the result in int32 accumulators and then we "unpack" it into the
+destination matrix at once. In this way, we minimize the amount of memory used
+to store int32 values at a given time.
+
+Because of that, besides the "pack" and "kernel" stages outlined above, a third
+stage is needed in gemmlowp, which we call "unpack". Thus we arrive at the
+3-stage computation scheme that gemmlowp uses:
+
+1.  Pack lhs/rhs blocks from the input matrices.
+2.  Compute the product of the packed blocks, using the kernel.
+3.  Unpack the result block into the output matrix.
+
+The pseudo-code overview of gemmlowp now looks like:
+
+```
+allocate(some_lhs_L2_block);
+allocate(some_rhs_L2_block);
+// new: temp storage for int32 accums
+allocate(some_int32_accumulators_block);
+for (some_lhs_L2_block) {
+  pack(some_lhs_L2_block);
+  for (some_rhs_L2_block) {
+    pack(some_rhs_L2_block);
+    for (some_lhs_sub_block in some_lhs_L2_block) {
+      for (some_rhs_sub_block in some_rhs_L2_block) {
+        // new: pass int32 accums to kernel
+        kernel(&some_int32_accumulators_block,
+               some_lhs_sub_block,
+               some_rhs_sub_block);
+      }
+    }
+    // new: unpack int32 accums into destination matrix
+    unpack(some_int32_accumulators_block);
+  }
+}
+```
+
+## Exploring gemmlowp code
+
+The design outlined above can be readily matched to gemmlowp source code, in
+particular in this file, which gives a simple GEMM implementation fitting in one
+rather small function:
+
+```
+internal/single_thread_gemm.h
+```
+
+The reader can compare the above pseudo-code to the actual code in this file:
+
+```
+for (int r = 0; r < rows; r += block_params.l2_rows) {
+  int rs = std::min(block_params.l2_rows, rows - r);
+
+  PackLhs(&packed_lhs, lhs.block(r, 0, rs, depth));
+
+  for (int c = 0; c < cols; c += block_params.l2_cols) {
+    int cs = std::min(block_params.l2_cols, cols - c);
+
+    if (!pack_rhs_once) {
+      PackRhs(&packed_rhs, rhs.block(0, c, depth, cs));
+    }
+
+    Compute(kernel, block_params, &packed_result, packed_lhs, packed_rhs);
+
+    auto result_block = result->block(r, c, rs, cs);
+    UnpackResult(&result_block, packed_result, packed_lhs, packed_rhs, depth,
+                 result_offset, result_mult_int, result_shift);
+  }
+}
+```
+
+The files in `internal/` fall into a few categories:
+
+There are two top-level GEMM implementations,
+
+*   [internal/single_thread_gemm.h](../internal/single_thread_gemm.h)
+*   [internal/multi_thread_gemm.h](../internal/multi_thread_gemm.h)
+
+They both call into pack/compute/unpack stages (see [kernel.md](kernel.md) and
+[packing.md](packing.md)) implemented in the following files:
+
+*   [internal/pack.h](../internal/pack.h)
+*   [internal/compute.h](../internal/compute.h)
+*   [internal/unpack.h](../internal/unpack.h)
+    *   This in turn calls into [internal/output.h](../internal/output.h) for
+        the output pipeline (see [output.md](output.md))
+
+The pack.h and unpack.h files contain generic templated code that can be
+overridden by optimized code in template specializations; for example, see the
+NEON optimized code here:
+
+*   [internal/pack_neon.h](../internal/pack_neon.h)
+*   [internal/unpack_neon.h](../internal/unpack_neon.h)
+    *   This in turn calls into
+        [internal/output_neon.h](../internal/output_neon.h)
+
+The compute stage contains generic code in compute.h that only calls into
+optimized code through the Kernel::Run() entry point. Each kernel is basically
+just a struct offering a Run() implementation; see the NEON kernels in:
+
+*   [internal/kernel_neon.h](../internal/kernel_neon.h)
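+
+To give a concrete idea of what the compute stage sees, here is a simplified,
+illustrative sketch of a kernel as a C++ struct; the names and the Run()
+signature below are not the actual declarations in internal/kernel.h:
+
+```
+#include <cstdint>
+
+// Illustrative sketch only: a kernel bundles a compile-time description of
+// the packed data layout it expects (its Format) with an inner-loop
+// implementation (Run).
+struct SomeKernel {
+  // typedef KernelFormat<...> Format;  // see kernel.md
+
+  // Accumulates the product of packed lhs and rhs blocks into int32
+  // accumulators, 'run_depth' levels of depth at a time. In real kernels,
+  // the body is architecture-specific assembly or intrinsics.
+  void Run(std::int32_t* dst_ptr, const std::uint8_t* lhs_ptr,
+           const std::uint8_t* rhs_ptr, int run_depth) const {
+    // ... inner loop over the depth dimension ...
+  }
+};
+```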
diff --git a/doc/design.txt b/doc/design.txt
new file mode 100644
index 0000000..cb78dbb
--- /dev/null
+++ b/doc/design.txt
@@ -0,0 +1,158 @@
+                        Overview of gemmlowp design
+                        ***************************
+
+
+Primer on GEMM, kernels, and cache friendliness
+===============================================
+
+gemmlowp, like most GEMMs, implements the straightforward matrix multiplication
+algorithm, which takes n^3 multiply-accumulate instructions for n*n sized
+matrices. Because the arithmetic complexity grows quicker than the memory
+complexity (n^3 vs. n^2), memory accesses are redundant (each matrix entry
+is accessed n times). A large part of a GEMM's performance and design goes
+toward minimizing the inefficiency resulting from these redundant memory
+accesses.
+
+Ultimately, once values are loaded into CPU registers, they cost nothing to
+access, so as long as we can work within registers, this problem doesn't exist.
+Thus, in order to be efficient, a GEMM's inner loops must wisely use the
+available registers to do as much arithmetic work as possible before loading
+more data from memory into registers. This means that
+a GEMM implementation needs to have architecture-specific inner loops tailored
+for architecture details such as the number of registers, and typically written
+in assembly. This 'inner loops' architecture-specific component is referred to
+as the GEMM kernel. (More details about kernels are in doc/kernels.txt).
+
+However, only small blocks can fit at a given time in registers, so at larger
+scales one needs to repeatedly load blocks of matrices from memory, and
+these accesses are redundant for the reason outlined above. The way that
+one minimizes the resulting inefficiency is by organizing for cache locality,
+so that most of these accesses hit the L1 cache, and most of the remaining
+ones hit the L2 cache, etc.
+
+This is achieved by subdividing the matrices into blocks sized to fit in L2
+cache, and subdividing these blocks into sub-blocks sized to fit in L1 cache,
+and performing the matrix multiplication one such block at a time.
+
+In practice, it tends to pay off to "pack" input blocks for optimally
+efficient traversal by the kernel, since they will be traversed multiple times.
+"packing" means at least reordering the data layout for 1) simple access
+patterns that fit the CPU's cache behavior (in particular, the cache line size),
+and 2) simple loading into SIMD vector registers by the kernel.
+
+So a typical GEMM, in pseudo-code, tends to look like this:
+
+allocate(some_lhs_L2_block);
+allocate(some_rhs_L2_block);
+for (some_lhs_L2_block) {
+  pack(some_lhs_L2_block);
+  for (some_rhs_L2_block) {
+    pack(some_rhs_L2_block);
+    for (some_lhs_sub_block in some_lhs_L2_block) {
+      for (some_rhs_sub_block in some_rhs_L2_block) {
+        kernel(some_lhs_sub_block, some_rhs_sub_block);
+      }
+    }
+  }
+}
+
+
+Impact of low-precision computation on gemmlowp design
+======================================================
+
+Refer to doc/low-precision.txt for specifics of the low-precision-computation
+paradigm and how it's implemented in gemmlowp.
+
+Inputs and outputs are matrices of uint8 values, but internally we are
+accumulating int32 values, only converting them back to uint8 at the end. This
+means that we need to store a block of int32 accumulators at a time. We compute
+a block of the result in int32 accumulators and then we "unpack" it into the
+destination matrix at once. In this way, we minimize the amount of memory used to
+store int32 values at a given time.
+
+Because of that, besides the "pack" and "kernel" stages outlined above, a third
+stage is needed in gemmlowp, which we call "unpack". Thus we arrive at the
+3-stage computation scheme that gemmlowp uses:
+
+  1. Pack lhs/rhs blocks from the input matrices.
+  2. Compute the product of the packed blocks, using the kernel.
+  3. Unpack the result block into the output matrix.
+
+The pseudo-code overview of gemmlowp now looks like:
+
+allocate(some_lhs_L2_block);
+allocate(some_rhs_L2_block);
+// new: temp storage for int32 accums
+allocate(some_int32_accumulators_block);
+for (some_lhs_L2_block) {
+  pack(some_lhs_L2_block);
+  for (some_rhs_L2_block) {
+    pack(some_rhs_L2_block);
+    for (some_lhs_sub_block in some_lhs_L2_block) {
+      for (some_rhs_sub_block in some_rhs_L2_block) {
+        // new: pass int32 accums to kernel
+        kernel(&some_int32_accumulators_block,
+               some_lhs_sub_block,
+               some_rhs_sub_block);
+      }
+    }
+    // new: unpack int32 accums into destination matrix
+    unpack(some_int32_accumulators_block);
+  }
+}
+
+
+Exploring gemmlowp code
+=======================
+
+The design outlined above can be readily matched to gemmlowp source code,
+in particular in this file, which gives a simple GEMM implementation fitting in
+one rather small function:
+
+  internal/single_thread_gemm.h
+
+The reader can compare the above pseudo-code to the actual code in this file:
+
+  for (int r = 0; r < rows; r += block_params.l2_rows) {
+    int rs = std::min(block_params.l2_rows, rows - r);
+
+    PackLhs(&packed_lhs, lhs.block(r, 0, rs, depth));
+
+    for (int c = 0; c < cols; c += block_params.l2_cols) {
+      int cs = std::min(block_params.l2_cols, cols - c);
+
+      if (!pack_rhs_once) {
+        PackRhs(&packed_rhs, rhs.block(0, c, depth, cs));
+      }
+
+      Compute(kernel, block_params, &packed_result, packed_lhs, packed_rhs);
+
+      auto result_block = result->block(r, c, rs, cs);
+      UnpackResult(&result_block, packed_result, packed_lhs, packed_rhs, depth,
+                   result_offset, result_mult_int, result_shift);
+    }
+  }
+
+The files in internal/ fall into a few categories:
+
+There are two top-level GEMM implementations,
+  single_thread_gemm.h
+  multi_thread_gemm.h
+
+They both call into pack/compute/unpack stages implemented in the following files:
+  pack.h
+  compute.h
+  unpack.h
+
+The pack.h and unpack.h files contain generic templated code that can be overridden
+by optimized code in template specializations; see the NEON optimized code here:
+  pack_neon.h
+  unpack_neon.h
+
+The compute stage contains generic code in compute.h that only calls into
+optimized code through the Kernel::Run() entry point. Each kernel is basically just
+a struct offering a Run() implementation; see the NEON kernels in:
+  kernel_neon.h
+
+More details about the interplay between these components can be found in this file:
+  doc/kernels.txt
diff --git a/doc/kernel.md b/doc/kernel.md
new file mode 100644
index 0000000..261cb92
--- /dev/null
+++ b/doc/kernel.md
@@ -0,0 +1,172 @@
+# Kernels in gemmlowp
+
+## Kernels provide an inner-loop implementation, and a format
+
+Here we assume familiarity with the concepts of kernels and of packing as
+explained in [design.md](design.md).
+
+gemmlowp is designed to be easily extensible to different architectures and
+other low-level details, while achieving high performance. Thus a line had to be
+drawn between the generic GEMM code and the specific parts that need to be
+manually designed for each architecture, etc. The design choice made in gemmlowp
+is to have easily swappable GEMM kernels.
+
+In itself, a GEMM kernel is just an implementation of the inner-most loop in a
+GEMM (that inner-most loop has to be over the 'depth' dimension, so as to
+accumulate into a small enough number of accumulators to fit in registers).
+
+Thus, by itself, a GEMM kernel should be just a function computing a block of
+GEMM.
+
+However, GEMM kernels may need to differ not just in how they implement this
+computation, but also in the format of data that they operate on. Indeed, in
+order to maximize the ratio of arithmetic instructions to memory access
+instructions, GEMM kernels want to handle blocks as wide as possible given the
+number of registers of the CPU architecture.
+
+Thus, in order to allow efficient specialization to diverse architectures,
+gemmlowp allows each GEMM kernel to dictate the format of data that it expects,
+in addition to providing its inner-loop implementation.
+
+The former is given by a 'Format' typedef, and the latter by a 'Run' method.
+
+A good example is to look at internal/kernel_neon.h, and specifically at the
+NEONKernel12x4Depth2 kernel, which specifies its format as
+
+```
+  typedef KernelFormat<KernelSideFormat<CellFormat<4, 2>, 3>,
+                       KernelSideFormat<CellFormat<4, 2>, 1> > Format;
+```
+
+The meaning of these terms is explained in the lengthy comment at the top of
+internal/kernel.h. Here, they mean that this kernel handles at each iteration
+(along the depth dimension):
+
+*   3 'cells' of size 4x2 each of the lhs, so a total lhs block of size 12x2;
+*   1 'cell' of size 2x4 of the rhs.
+
+In other words, this kernel handles 12 rows of the lhs and 4 columns of the
+rhs, and handles two levels of depth at once. The 'cells' and `CellFormat`
+detail the layout of these 12x2 and 2x4 blocks.
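+
+As a rough illustration of how such a format description translates into block
+dimensions, here is a simplified sketch (not the actual definitions in
+internal/kernel.h, which carry more information):
+
+```
+// Simplified, illustrative sketch only.
+template <int tWidth, int tDepth>
+struct CellFormat {
+  static const int kWidth = tWidth;
+  static const int kDepth = tDepth;
+};
+
+template <typename tCellFormat, int tCells>
+struct KernelSideFormat {
+  typedef tCellFormat Cell;
+  static const int kCells = tCells;
+  // Total width handled on this side: e.g. 3 cells of width 4 give the
+  // 12 rows of the lhs block handled by NEONKernel12x4Depth2.
+  static const int kWidth = kCells * Cell::kWidth;
+};
+```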
+
+This kernel then loads these 12x2 and 2x4 blocks and computes the corresponding
+12x4 GEMM; for ease of reference let us paste the critical comment and code
+here:
+
+```
+"loop_NEONKernel12x4Depth2_%=:\n"
+
+// Overview of register layout:
+//
+// A 2x4 cell of Rhs is stored in 16bit in d0--d1 (q0).
+// A 12x2 block of 3 4x2 cells Lhs is stored in 16bit in d2--d7
+// (q1--q3).
+// A 12x4 block of accumulators is stored in 32bit in q4--q15.
+//
+//                   +-----+-----+-----+-----+
+//                   |d0[0]|d0[1]|d0[2]|d0[3]|
+//              Rhs  +-----+-----+-----+-----+
+//                   |d1[0]|d1[1]|d1[2]|d1[3]|
+//                   +-----+-----+-----+-----+
+//
+//                   |     |     |     |     |
+//
+//    Lhs            |     |     |     |     |
+//
+//  +--+--+ - - - -  +-----+-----+-----+-----+
+//  |d2|d3|          | q4  | q5  | q6  | q7  |
+//  |d2|d3|          | q4  | q5  | q6  | q7  |
+//  |d2|d3|          | q4  | q5  | q6  | q7  |
+//  |d2|d3|          | q4  | q5  | q6  | q7  |
+//  +--+--+ - - - -  +-----+-----+-----+-----+
+//  |d4|d5|          | q8  | q9  | q10 | q11 |
+//  |d4|d5|          | q8  | q9  | q10 | q11 |
+//  |d4|d5|          | q8  | q9  | q10 | q11 |
+//  |d4|d5|          | q8  | q9  | q10 | q11 |
+//  +--+--+ - - - -  +-----+-----+-----+-----+
+//  |d6|d7|          | q12 | q13 | q14 | q15 |
+//  |d6|d7|          | q12 | q13 | q14 | q15 |
+//  |d6|d7|          | q12 | q13 | q14 | q15 |
+//  |d6|d7|          | q12 | q13 | q14 | q15 |
+//  +--+--+ - - - -  +-----+-----+-----+-----+
+//
+//                            Accumulator
+
+// Load 1 Rhs cell of size 2x4
+"vld1.8 {d0}, [%[rhs_ptr]:64]!\n"
+
+// Load 3 Lhs cells of size 4x2 each
+"vld1.8 {d2}, [%[lhs_ptr]:64]!\n"
+"vld1.8 {d4}, [%[lhs_ptr]:64]!\n"
+"vld1.8 {d6}, [%[lhs_ptr]:64]!\n"
+
+// Expand Lhs/Rhs cells to 16 bit.
+"vmovl.u8 q0, d0\n"
+"vmovl.u8 q1, d2\n"
+"vmovl.u8 q2, d4\n"
+"vmovl.u8 q3, d6\n"
+
+// Multiply-accumulate, level of depth 0
+"vmlal.u16 q4, d2, d0[0]\n"
+"vmlal.u16 q5, d2, d0[1]\n"
+"vmlal.u16 q6, d2, d0[2]\n"
+"vmlal.u16 q7, d2, d0[3]\n"
+"vmlal.u16 q8, d4, d0[0]\n"
+"vmlal.u16 q9, d4, d0[1]\n"
+"vmlal.u16 q10, d4, d0[2]\n"
+"vmlal.u16 q11, d4, d0[3]\n"
+"vmlal.u16 q12, d6, d0[0]\n"
+"vmlal.u16 q13, d6, d0[1]\n"
+"vmlal.u16 q14, d6, d0[2]\n"
+"vmlal.u16 q15, d6, d0[3]\n"
+
+// Multiply-accumulate, level of depth 1
+"vmlal.u16 q4, d3, d1[0]\n"
+"vmlal.u16 q5, d3, d1[1]\n"
+"vmlal.u16 q6, d3, d1[2]\n"
+"vmlal.u16 q7, d3, d1[3]\n"
+"vmlal.u16 q8, d5, d1[0]\n"
+"vmlal.u16 q9, d5, d1[1]\n"
+"vmlal.u16 q10, d5, d1[2]\n"
+"vmlal.u16 q11, d5, d1[3]\n"
+"vmlal.u16 q12, d7, d1[0]\n"
+"vmlal.u16 q13, d7, d1[1]\n"
+"vmlal.u16 q14, d7, d1[2]\n"
+"vmlal.u16 q15, d7, d1[3]\n"
+
+// Loop. Decrement loop index (depth) by 2, since we just handled 2
+// levels of depth (Kernel::kDepth=2).
+"subs %[run_depth], #2\n"
+"bne loop_NEONKernel12x4Depth2_%=\n"
+```
+
+## Packing code adapts to the format chosen by the kernel
+
+As explained in [design.md](design.md), gemmlowp starts by packing blocks of the
+lhs and rhs matrices for optimally efficient traversal by the kernel. This
+depends on fine details of the kernel format, in ways that can only be
+efficiently handled by knowing these kernel format details at compile-time.
+
+This is the reason why all the code in [internal/pack.h](../internal/pack.h) is
+templated in the corresponding kernel format.
+
+The code in internal/pack.h isn't tightly optimized by itself, but it is
+structured in such a way that the critical code is in a template,
+`PackingRegisterBlock`, that can easily be specialized to override the slow
+generic code with fast specific packing code for specific formats, on specific
+platforms.
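+
+To illustrate just the specialization mechanism (the actual
+`PackingRegisterBlock` template takes different parameters), a format-specific
+override might look like the following sketch:
+
+```
+#include <cstdint>
+
+// Hypothetical format tags, standing in for the real ones in internal/kernel.h.
+template <int tWidth, int tDepth> struct CellFormat {};
+template <typename tCell, int tCells> struct KernelSideFormat {};
+
+// Slow, generic packing code: works for any format.
+template <typename tSideFormat>
+struct PackSideBlock {
+  static void Pack(std::uint8_t* dst, const std::uint8_t* src, int n) {
+    for (int i = 0; i < n; i++) dst[i] = src[i];  // placeholder reordering
+  }
+};
+
+// Specialization for one specific kernel format; this is where NEON
+// intrinsics or assembly would go.
+template <>
+struct PackSideBlock<KernelSideFormat<CellFormat<4, 2>, 3> > {
+  static void Pack(std::uint8_t* dst, const std::uint8_t* src, int n) {
+    for (int i = 0; i < n; i++) dst[i] = src[i];  // fast path in reality
+  }
+};
+```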
+
+See [internal/pack_neon.h](../internal/pack_neon.h) which provides NEON
+specializations of the packing code for the particular kernel formats that are
+used by the NEON kernels in [internal/kernel_neon.h](../internal/kernel_neon.h).
+
+## Wrapping up: how to optimize gemmlowp for a CPU architecture
+
+In conclusion, the key feature of gemmlowp when it comes to efficiently
+supporting a specific CPU architecture is that it allows one to freely replace
+the inner loop of the GEMM by providing one's own GEMM kernel, which is also
+free to dictate its required data layout; each data layout then also needs
+optimized packing code. The steps are thus:
+
+1.  Freely design a GEMM kernel with a freely chosen data layout.
+2.  Implement the GEMM kernel, similar to
+    [internal/kernel_neon.h](../internal/kernel_neon.h).
+3.  Implement the optimized packing code, similar to
+    [internal/pack_neon.h](../internal/pack_neon.h).
diff --git a/doc/kernels.txt b/doc/kernels.txt
new file mode 100644
index 0000000..43dcb40
--- /dev/null
+++ b/doc/kernels.txt
@@ -0,0 +1,176 @@
+                        Kernels in gemmlowp
+                        *******************
+
+
+Kernels provide an inner-loop implementation, and a format
+==========================================================
+
+Here we assume familiarity with the concepts of kernels and of packing
+as explained in doc/design.txt.
+
+gemmlowp is designed to be easily extensible to different architectures and
+other low-level details, while achieving high performance. Thus a line had to
+be drawn between the generic GEMM code and the specific parts that need to
+be manually designed for each architecture, etc. The design choice made in
+gemmlowp is to have easily swappable GEMM kernels.
+
+In itself, a GEMM kernel is just an implementation of the inner-most loop
+in a GEMM (that inner-most loop has to be over the 'depth' dimension, so as
+to accumulate into a small enough number of accumulators to fit
+in registers).
+
+Thus, by itself, a GEMM kernel should be just a function computing a block
+of GEMM.
+
+However, GEMM kernels may need to differ not just in how they implement this
+computation, but also in the format of data that they operate on. Indeed,
+in order to maximize the ratio of arithmetic instructions to memory access
+instructions, GEMM kernels want to handle blocks as wide as possible given
+the number of registers of the CPU architecture.
+
+Thus, in order to allow efficient specialization to diverse architectures,
+gemmlowp allows each GEMM kernel to dictate the format of data that it expects,
+in addition to providing its inner-loop implementation.
+
+The former is given by a 'Format' typedef, and the latter by a 'Run'
+method.
+
+A good example is to look at internal/kernel_neon.h, and specifically at
+the NEONKernel12x4Depth2 kernel, which specifies its format as
+
+  typedef KernelFormat<KernelSideFormat<CellFormat<4, 2>, 3>,
+                       KernelSideFormat<CellFormat<4, 2>, 1> > Format;
+
+The meaning of these terms is explained in the lengthy comment at the
+top of internal/kernel.h. Here, they mean that this kernel handles at
+each iteration (along the depth dimension):
+  - 3 'cells' of size 4x2 each of the lhs, so a total lhs block
+    of size 12x2
+  - 1 'cell' of size 2x4 of the rhs.
+In other words, this kernel handles 12 rows of the lhs and 4 columns of the
+rhs, and handles two levels of depth at once. The 'cells' and 'CellFormat'
+detail the layout of these 12x2 and 2x4 blocks.
+
+This kernel then loads these 12x2 and 2x4 blocks and computes the corresponding
+12x4 GEMM; for ease of reference let us paste the critical comment and code here:
+
+      "loop_NEONKernel12x4Depth2_%=:\n"
+
+        // Overview of register layout:
+        //
+        // A 2x4 cell of Rhs is stored in 16bit in d0--d1 (q0).
+        // A 12x2 block of 3 4x2 cells Lhs is stored in 16bit in d2--d7
+        // (q1--q3).
+        // A 12x4 block of accumulators is stored in 32bit in q4--q15.
+        //
+        //                   +-----+-----+-----+-----+
+        //                   |d0[0]|d0[1]|d0[2]|d0[3]|
+        //              Rhs  +-----+-----+-----+-----+
+        //                   |d1[0]|d1[1]|d1[2]|d1[3]|
+        //                   +-----+-----+-----+-----+
+        //
+        //                   |     |     |     |     |
+        //
+        //    Lhs            |     |     |     |     |
+        //
+        //  +--+--+ - - - -  +-----+-----+-----+-----+
+        //  |d2|d3|          | q4  | q5  | q6  | q7  |
+        //  |d2|d3|          | q4  | q5  | q6  | q7  |
+        //  |d2|d3|          | q4  | q5  | q6  | q7  |
+        //  |d2|d3|          | q4  | q5  | q6  | q7  |
+        //  +--+--+ - - - -  +-----+-----+-----+-----+
+        //  |d4|d5|          | q8  | q9  | q10 | q11 |
+        //  |d4|d5|          | q8  | q9  | q10 | q11 |
+        //  |d4|d5|          | q8  | q9  | q10 | q11 |
+        //  |d4|d5|          | q8  | q9  | q10 | q11 |
+        //  +--+--+ - - - -  +-----+-----+-----+-----+
+        //  |d6|d7|          | q12 | q13 | q14 | q15 |
+        //  |d6|d7|          | q12 | q13 | q14 | q15 |
+        //  |d6|d7|          | q12 | q13 | q14 | q15 |
+        //  |d6|d7|          | q12 | q13 | q14 | q15 |
+        //  +--+--+ - - - -  +-----+-----+-----+-----+
+        //
+        //                            Accumulator
+
+        // Load 1 Rhs cell of size 2x4
+        "vld1.8 {d0}, [%[rhs_ptr]:64]!\n"
+
+        // Load 3 Lhs cells of size 4x2 each
+        "vld1.8 {d2}, [%[lhs_ptr]:64]!\n"
+        "vld1.8 {d4}, [%[lhs_ptr]:64]!\n"
+        "vld1.8 {d6}, [%[lhs_ptr]:64]!\n"
+
+        // Expand Lhs/Rhs cells to 16 bit.
+        "vmovl.u8 q0, d0\n"
+        "vmovl.u8 q1, d2\n"
+        "vmovl.u8 q2, d4\n"
+        "vmovl.u8 q3, d6\n"
+
+        // Multiply-accumulate, level of depth 0
+        "vmlal.u16 q4, d2, d0[0]\n"
+        "vmlal.u16 q5, d2, d0[1]\n"
+        "vmlal.u16 q6, d2, d0[2]\n"
+        "vmlal.u16 q7, d2, d0[3]\n"
+        "vmlal.u16 q8, d4, d0[0]\n"
+        "vmlal.u16 q9, d4, d0[1]\n"
+        "vmlal.u16 q10, d4, d0[2]\n"
+        "vmlal.u16 q11, d4, d0[3]\n"
+        "vmlal.u16 q12, d6, d0[0]\n"
+        "vmlal.u16 q13, d6, d0[1]\n"
+        "vmlal.u16 q14, d6, d0[2]\n"
+        "vmlal.u16 q15, d6, d0[3]\n"
+
+        // Multiply-accumulate, level of depth 1
+        "vmlal.u16 q4, d3, d1[0]\n"
+        "vmlal.u16 q5, d3, d1[1]\n"
+        "vmlal.u16 q6, d3, d1[2]\n"
+        "vmlal.u16 q7, d3, d1[3]\n"
+        "vmlal.u16 q8, d5, d1[0]\n"
+        "vmlal.u16 q9, d5, d1[1]\n"
+        "vmlal.u16 q10, d5, d1[2]\n"
+        "vmlal.u16 q11, d5, d1[3]\n"
+        "vmlal.u16 q12, d7, d1[0]\n"
+        "vmlal.u16 q13, d7, d1[1]\n"
+        "vmlal.u16 q14, d7, d1[2]\n"
+        "vmlal.u16 q15, d7, d1[3]\n"
+
+        // Loop. Decrement loop index (depth) by 2, since we just handled 2
+        // levels of depth (Kernel::kDepth=2).
+        "subs %[run_depth], #2\n"
+        "bne loop_NEONKernel12x4Depth2_%=\n"
+
+
+
+Packing code adapts to the format chosen by the kernel
+======================================================
+
+As explained in doc/design.txt, gemmlowp starts by packing blocks of the
+lhs and rhs matrices for optimally efficient traversal by the kernel. This
+depends on fine details of the kernel format, in ways that can only be
+efficiently handled by knowing these kernel format details at compile-time.
+
+This is the reason why all the code in internal/pack.h is templated in
+the corresponding kernel format.
+
+The code in internal/pack.h isn't tightly optimized by itself, but it is
+structured in such a way that the critical code is in a template,
+  PackingRegisterBlock,
+that can easily be specialized to override the slow generic code with
+fast specific packing code for specific formats, on specific platforms.
+
+See internal/pack_neon.h which provides NEON specializations of the
+packing code for the particular kernel formats that are used by the NEON
+kernels in internal/kernel_neon.h.
+
+
+Wrapping up: how to optimize gemmlowp for a CPU architecture
+============================================================
+
+In conclusion, the key feature of gemmlowp when it comes to efficiently
+supporting a specific CPU architecture is that it allows one to freely replace
+the inner loop of the GEMM by providing one's own GEMM kernel, which is
+also free to dictate its required data layout; each data layout then also
+needs optimized packing code. The steps are thus:
+  1) Freely design a GEMM kernel with a freely chosen data layout
+  2) Implement the GEMM kernel, similar to internal/kernel_neon.h
+  3) Implement the optimized packing code, similar to internal/pack_neon.h.
diff --git a/doc/less-than-8-bit.md b/doc/less-than-8-bit.md
new file mode 100644
index 0000000..07cc858
--- /dev/null
+++ b/doc/less-than-8-bit.md
@@ -0,0 +1,313 @@
+# Computation with less than 8 bits in gemmlowp
+
+## Introduction
+
+We assume familiarity with gemmlowp's low-precision uint8 computation paradigm,
+which is described in [low-precision.md](low-precision.md).
+
+This document is about the possibility of further reducing precision below 8
+bits.
+
+That allows higher arithmetic throughput on some architectures, at the
+cost of decreased accuracy.
+
+## The past, present, and future of less-than-8-bit computation in gemmlowp
+
+A meta note is needed here as to how this fits with the general gemmlowp design.
+
+### The past
+
+Less-than-8-bit computation was initially designed and implemented in gemmlowp
+as a drop-in replacement for regular 8bit computation, a plain optimization. The
+idea was to automatically requantize 8bit operands to less-than-8bit during
+the O(N^2) packing stage, then take advantage of the lower bit depth during the
+O(N^3) compute stage. For large enough matrices, that should be worth it.
+
+### The present
+
+TODO(benoitjacob): update this documentation. This 'present' state just
+became the past (February 2017).
+
+At the moment, this less-than-8-bit mode of gemmlowp is not much used in
+practice, because the implicit requantization of operands from 8bit to
+less-than-8bit turned out to be more expensive than initially expected, both in
+terms of speed and accuracy:
+
+1.  Speed: the O(N^2) requantization is only negligible compared to the O(N^3)
+    compute kernel when the matrix size N is large enough; in practice, smaller
+    matrix sizes turned out to be very important, making the requantization
+    approach slower than expected.
+
+2.  Accuracy: As neural networks were optimized for size, their sensitivity to
+    numerical accuracy increased. Then the approach of requantizing
+    already-quantized data turned out to be more wasteful of accuracy than we
+    could afford.
+
+### The future
+
+Less-than-8bit still probably has good prospects; what should be dropped is the
+requantization. In other words, in the future, we might have neural networks
+trained right away for some bit depth lower than 8 bits. The resulting values
+would probably still be stored as 8 bits (unless the bit depth eventually
+becomes very low). Thus, no particular work would be needed in the packing
+stage; no overhead or loss of accuracy would be incurred anymore.
+
+In other words: the design of less-than-8-bit kernels is probably useful in the
+long run; what is on the way out is requantization and packing/unpacking-stage
+aspects.
+
+With that said, the rest of this page retains its old content about the present
+approach:
+
+## Public interface
+
+### The BitDepthSetting parameter in the EightBitIntGemm interface
+
+Accessing less-than-8-bit computation via the EightBitIntGemm interface is very
+simple: EightBitIntGemm takes a BitDepthSetting enum which allows choosing
+among a fixed set of supported bit-depth combinations.
+
+### The BitDepthParams parameter in the public/gemmlowp.h interface
+
+The public/gemmlowp.h interface exposes more extensive control over
+quantization, by means of a BitDepthParams template parameter, which is a type
+parameter, carrying information about:
+
+1.  The LHS and RHS bit depth, which can be set arbitrarily and independently;
+2.  The 'RoundingStrategy', which is the heuristic used to choose a rounding
+    mode, based on the accumulation size (a.k.a. the "depth" dimension of the
+    Gemm).
+
+Details can be seen in public/bit_depth.h.
+
+### How does BitDepth{Setting,Params} affect input/output uint8 matrix data?
+
+Input/output matrix data is all uint8's, ranging from 0 to 255, regardless of
+the BitDepth{Setting,Params}.
+
+So the BitDepth{Setting,Params} is only an internal detail. It is only meant to
+allow gemmlowp to use lower precision internally, but the input/output data
+format is unaffected.
+
+As far as the API contract goes, the only thing that the
+BitDepth{Setting,Params} does is to relax the accuracy requirement. With
+standard 8bit/8bit computation, gemmlowp is required to return the exact result
+as specified in [low-precision.md](low-precision.md). With lower bit depths,
+gemmlowp is no longer required to return an exact result.
+
+## Implementation
+
+Here we refer to the 3 stages of computation as described in
+[design.md](design.md), namely: packing, computation kernel, unpacking.
+
+The general idea is that at the packing stage, we requantize input (Lhs/Rhs)
+data to less-than-8-bit depths by scaling them, thus shrinking the range of the
+packed matrix entries; for instance, if the Rhs bit depth is to be 5 bits then
+packed Rhs matrix entries will be in the range [0 ... 31]. This then allows the
+GEMM kernel to use narrower accumulators without risking overflow, thus
+achieving higher arithmetic throughput. Finally, at the unpacking stage, it only
+remains to scale the result values to compensate for the scalings applied
+earlier.
+
+Let us go into more detail for each of those stages:
+
+### Packing stage
+
+The packing stage is where most of the work specific to the BitDepthParams takes
+place.
+
+Here, we have to scale input matrix values from their original range of [0 ...
+255] to the range specified by the BitDepthParams, which is [0 ... (2^N)-1]
+where N is the number of bits for the matrix at hand (Lhs or Rhs). For example,
+for a bit depth of 5 bits, we need to scale down to [0 ... 31].
+
+This scaling is what we call "requantization". The pedantic name matches the
+fact that this is actually quite nontrivial to do correctly i.e. in such a way
+that the result accuracy will be good enough for real-world applications. See
+the section below on requantization details.
+
+Concretely, this work happens in PackingRegisterBlock::Pack(), which calls
+Requantize(). This is in internal/pack.h. This code can be overridden for
+specific architectures, see internal/pack_neon.h.
+
+This requantization work is costly and makes packing slower. This means that, at
+least in our approach, less-than-8-bit computation is only interesting for
+large-enough, square-enough GEMMs where packing is only a small fraction of the
+overall cost. In cases where packing overhead is more prevalent (highly
+rectangular cases), less-than-8-bit is probably a waste of time as long as we
+treat it as an internal computation detail. What might help there would be
+shrinking the input/output data format to lower memory bandwidth usage.
+
+### Computation kernel stage
+
+In principle, the computation kernel stage simply doesn't have to care about the
+bit depth at all. In fact, on architectures where we do not have specific
+optimized kernels for less-than-8-bit cases, we simply use our standard kernel
+there, and that's correct!
+
+However, while the kernel doesn't have to know about the fact that the operands
+are on less than 8 bits, it can use that information to make special
+optimizations that would be incorrect in the general 8-bit case and become
+correct here thanks to the more restricted range of inputs. That's the whole
+point of this less-than-8-bit computation idea.
+
+With Lhs entries guaranteed to be smaller than 2^N, and Rhs entries guaranteed
+to be smaller than 2^M, each product is thus guaranteed to be smaller than
+2^(M+N). Thus, one may accumulate 2^(16-(M+N)) such products and still be
+guaranteed that such an accumulator will be smaller than 2^16, and thus can be
+stored as a uint16 without risking overflow.
+
+For example, in the L7R5 case, the Lhs entries are on 7 bits (N=7) and the Rhs
+entries are on 5 bits (M=5), so each product fits in 12 bits and one can thus
+accumulate 16 ( = 2^(16-12)) such products into uint16 accumulators with no risk
+of overflow.
+
+This means that a computation kernel may use uint16 accumulators for several
+loop iterations (16 in the above example), provided that it is allowed to assume
+that inputs are in such restricted range.
+
+After this fixed number of loop iterations, the kernel must accumulate the local
+uint16 accumulators back into global uint32 accumulators.
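+
+As a scalar illustration of this accumulation pattern for the L7R5 case (real
+kernels do the same thing with SIMD registers):
+
+```
+#include <cstdint>
+
+// Sketch only: assumes lhs entries < 2^7 and rhs entries < 2^5, so each
+// product fits in 12 bits and 16 products fit in a uint16 without overflow.
+void AccumulateL7R5(const std::uint8_t* lhs, const std::uint8_t* rhs,
+                    std::uint32_t* global_accum, int depth) {
+  std::uint16_t local = 0;
+  for (int d = 0; d < depth; d++) {
+    local += static_cast<std::uint16_t>(lhs[d] * rhs[d]);
+    // Flush the narrow accumulator into the wide one every 16 products.
+    if ((d & 15) == 15) {
+      *global_accum += local;
+      local = 0;
+    }
+  }
+  *global_accum += local;
+}
+```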
+
+On SIMD architectures with suitable uint16 arithmetic, this can in principle
+multiply arithmetic throughput by up to 2x, since twice as many accumulators now
+fit in each SIMD vector register. This is partially offset by the cost of
+accumulating back into global uint32 accumulators every several loop iterations,
+but our experience on ARM NEON has been that we still get quite close to a 2x
+speedup. See internal/kernel_neon.h, specifically
+NEON32Kernel12x4Depth2Assuming12BitProducts.
+
+### Unpacking stage
+
+At the unpacking stage, it only remains to scale the result values to compensate
+for the scaling of the inputs. This is easier because now we are expanding the
+range instead of shrinking it, so we don't need to worry about ways to minimize
+a loss of accuracy. We simply need to multiply result values by a constant
+fraction, rounding to nearest.
+
+Since the inputs were scaled by factors of (2^lhs_bits - 1)/255 and
+(2^rhs_bits - 1)/255 respectively, the scaling of the outputs needs to be by the
+following factor:
+
+                 255 * 255
+    -----------------------------------
+    (2^lhs_bits - 1) * (2^rhs_bits - 1)
+
+This is done by a MultiplyByConstantFraction function; see internal/unpack.h.
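+
+As a simplified sketch of what this rescaling amounts to (the actual
+MultiplyByConstantFraction differs in its exact form):
+
+```
+#include <cstdint>
+
+// Sketch only: scale a non-negative accumulator by
+// 255*255 / ((2^lhs_bits - 1) * (2^rhs_bits - 1)), rounding to nearest.
+std::int32_t ScaleResult(std::int32_t accum, int lhs_bits, int rhs_bits) {
+  const std::int64_t num = 255 * 255;
+  const std::int64_t den =
+      static_cast<std::int64_t>((1 << lhs_bits) - 1) * ((1 << rhs_bits) - 1);
+  return static_cast<std::int32_t>((accum * num + den / 2) / den);
+}
+```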
+
+## Requantization details
+
+Here we go into more detail on the Requantize() function used at the packing
+stage to requantize input matrix data. See this function in internal/pack.h.
+
+It depends on the bit depth and on a rounding mode, and requantizes an input
+value in [0 ... 255] to the range [0 ... (2^N)-1] specified by the bit depth N.
+
+### Naive, bad rounding, that's plainly biased
+
+Naive and inaccurate ways to achieve this requantization include:
+
+1.  By shifting bits right by (8-N) bits;
+2.  By multiplying by ((2^N) - 1) and dividing by 255.
+
+Both of those are biased in some way: 1. has the wrong "derivative", since it
+approximates (((2^N) - 1) / 255) by ((2^N) / 256) ; 2. has bias since it
+effectively implements rounding towards 0.
+
+In practice, both of the above requantization functions give results that are
+too inaccurate for the neural network that we tried (GoogLeNet).
+
+### Round-to-nearest rounding: unbiased in principle but not in practice
+
+The simplest fix is to avoid the bias in 2. by rounding-to-nearest instead of
+rounding towards 0. This can be achieved by doing
+
+```
+dst = (src * maxval + rounding_offset) / 255;
+```
+
+Where maxval = ((2^N) - 1) is the highest requantized value, and the
+rounding_offset can be set to
+
+```
+rounding_offset = 127
+```
+
+to achieve rounding-to-nearest (while the above rounding towards 0 corresponded
+to rounding_offset = 0).
+
+In principle, rounding-to-nearest is unbiased and optimal in various ways.
+
+In practice though, our input data is not random real numbers, but
+already-quantized 8-bit values. That means that even in the best case, there
+would be at most 255 different possible input values; in practice, we generally
+see the input values distributed non-uniformly in that range, so that a majority
+of input values tend to be in a much smaller range. See test/test_data.cc.
+
+Having a large part of the input values in a very small finite set means that
+the corresponding rounding errors are also in a very small finite set, which can
+be small enough that the mean of these rounding errors is significantly
+different from 0. That rounding-to-nearest is "unbiased" only means that over a
+sufficiently large set of input values, the bias would become arbitrarily close
+to 0; here, the set of input values is effectively small enough that the
+resulting bias is significant.
+
+This leads to biasing the matrix product entries, resulting in an error that
+grows linearly with the depth dimension of the GEMM.
+
+### Probabilistic rounding: unbiased even on small finite input distributions
+
+To address that, we can instead use probabilistic rounding. The idea is that for
+instance if we have to round the value 3.8 to the nearest integer, we can round
+it to 3 with 20% probability and to 4 with probability 80%. If that value 3.8
+occurs many times, the mean requantized value will thus tend to 3.8.
+
+This amounts to keeping the above requantization formula,
+
+```
+dst = (src * maxval + rounding_offset) / 255;
+```
+
+but now the rounding_offset is a random value in [0 .. 254].
+
+This guarantees zero bias no matter how small the distribution of input values
+is.
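+
+The following sketch summarizes the requantization formula with the three
+choices of rounding_offset discussed in this section (illustrative only; the
+actual Requantize() in internal/pack.h depends on the bit depth and rounding
+mode as described above):
+
+```
+#include <cstdint>
+
+// rounding_offset = 0                  : rounding towards 0 (biased).
+// rounding_offset = 127                : round-to-nearest (biased on small input sets).
+// rounding_offset = random in [0, 254] : probabilistic rounding (unbiased).
+std::uint8_t Requantize(std::uint8_t src, int bits, int rounding_offset) {
+  const int maxval = (1 << bits) - 1;
+  return static_cast<std::uint8_t>((src * maxval + rounding_offset) / 255);
+}
+```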
+
+On the other hand, the variance of the error term here is higher than with
+rounding-to-nearest --- one can check that it is 2x higher.
+
+So the error term coming from the Central Limit Theorem, which grows with the
+square root of the accumulator depth i.e. the GEMM depth, will be 2x higher
+here.
+
+Still, for large enough GEMM depth, that is better than rounding-to-nearest
+which has an error term growing linearly with the GEMM depth.
+
+### Switching between rounding-to-nearest and probabilistic rounding
+
+Thus, for fixed input values and bit depths, we expect that probabilistic
+rounding will give more accurate results for large enough GEMM depths, while
+rounding-to-nearest will be more accurate for smaller GEMM depths.
+
+That is why we switch between these rounding modes based on GEMM depth, see
+ChooseRoundingMode in internal/bit_depth_util.h.
+
+It is based on a constant, kProbabilisticRoundingThreshold, defined in
+internal/common.h and empirically determined. See the comment there. It would be
+nice to better understand the statistics here and come up with better heuristics
+for this switching.
+
+### Choice of pseudorandom number generator
+
+We provide two PRNGs. The first is an 8-bit Xorshift. It is fast, naturally
+produces values ranging over an interval of width 255, which is what we need
+here (as opposed to an interval of width 256), and turns out, from empirical
+tests, to produce better results than a linear congruential generator (LCG).
+That's unfortunate, as an 8-bit LCG performs better (we confirmed that on
+various ARM devices), but we need as close to perfect unbiasedness as we can get.
+
+The second is an "add-mod" sequence generator, which generates non-random values
+in the sequence x = (x + 97) % 255. This generates a low-discrepancy sequence
+that minimizes the "clumpiness" of the random offsets (Thus, for example,
+quantizing a 3x3 matrix will have a maximum additive error of about 200 from the
+random offsets). While not random, this sequence performs well empirically for
+many quantizations. (For information about why 97 is a good value, see
+https://en.wikipedia.org/wiki/Low-discrepancy_sequence#Additive_recurrence and
+http://mollwollfumble.blogspot.com/2011/03/subrandom-numbers.html. 97/255 = 0.38,
+close to 0.382, which is the best choice. For discrete numbers, the choice must
+be relatively prime to the modulus. 97 is prime, so it is safely relatively
+prime to 255. 107 is another near-optimal choice.)
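+
+As an illustration of this add-mod sequence (a sketch only; the actual
+generator lives in internal/pack.h):
+
+```
+#include <cstdint>
+
+// Low-discrepancy "add-mod" sequence: x advances by 97 modulo 255, so
+// successive offsets are spread out rather than clumped together.
+class AddModSequence {
+ public:
+  AddModSequence() : x_(0) {}
+  std::uint8_t Next() {
+    x_ = static_cast<std::uint8_t>((x_ + 97) % 255);
+    return x_;
+  }
+
+ private:
+  std::uint8_t x_;
+};
+```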
+
+The low-discrepancy sequence generator is the default.
+
+More details and results are given in a comment on the default PRNG in
+internal/pack.h. Interested users can change the PRNG used by setting
+DefaultRoundingGenerator in bit_depth_util.h.
diff --git a/doc/less-than-8-bit.txt b/doc/less-than-8-bit.txt
new file mode 100644
index 0000000..1a1deaa
--- /dev/null
+++ b/doc/less-than-8-bit.txt
@@ -0,0 +1,305 @@
+         Computation with less than 8 bits in gemmlowp
+         *********************************************
+
+
+Introduction
+============
+
+We assume familiarity with gemmlowp's low-precision uint8 computation
+paradigm, which is described in doc/low-precision.txt.
+
+This document is about the possibility of further reducing precision
+below 8 bits.
+
+That allows higher arithmetic throughput on some architectures,
+at the cost of decreased accuracy.
+
+
+Public interface
+================
+
+
+The BitDepthSetting parameter in the EightBitIntGemm interface
+--------------------------------------------------------------
+
+Accessing less-than-8-bit computation via the EightBitIntGemm interface is
+very simple: EightBitIntGemm takes a BitDepthSetting enum
+which allows choosing among a fixed set of supported bit-depth
+combinations.
+
+
+The BitDepthParams parameter in the public/gemmlowp.h interface
+---------------------------------------------------------------
+
+The public/gemmlowp.h interface exposes more extensive control over
+quantization, by means of a BitDepthParams template parameter,
+which is a type parameter, carrying information about:
+  1. The LHS and RHS bit depth, which can be set arbitrarily and
+     independently;
+  2. The 'RoundingStrategy', which is the heuristic used to choose
+     a rounding mode, based on the accumulation size (a.k.a. the
+     "depth" dimension of the Gemm).
+Details can be seen in public/bit_depth.h.
+
+
+How does BitDepth{Setting,Params} affect input/output uint8 matrix data?
+-------------------------------------------------------------------
+
+Input/output matrix data is all uint8's, ranging from 0 to 255, regardless of
+the BitDepth{Setting,Params}.
+
+So the BitDepth{Setting,Params} is only an internal detail. It is only meant to
+allow gemmlowp to use lower precision internally, but the input/output data
+format is unaffected.
+
+As far as the API contract goes, the only thing that the
+BitDepth{Setting,Params} does is to relax the accuracy requirement.
+With standard 8bit/8bit computation, gemmlowp is required to return the exact
+result as specified in doc/low-precision.txt. With lower bit depths, gemmlowp
+is no longer required to return an exact result.
+
+
+Implementation
+==============
+
+Here we refer to the 3 stages of computation as described in doc/design.txt,
+namely: packing, computation kernel, unpacking.
+
+The general idea is that at the packing stage, we requantize input (Lhs/Rhs)
+data to less-than-8-bit depths by scaling them, thus shrinking the range of
+the packed matrix entries; for instance, if the Rhs bit depth is to be 5 bits
+then packed Rhs matrix entries will be in the range [0 ... 31]. This then
+allows the GEMM kernel to use narrower accumulators without risking overflow,
+thus achieving higher arithmetic throughput. Finally, at the unpacking stage,
+it only remains to scale the result values to compensate for the scalings
+applied earlier.
+
+Let us go into more detail for each of those stages:
+
+
+Packing stage
+-------------
+
+The packing stage is where most of the work specific to the BitDepthParams
+takes place.
+
+Here, we have to scale input matrix values from their original range of
+[0 ... 255] to the range specified by the BitDepthParams, which is
+[0 ... (2^N)-1] where N is the number of bits for the matrix at hand
+(Lhs or Rhs). For example, for a bit depth of 5 bits, we need to scale
+down to [0 ... 31].
+
+This scaling is what we call "requantization". The pedantic name matches
+the fact that this is actually quite nontrivial to do correctly i.e.
+in such a way that the result accuracy will be good enough for real-world
+applications. See the section below on requantization details.
+
+Concretely, this work happens in PackingRegisterBlock::Pack(), which calls
+Requantize(). This is in internal/pack.h. This code can be overridden for
+specific architectures, see internal/pack_neon.h.
+
+This requantization work is costly and makes packing slower. This means
+that, at least in our approach, less-than-8-bit computation is only
+interesting for large-enough, square-enough GEMMs where packing is only
+a small fraction of the overall cost. In cases where packing overhead
+is more prevalent (highly rectangular cases), less-than-8-bit is probably
+a waste of time as long as we treat it as an internal computation detail.
+What might help there would be shrinking the input/output data format
+to lower memory bandwidth usage.
+
+
+Computation kernel stage
+------------------------
+
+In principle, the computation kernel stage simply doesn't have to care
+about the bit depth at all. In fact, on architectures where we do not have
+specific optimized kernels for less-than-8-bit cases, we simply use our
+standard kernel there, and that's correct!
+
+However, while the kernel doesn't have to know about the fact that the
+operands are on less than 8 bits, it can use that information to make
+special optimizations that would be incorrect in the general 8-bit case
+and become correct here thanks to the more restricted range of inputs.
+That's the whole point of this less-than-8-bit computation idea.
+
+With Lhs entries guaranteed to be smaller than 2^N, and Rhs entries
+guaranteed to be smaller than 2^M, each product is thus guaranteed to be
+smaller than 2^(M+N). Thus, one may accumulate 2^(16-(M+N)) such products
+and still be guaranteed that such an accumulator will be smaller than 2^16,
+and thus can be stored as a uint16 without risking overflow.
+
+For example, in the L7R5 case, the Lhs entries are on 7 bits (N=7) and the
+Rhs entries are on 5 bits (M=5), so each product fits in 12 bits and one can
+thus accumulate 16 ( = 2^(16-12)) such products into uint16 accumulators
+with no risk of overflow.
+
+This means that a computation kernel may use uint16 accumulators for
+several loop iterations (16 in the above example), provided that it is
+allowed to assume that inputs are in such restricted range.
+
+After this fixed number of loop iterations, the kernel must accumulate
+the local uint16 accumulators back into global uint32 accumulators.
+
+On SIMD architectures with suitable uint16 arithmetic, this can in principle
+multiply arithmetic throughput by up to 2x, since twice as many
+accumulators now fit in each SIMD vector register. This is partially offset
+by the cost of accumulating back into global uint32 accumulators every
+several loop iterations, but our experience on ARM NEON has been that
+we still get quite close to a 2x speedup. See internal/kernel_neon.h,
+specifically NEON32Kernel12x4Depth2Assuming12BitProducts.
+
+
+Unpacking stage
+---------------
+
+At the unpacking stage, it only remains to scale the result values
+to compensate for the scaling of the inputs. This is easier because
+now we are expanding the range instead of shrinking it, so we don't
+need to worry about ways to minimize a loss of accuracy. We simply
+need to multiply result values by a constant fraction, rounding to nearest.
+
+Since the inputs were scaled by factors of (2^lhs_bits - 1)/255 and
+(2^rhs_bits - 1)/255 respectively, the scaling of the outputs needs to be
+by the following factor:
+
+                 255 * 255
+    -----------------------------------
+    (2^lhs_bits - 1) * (2^rhs_bits - 1)
+
+This is done by a MultiplyByConstantFraction function; see internal/unpack.h.
+
+
+Requantization details
+======================
+
+Here we go into more detail on the Requantize() function used at the packing
+stage to requantize input matrix data. See this function in internal/pack.h.
+
+It depends on the bit depth and on a rounding mode, and requantizes an input
+value in [0 ... 255] to the range [0 ... (2^N)-1] specified by the bit depth N.
+
+
+Naive, bad rounding, that's plainly biased
+------------------------------------------
+
+Naive and inaccurate ways to achieve this requantization include:
+  1. By shifting bits right by (8-N) bits;
+  2. By multiplying by ((2^N) - 1) and dividing by 255.
+
+Both of those are biased in some way: 1. has the wrong "derivative", since it
+approximates (((2^N) - 1) / 255) by ((2^N) / 256) ; 2. has bias since it
+effectively implements rounding towards 0.
+
+In practice, both of the above requantization functions give results that are
+too inaccurate for the neural network that we tried (GoogLeNet).
+
+Round-to-nearest rounding: unbiased in principle but not in practice
+--------------------------------------------------------------------
+
+The simplest fix is to avoid the bias in 2. by rounding-to-nearest instead
+of rounding towards 0. This can be achieved by doing
+
+   dst = (src * maxval + rounding_offset) / 255;
+
+Where maxval = ((2^N) - 1) is the highest requantized value, and the
+rounding_offset can be set to
+
+  rounding_offset = 127
+
+to achieve rounding-to-nearest (while the above rounding towards 0
+corresponded to rounding_offset = 0).
+
+In principle, rounding-to-nearest is unbiased and optimal in various ways.
+
+In practice though, our input data is not random real numbers, but
+already-quantized 8-bit values. That means that even in the best case, there
+would be at most 255 different possible input values; in practice, we generally
+see the input values distributed non-uniformly in that range, so that a majority
+of input values tend to be in a much smaller range. See test/test_data.cc.
+
+Having a large part of the input values in a very small finite set means that
+the corresponding rounding errors are also in a very small finite set, which
+can be small enough that the mean of these rounding errors is significantly
+different from 0. That rounding-to-nearest is "unbiased" only means that over
+a sufficiently large set of input values, the bias would become arbitrarily
+close to 0; here, the set of input values is effectively small enough that the
+resulting bias is significant.
+
+This leads to biasing the matrix product entries, resulting in an error that
+grows linearly with the depth dimension of the GEMM.
+
+
+Probabilistic rounding: unbiased even on small finite input distributions
+-------------------------------------------------------------------------
+
+To address that, we can instead use probabilistic rounding. The idea is that
+for instance if we have to round the value 3.8 to the nearest integer, we can
+round it to 3 with 20% probability and to 4 with probability 80%. If that value
+3.8 occurs many times, the mean requantized value will thus tend to 3.8.
+
+This amounts to keeping the above requantization formula,
+
+   dst = (src * maxval + rounding_offset) / 255;
+
+but now the rounding_offset is a random value in [0 .. 254].
+
+This guarantees zero bias no matter how small the distribution of input values
+is.
+
+On the other hand, the variance of the error term here is higher than with
+rounding-to-nearest --- one can check that it is 2x higher.
+
+So the error term coming from the Central Limit Theorem, which grows with 
+the square root of the accumulator depth i.e. the GEMM depth,
+will be 2x higher here.
+
+Still, for large enough GEMM depth, that is better than rounding-to-nearest
+which has an error term growing linearly with the GEMM depth.
+
+
+Switching between rounding-to-nearest and probabilistic rounding
+----------------------------------------------------------------
+
+Thus, for fixed input values and bit depths, we expect that probabilistic
+rounding will give more accurate results for large enough GEMM depths, while
+rounding-to-nearest will be more accurate for smaller GEMM depths.
+
+That is why we switch between these rounding modes based on GEMM depth,
+see ChooseRoundingMode in internal/bit_depth_util.h.
+
+It is based on a constant, kProbabilisticRoundingThreshold, defined
+in internal/common.h and empirically determined. See the comment there.
+It would be nice to better understand the statistics here and come up
+with better heuristics for this switching.
+
+
+Choice of pseudorandom number generator
+---------------------------------------
+We provide two PRNGs.  The first is an 8-bit Xorshift.
+It is fast, naturally produces values ranging
+over an interval of width 255, which is what we need here (as opposed
+to an interval of width 256), and turns out, from empirical tests,
+to produce better results than a linear congruential generator (LCG).
+That's unfortunate, as an 8-bit LCG performs better (we confirmed that
+on various ARM devices), but we need as close to perfect unbiasedness as
+we can get.
+
+The second is an "add-mod" sequence generator, which generates
+non-random values in the sequence x = (x + 97) % 255.  This
+generates a low-discrepancy sequence that minimizes the "clumpiness"
+of the random offsets (thus, for example, quantizing a 3x3 matrix will
+have a maximum additive error of about 200 from the random offsets).
+While not random, this sequence performs well empirically for many
+quantizations.  (For information about why 97 is a good value, see
+https://en.wikipedia.org/wiki/Low-discrepancy_sequence#Additive_recurrence
+and http://mollwollfumble.blogspot.com/2011/03/subrandom-numbers.html.
+In short, 97/255 = 0.38, and 0.382 is the best choice.  For discrete
+numbers, the increment must be relatively prime to the modulus; 97 is
+prime, so it is safely relatively prime to 255.  107 is another
+near-optimal choice.)
+
+The low-discrepancy sequence generator is the default.
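+
+As a rough illustration (the shift constants here are illustrative; the
+actual generators are implemented in internal/pack.h), the two generators
+can be sketched like this:
+
+  #include <cstdint>
+
+  // 8-bit Xorshift-style generator. Each step is an invertible transform,
+  // so a nonzero state stays nonzero and the returned offsets lie in
+  // [0 .. 254], an interval of width 255.
+  std::uint8_t XorshiftRoundingOffset(std::uint8_t* state) {
+    std::uint8_t x = *state;
+    x ^= x << 7;
+    x ^= x >> 5;
+    x ^= x << 3;
+    *state = x;
+    return x - 1;
+  }
+
+  // "Add-mod" low-discrepancy sequence: x = (x + 97) % 255, staying in
+  // [0 .. 254].
+  std::uint8_t AddModRoundingOffset(std::uint8_t* state) {
+    *state = (*state + 97) % 255;
+    return *state;
+  }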
+
+More details and results are given in a comment on the default
+PRNG in internal/pack.h.  Interested users can change the
+PRNG used by setting DefaultRoundingGenerator in bit_depth_util.h.
diff --git a/doc/low-precision.md b/doc/low-precision.md
new file mode 100644
index 0000000..97b1498
--- /dev/null
+++ b/doc/low-precision.md
@@ -0,0 +1,192 @@
+# The low-precision paradigm in gemmlowp, and how it's implemented
+
+## Introduction
+
+"Low-precision" means that the input and output matrix entries are integers on
+at most 8 bits. The scalar type is uint8_t.
+
+This isn't the same as just doing plain matrix arithmetic over uint8_t, because
+that would overflow. To avoid overflow, we internally accumulate results on more
+than 8 bits, and at the end we keep only some significant 8 bits. This relies on
+the caller providing suitable offset/multiplier/shift parameters, which
+effectively govern how we extract some significant 8 bits from our more-than-8bit
+temporary accumulators.
+
+## Low-precision paradigms
+
+gemmlowp is flexible enough to support multiple low-precision paradigms, i.e.
+multiple ways that a meaning is attached to 8bit values so that a computation
+can rely on an 8bit GEMM provided by gemmlowp.
+
+### The current flexible design with arbitrary "output pipelines".
+
+See [output.md](output.md) for more details about output pipelines. This is a
+mechanism by which gemmlowp becomes generic enough to support multiple 8bit
+computation paradigms, by allowing the user to set up a chain of transformations
+to be performed on internal 32bit accumulators to obtain the final outputs.
+
+The public entry point in [public/gemmlowp.h](../public/gemmlowp.h) allowing one
+to set up an arbitrary output pipeline is `GemmWithOutputPipeline`.
+
+Refer to [quantization.md](quantization.md) for details of how one gets from
+first principles to the actual output pipelines to assemble for successful
+real-world quantized calculations.
+
+For the scope of the present document, it suffices to say that quantized matrix
+multiplication takes the following parameters:
+
+-   The lhs matrix of uint8 quantized values.
+-   The rhs matrix of uint8 quantized values.
+-   An int32 lhs_offset, that will be added to each entry of the lhs matrix.
+-   An int32 rhs_offset, that will be added to each entry of the rhs matrix.
+-   An output pipeline, that will process int32 accumulators into final outputs.
+
+The overall computation goes through the following steps, sketched in code
+below the list:
+
+1.  Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
+2.  Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
+3.  Compute the int32 matrix product of the resulting lhs times rhs.
+4.  Apply the output pipeline on these int32 accumulators, to obtain the final
+    outputs.
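+
+The sketch below is minimal and unoptimized, for an lhs of shape
+`rows x depth` and an rhs of shape `depth x cols`; `lhs(m, k)`, `rhs(k, n)`,
+`result(m, n)` and `ApplyOutputPipeline` are placeholders for this sketch,
+not gemmlowp APIs:
+
+```
+for (int m = 0; m < rows; m++) {
+  for (int n = 0; n < cols; n++) {
+    std::int32_t accumulator = 0;
+    for (int k = 0; k < depth; k++) {
+      const std::int32_t lhs_value = lhs(m, k) + lhs_offset;  // step 1.
+      const std::int32_t rhs_value = rhs(k, n) + rhs_offset;  // step 2.
+      accumulator += lhs_value * rhs_value;                   // step 3.
+    }
+    result(m, n) = ApplyOutputPipeline(accumulator);          // step 4.
+  }
+}
+```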
+
+### The legacy low-precision paradigm
+
+This older paradigm is the one exposed by the following entry points:
+
+*   In [public/gemmlowp.h](../public/gemmlowp.h), the `Gemm` entry point.
+*   The deprecated `eight_bit_int_gemm` directory.
+
+Originally, gemmlowp started as an implementation of the (now deprecated)
+EightBitIntGemm paradigm, where quantized matrix multiplication takes the
+following input parameters:
+
+-   the lhs matrix of uint8 quantized values
+-   the rhs matrix of uint8 quantized values
+-   the following int32 "quantization parameters", which control how the uint8
+    quantized values in the matrices are to be interpreted during the matrix
+    computation:
+    -   lhs_offset
+    -   rhs_offset
+    -   result_offset
+    -   result_mult_int
+    -   result_shift
+
+In that legacy paradigm, the mathematical expression to be computed is the
+result of the following steps:
+
+1.  Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
+2.  Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
+3.  Compute the int32 matrix product of the resulting lhs times rhs.
+4.  Add result_offset to each entry of the result.
+5.  Multiply each entry of the result by the following fraction, and round to
+    the nearest integer:
+
+```
+result_mult_int
+---------------                             (1)
+2^result_shift
+```
+
+6.  Clamp the resulting int32 values to the `[0..255]` range and cast to uint8.
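+
+Putting steps 4. to 6. together, here is an illustrative sketch (not the
+actual gemmlowp implementation) of the legacy requantization of a single
+int32 accumulator; note the plain int32 multiplication, which is the
+overflow risk discussed below:
+
+```
+#include <algorithm>
+#include <cstdint>
+
+std::uint8_t LegacyRequantize(std::int32_t accumulator,
+                              std::int32_t result_offset,
+                              std::int32_t result_mult_int,
+                              std::int32_t result_shift) {
+  std::int32_t value = accumulator + result_offset;  // step 4.
+  // Step 5: multiply by result_mult_int / 2^result_shift, rounding to
+  // nearest (assumes result_shift >= 1).
+  value = (value * result_mult_int + (1 << (result_shift - 1))) >> result_shift;
+  // Step 6: clamp to [0..255] and cast to uint8.
+  value = std::min<std::int32_t>(255, std::max<std::int32_t>(0, value));
+  return static_cast<std::uint8_t>(value);
+}
+```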
+
+Again, this paradigm is not recommended for new usage. See
+[quantization.md](quantization.md) for how, reasoning from first principles, one
+arrives at a substantially different quantization paradigm.
+
+In addition, note that the integer multiplication by the numerator in the above
+step 5. risks overflowing. That concern is avoided in the currently recommended
+output stages by performing a fixed-point multiplication instead of an ordinary
+integer multiplication.
+
+## Efficient handling of offsets
+
+At first glance it may seem like the above-described quantized computation
+scheme requires adding the lhs_offset and rhs_offset to each of the lhs and rhs
+matrix entries.
+
+Doing that in the GEMM kernel would incur substantial overhead:
+
+-   It would mean extra arithmetic work in the GEMM kernel;
+-   It would require storing the lhs_offset and rhs_offset in registers, which
+    would eat into the register space available for the rest of the GEMM kernel.
+
+One may then consider adding the lhs_offset and rhs_offset once and for all to
+lhs and rhs blocks, in a GEMM implementation operating on one lhs block and one
+rhs block at a time. However, doing so would require storing lhs and rhs blocks
+in 32 bit (or at least in 16 bit in real-world cases), which would partially
+negate the memory bandwidth benefits of low-precision computation.
+
+Fortunately, there is another way to handle these offsets that has none of the
+costs of the approaches described above. The idea is as follows.
+
+Let `P` denote the matrix shaped like `lhs`, but filled with 1's.
+
+Let `Q` denote the matrix shaped like `rhs`, but filled with 1's.
+
+Adding lhs_offset to each entry of `lhs` means adding `lhs_offset * P` to
+`lhs`.
+
+Adding rhs_offset to each entry of `rhs` means adding `rhs_offset * Q` to
+`rhs`.
+
+Thus, as far as handling `lhs_offset` and `rhs_offset` goes, the matrix product
+to be computed is:
+
+```
+(lhs + lhs_offset * P) * (rhs + rhs_offset * Q)
+```
+
+Expanding this (using distributivity of matrix multiplication over addition), we
+see that the above product is equal to the following sum of 4 terms:
+
+```
+  lhs * rhs                                 (2)
++ lhs_offset * P * rhs
++ lhs * rhs_offset * Q
++ lhs_offset * rhs_offset * P * Q
+```
+
+The first term, `lhs * rhs`, is just the matrix multiplication ignoring the
+offsets, i.e. as if `lhs_offset==rhs_offset==0`. Our claim here is that this is
+all that we have to compute in the GEMM kernel.
+
+In the second term, `lhs_offset * P * rhs`, notice that since P is filled with
+1's, `P * rhs` has all its rows equal to each other, and equal to the row-vector
+of sums of all the entries in each column of rhs.
+
+Thus, we can compute the second term, `lhs_offset * P * rhs`, by summing each
+column of rhs. This produces a single row-vector, and in order to add the second
+term, we simply need to add this row-vector (multiplied by lhs_offset) to each
+row of the result. This is just a rank one update of the result (equivalently,
+the second term is a rank one matrix), and we can efficiently store it as a
+single vector.
+
+The third term, `lhs * rhs_offset * Q`, is entirely similar to the second one,
+and can be similarly computed by summing each row of lhs, storing this in a
+single column-vector, and later multiplying these sums by rhs_offset.
+
+The fourth term is a single constant, repeated into all the entries of the
+matrix. The matrix `P * Q` is filled with the single constant value 'depth' (the
+depth of the matrix product i.e. the number of columns of the lhs). Thus the
+fourth term is simply the rank zero update adding this constant to each matrix
+entry:
+
+```
+lhs_offset * rhs_offset * depth
+```
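+
+To make the bookkeeping concrete, here is a sketch of how the four terms of
+(2) can be combined for a single result entry at row `r`, column `c`. The
+names `kernel_accumulator`, `rhs_sums_of_each_column` and
+`lhs_sums_of_each_row` are illustrative, not actual gemmlowp identifiers:
+
+```
+// rhs_sums_of_each_column[c] holds the sum of column c of rhs, and
+// lhs_sums_of_each_row[r] holds the sum of row r of lhs.
+std::int32_t result_entry =
+    kernel_accumulator(r, c)                   // term 1: lhs * rhs
+    + lhs_offset * rhs_sums_of_each_column[c]  // term 2: rank one update
+    + rhs_offset * lhs_sums_of_each_row[r]     // term 3: rank one update
+    + lhs_offset * rhs_offset * depth;         // term 4: constant
+```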
+
+## Implementation of this technique in gemmlowp
+
+In gemmlowp, at the packing stage (where we traverse blocks of the lhs and rhs
+to prepare them for efficient repeated traversal by the kernel), we compute the
+sum of each row of the lhs block and the sum of each column of the rhs block.
+
+See in [internal/pack.h](../internal/pack.h), in the PackedSideBlock class, the
+following member:
+
+```
+// Handle on the additional buffer backing the vector of sums of slices
+// associated with this block. Owned.
+Allocator::Handle sums_of_each_slice_handle_;
+```
+
+sums_of_each_slice_handle_ is the handle to the buffer allocated to store the
+vector containing sums of rows of lhs, or of sums of columns of rhs.
+
+After these rank one updates have been computed at the packing stage, they are
+ignored at the compute kernel stage, since that stage is only concerned with the
+first of the four terms in (2); they are only used at the unpacking stage. See
+the default/reference implementation, `UnpackResultImpl`, in
+[internal/unpack.h](../internal/unpack.h).
diff --git a/doc/low-precision.txt b/doc/low-precision.txt
new file mode 100644
index 0000000..893961f
--- /dev/null
+++ b/doc/low-precision.txt
@@ -0,0 +1,159 @@
+      The low-precision paradigm in gemmlowp, and how it's implemented
+      ****************************************************************
+
+
+Introduction
+============
+
+"Low-precision" means that the input and output matrix entries are integers
+on at most 8 bits. The scalar type is uint8_t.
+
+This isn't the same as just doing plain matrix arithmetic over uint8_t,
+because that would overflow. To avoid overflow, we internally accumulate
+results on more than 8 bits, and at the end we keep only some significant
+8 bits. This relies on the caller providing suitable offset/multiplier/shift
+parameters, which effectively govern how we extract some significant 8 bits
+from our more-than-8bit temporary accumulators.
+
+Gemmlowp supports further reducing precision below 8 bits. That is not
+the subject of this document; for that, refer to doc/less-than-8-bit.txt.
+
+
+The low-precision paradigm
+==========================
+
+gemmlowp is an implementation of the EightBitIntGemm paradigm, where quantized
+matrix multiplication takes the following input parameters:
+  - the lhs matrix of uint8 quantized values
+  - the rhs matrix of uint8 quantized values
+  - the following int32 "quantization parameters", which control how the
+    uint8 quantized values in the matrices are to be interpreted during the
+    matrix computation:
+    - lhs_offset
+    - rhs_offset
+    - result_offset
+    - result_mult_int
+    - result_shift
+
+The mathematical expression to be computed is the result of the following steps:
+  1. Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
+  2. Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
+  3. Compute the int32 matrix product of the resulting lhs times rhs.
+  4. Add result_offset to each entry of the result.
+  5. Multiply each entry of the result by the following fraction, and round
+     to the nearest integer:
+
+                        result_mult_int
+                        ---------------                                   (1)
+                        2^result_shift
+
+  6. Clamp the resulting int32 values to the [0..255] range and cast to uint8.
+
+Thus the caller of this interface is expected to have precomputed suitable
+quantization parameters.
+
+The rationale for these parameters is as follows:
+  - The three offsets may improve quantization accuracy in cases where the
+    range of values is limited, and they also conveniently allow reducing all
+    eight combinations of signednesses to just the unsigned*unsigned->unsigned
+    case. One may at first glance worry that these offsets would incur
+    substantial overhead to the GEMM computation, but that is actually not the
+    case thanks to a trick described below (see "Efficient handling of
+    offsets").
+  - The result_mult_int and result_shift parameters allow approximating
+    arbitrarily closely any real multiplier, as a fraction of the form given
+    in (1) above, without using floating-point arithmetic and without using
+    a division instruction (only a right shift).
+
+
+Efficient handling of offsets
+=============================
+
+At first glance it may seem like the above-described quantized computation
+scheme requires adding the lhs_offset and rhs_offset to each of the lhs and
+rhs matrix entries.
+
+Doing that in the GEMM kernel would incur substantial overhead:
+  - It would mean extra arithmetic work in the GEMM kernel;
+  - It would require storing the lhs_offset and rhs_offset in registers,
+    which would eat into the register space available for the rest of the
+    GEMM kernel.
+
+One may then consider adding the lhs_offset and rhs_offset once and for all
+to lhs and rhs blocks, in a GEMM implementation operating on one lhs block
+and one rhs block at a time. However, doing so would require storing lhs and
+rhs blocks in 32 bit (or at least in 16 bit in real-world cases), which would
+partially negate the memory bandwidth benefits of low-precision computation.
+
+Fortunately, there is another way to handle these offsets that has none of
+the costs of the approaches described above. The idea is as follows.
+
+Let P denote the matrix shaped like lhs, but filled with 1's.
+Let Q denote the matrix shaped like rhs, but filled with 1's.
+
+Adding lhs_offset to each entry of lhs, means adding lhs_offset * P to lhs.
+Adding rhs_offset to each entry of rhs, means adding rhs_offset * Q to rhs.
+
+Thus, as far as handling lhs_offset and rhs_offset goes, the matrix product to be
+computed is:
+
+  (lhs + lhs_offset * P) * (rhs + rhs_offset * Q)
+
+Expanding this (using distributivity of matrix multiplication over addition),
+we see that the above product is equal to the following sum of 4 terms:
+
+    lhs * rhs                                                             (2)
+  + lhs_offset * P * rhs
+  + lhs * rhs_offset * Q
+  + lhs_offset * rhs_offset * P * Q
+
+The first term, lhs * rhs, is just the matrix multiplication ignoring the
+offsets, i.e. as if lhs_offset==rhs_offset==0. Our claim here is that this
+is all that we have to compute in the GEMM kernel.
+
+In the second term, lhs_offset * P * rhs, notice that since P is filled
+with 1's, P * rhs has all its rows equal to each other, and equal to the
+row-vector of sums of all the entries in each column of rhs.
+
+Thus, we can compute the second term, lhs_offset * P * rhs, by summing
+each column of rhs. This produces a single row-vector, and in order to add the
+second term, we simply need to add this row-vector (multiplied by lhs_offset)
+to each row of the result. This is just a rank one update of the result
+(equivalently, the second term is a rank one matrix), and we can efficiently
+store it as a single vector.
+
+The third term, lhs * rhs_offset * Q, is entirely similar to the second one,
+and can be similarly computed by summing each row of lhs, storing this in a
+single column-vector, and later multiplying these sums by rhs_offset.
+
+The fourth term is a single constant, repeated into all the entries of the
+matrix. The matrix P * Q is filled with the single constant value 'depth'
+(the depth of the matrix product, i.e. the number of columns of the lhs).
+Thus the fourth term is simply the rank zero update adding this constant
+to each matrix entry:
+
+  lhs_offset * rhs_offset * depth
+
+
+Implementation of this technique in gemmlowp
+============================================
+
+In gemmlowp, at the packing stage (where we traverse blocks of the lhs and rhs
+to prepare them for efficient repeated traversal by the kernel), we compute
+the sum of each row of the lhs block and the sum of each column of the rhs
+block.
+
+See in internal/pack.h, in the PackedSideBlock class, the following member:
+
+  // Handle on the additional buffer backing the vector of sums of slices
+  // associated with this block. Owned.
+  Allocator::Handle sums_of_each_slice_handle_;
+
+sums_of_each_slice_handle_ is the handle to the buffer allocated to store
+the vector containing sums of rows of lhs, or of sums of columns of rhs.
+
+After these rank one updates have been computed at the packing stage, they are
+ignored at the compute kernel stage, since that stage is only concerned
+with the first of the four terms in (2); they are only used at the unpacking
+stage. See the default/reference implementation, UnpackResultImpl, in
+internal/unpack.h.
diff --git a/doc/output.md b/doc/output.md
new file mode 100644
index 0000000..f9985a8
--- /dev/null
+++ b/doc/output.md
@@ -0,0 +1,53 @@
+# Output pipelines in gemmlowp
+
+In gemmlowp, the "output pipeline" is the process that takes a final `int32`
+accumulator value (the output of the compute/kernel stage), and processes it to
+obtain the final value (typically a `uint8` value) and write it to the
+destination matrix.
+
+Gemmlowp has some genericity in what arithmetic transformations take place in
+the output pipeline, so as to allow different users to implement different
+quantization paradigms. See [low-precision.md](low-precision.md) and
+[quantization.md](quantization.md).
+
+Besides implementing a quantization paradigm, the other thing that output
+pipelines are good for is implementing fused operations where a matrix
+multiplication feeds into other operations applied to its result, without
+additional array traversals. For instance, when implementing neural network
+inference, one might have a Convolutional layer with a bias-addition and an
+activation. One then wants to feed the result of the matrix multiplication
+implementing the Convolutional operator itself, directly into the bias-addition
+and activation function. gemmlowp's output pipelines allow implementing that:
+the bias-addition and activation function are just additional stages in the
+output pipeline.
+
+## Usage
+
+The gemmlowp entry point allowing the use of an arbitrary output pipeline is
+`GemmWithOutputPipeline` in [public/gemmlowp.h](../public/gemmlowp.h).
+
+The output pipeline is specified as a `std::tuple` of "output stages", each of
+which defines an elementary arithmetic transformation.
+
+All available output stages are defined in
+[public/output_stages.h](../public/output_stages.h).
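+
+For instance, a simple pipeline that scales down `int32` accumulators and then
+saturating-casts them to `uint8` might be assembled as follows (the numeric
+values are illustrative only; see the unit test below for complete, working
+examples):
+
+```
+gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale quantize_down_stage;
+quantize_down_stage.result_offset = 101;
+quantize_down_stage.result_mult_int = 80;
+quantize_down_stage.result_shift = 7;
+gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
+const auto output_pipeline =
+    std::make_tuple(quantize_down_stage, saturating_cast_stage);
+```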
+
+## Example usage
+
+The best place to see examples of using various output pipelines is the unit
+test,
+
+```
+test/test.cc
+```
+
+specifically in this function:
+
+```
+TestOutputStages
+```
+
+Separately, a self-contained example showing how to use gemmlowp to compute a
+quantized matrix multiplication with a sound quantization paradigm is here:
+
+[doc/quantization_example.cc](quantization_example.cc)
diff --git a/doc/packing.md b/doc/packing.md
new file mode 100644
index 0000000..891236d
--- /dev/null
+++ b/doc/packing.md
@@ -0,0 +1,204 @@
+# The packing stage in gemmlowp
+
+## Introduction
+
+We assume familiarity with [design.md](design.md) and with the overall 3 stages
+of computations described there: packing, kernel, unpacking.
+
+This page goes into more details about the first stage: packing.
+
+We also assume familiarity with [kernel.md](kernel.md) as it describes the
+packed format requirements that the kernels expect, and that forms basically the
+contract that the packing stage must honor.
+
+Some parts below also assume familiarity with
+[low-precision.md](low-precision.md) as the packing stage also has to compute
+the vectors of sums of rows or columns as described there.
+
+## The storage order of packed blocks, partly hidden behind sequential access
+
+As explained in [design.md](design.md), the primary purpose of packing is to
+ensure that when the kernel traverses Lhs/Rhs matrix data, it can do so
+efficiently thanks to having the data stored in an order that is as similar as
+possible to the order in which the compute stage has to traverse this data.
+
+This traversal order is nontrivial for the reasons outlined in
+[design.md](design.md): at the innermost level, one tries to work within
+registers as much as possible; at the next level, one tries to stay within L1
+cache as much as possible. The packed blocks that we handle are supposed to fit
+entirely in L2 cache.
+
+Thus it has become standard in GEMM design to describe complicated "Z-order" or
+"fractal order" storage for packed blocks.
+
+However, we should keep in mind that the whole point of the packed storage order
+is to be as similar as possible to the order of traversal during the compute
+stage. The storage order doesn't matter in itself; the only thing that matters
+is simple access patterns during the compute stage.
+
+This suggests the following approach to implementing packing: take the exact
+same hierarchy of nested loops of the compute stage, drop the loops that are not
+relevant to the side (Lhs or Rhs) being packed, and try to use mostly sequential
+access to the destination packed data.
+
+This hierarchy of nested loops can be seen in PackSideBlockImpl (PackL2, PackL1,
+PackRun), compare to the similar hierarchy of loops in internal/compute.h.
+
+In this way, the more intricate small-scale details of the packed data format
+never need to be made explicit (which would be complicated). We still use some
+"seeking" but only at larger scales, where the storage order is less
+complicated to describe.
+
+### Sequential access to PackedSideBlock data
+
+See PackedSideBlock in internal/pack.h, specifically the following data members:
+
+```
+// Handle on the buffer backing this packed block. Owned.
+Allocator::Handle data_handle_;
+```
+
+and:
+
+```
+// pos_ is the current position in the buffer, which we access
+// sequentially, like a file.
+// The idea is that we pack data in the same order as it is
+// going to be traversed during the computation, which for
+// cache-friendliness reasons is complicated to random-access,
+// as the offsets calculations would be intricate. So we
+// give up random-access addressing, and instead content ourselves
+// with sequential access.
+//
+// pos_ is mutable because during the computation we will want to
+// be able to iterate on the data in a const PackedSideBlock.
+mutable int pos_;
+```
+
+The methods exposing sequential access are:
+
+```
+std::uint8_t* current_data() {
+  return allocator_->GetPointer<std::uint8_t>(data_handle_) + pos_;
+}
+```
+
+and:
+
+```
+void seek_next_cell() const { pos_ += KernelSideFormat::Cell::kSize; }
+
+void seek_forward_n_cells(int n) const {
+  pos_ += n * KernelSideFormat::Cell::kSize;
+}
+```
+
+### Random access to PackedSideBlock data at larger scales
+
+We still need some random access at larger scales (with high granularity), which
+is unavoidable since GEMM is O(n^3) and has to traverse each of the O(n^2)
+inputs O(n) times.
+
+The watershed between sequential access and random access is at the level of a
+'Run'. Throughout gemmlowp we consistently use the term 'Run' to refer to the
+innermost GEMM loop in the depth dimension. That's the critical inner loop that
+must be as fast as possible, thus for which we absolutely want sequential access
+to packed data so that the storage order is optimal by construction. At larger
+scales i.e. between runs, we accept that the storage order is less optimal and
+since it's also less intricate, it's not too hard to implement random access
+there.
+
+This is done by the seek_run method:
+
+```
+void seek_run(int start_width, int start_depth) const {
+  int kernel_run_depth =
+      std::min<int>(params_.l1_depth, params_.l2_depth - start_depth);
+  pos_ = params_.l2_width * start_depth + start_width * kernel_run_depth;
+}
+```
+
+We see that the formula involves the l1_depth parameter, which is how the packed
+storage order depends on L1 cache size. Again, the whole packed block is
+supposed to fit in L2 cache.
+
+## The innermost loop of the packing stage, PackRun, and PackingRegisterBlock
+
+Keeping with our consistent usage of the term 'Run' throughout gemmlowp, the
+innermost loop is called PackRun().
+
+Here we recall a very important principle that was explained in
+[kernel.md](kernel.md): the kernel is free to dictate the precise data format
+that it expects; the packing code has to honor it. So there's an asymmetry here:
+the kernel is the master, the packing is the slave. That's why the packing code
+is templatized in the KernelSideFormat. At larger scales, the packing is
+independent of kernel format details, but inside PackRun is where we take care
+of the small-scale details that do depend on the kernel format details. That's
+why it's a good thing that we only need sequential access here, as it would be
+very complicated to spell out random access at this scale.
+
+Anyway, PackRun.
+
+Since it is the critical inner loop, it is what we want to allow specializing
+for particular CPU architectures. To allow that, we handle blocks of fixed
+dimensions at a time, which is intended to be friendly enough to optimization.
+These blocks are PackingRegisterBlock's and their dimensions are:
+
+```
+  width = KernelWidth
+  depth = kRegisterSize
+```
+
+See [kernel.md](kernel.md) and internal/kernel.h for the former, and
+internal/common.h for the latter.
+
+See the comments around PackingRegisterBlock in internal/pack.h:
+
+```
+// A PackingRegisterBlock is a small fixed-size block of a matrix being
+// packed. This class is the generic non-optimized implementation,
+// it is inherited by the generic implementation of PackingRegisterBlock,
+// which may be overriden by template specialization. Overriding it is how
+// one may provide optimized packing code paths.
+//
+// The packing of a block proceeds in two steps:
+//   1. Ensuring that we have a complete block of source data, i.e. a block of
+//      the compile-time prescribed size. This is where we handle unaligned
+//      boundaries: if we don't have a complete block of source data, then
+//      we copy and zero-extend it into a local temporary (complete_src_),
+//      see MakeCompleteSrc. In the generic case, we do have a complete block,
+//      so we just use it in-place, see UseCompleteSrcInPlace.
+//   2. Packing a complete block into the destination, see Pack. This is the
+//      most critical part, so it's convenient that unaligned boundaries have
+//      already been handled in step 1.
+```
+
+## Other things that the packing stage has to do
+
+Besides storing matrix entries in a suitable order, the packing stage also has
+two other things to do.
+
+First, packing has to compute the vectors of sums of entries along the depth
+dimension. If this sounds mysterious, read [low-precision.md](low-precision.md).
+These will only be used at the unpacking stage.
+
+Second, if the BitDepthSetting requires less than 8 bits of precision, then at
+the packing stage we have to requantize inputs accordingly. See
+[less-than-8-bit.md](less-than-8-bit.md) for details. This is the Requantize()
+function.
+
+## Specialized packing paths for specific formats on specific CPU architectures
+
+Please refer to internal/pack_neon.h for examples of doing that. The piece of
+code to be specialized is PackingRegisterBlock. However, inside of it, only the
+Pack() method typically needs to be specialized (the rest is unlikely to be
+critical). So one typically specializes PackingRegisterBlock while still
+inheriting from PackingRegisterBlockBase to keep the generic parts, and
+overrides only the Pack() method.
+
+Template specialization for the right template parameters is how one specifies
+in which case a given path is to be used in place of the generic packing code.
+
+It is entirely possible to set the value of kRegisterSize differently based on
+the CPU architecture (for example, 32 on x86 with AVX) as long as all the
+specialized packing paths used on that CPU architecture are consistent with it.
diff --git a/doc/packing.txt b/doc/packing.txt
new file mode 100644
index 0000000..2b94512
--- /dev/null
+++ b/doc/packing.txt
@@ -0,0 +1,206 @@
+                    The packing stage in gemmlowp
+                    *****************************
+
+
+Introduction
+============
+
+We assume familiarity with doc/design.txt and with the overall
+3 stages of computations described there: packing, kernel, unpacking.
+
+This page goes into more details about the first stage: packing.
+
+We also assume familiarity with doc/kernel.txt as it describes the
+packed format requirements that the kernels expect, and that forms
+basically the contract that the packing stage must honor.
+
+Some parts below also assume familiarity with doc/low-precision.txt
+as the packing stage also has to compute the vectors of sums of rows or
+columns as described there.
+
+
+The storage order of packed blocks, partly hidden behind sequential access
+==========================================================================
+
+As explained in doc/design.txt, the primary purpose of packing is to
+ensure that when the kernel traverses Lhs/Rhs matrix data, it can do
+so efficiently thanks to having the data stored in an order that is
+as similar as possible to the order in which the compute stage has to
+traverse this data.
+
+This traversal order is nontrivial for the reasons outlined in
+doc/design.txt: at the innermost level, one tries to work within registers
+as much as possible; at the next level, one tries to stay within L1 cache
+as much as possible. The packed blocks that we handle are supposed
+to fit entirely in L2 cache.
+
+Thus it has become standard in GEMM design to describe complicated
+"Z-order" or "fractal order" storage for packed blocks.
+
+However, we should keep in mind that the whole point of the packed storage
+order is to be as similar as possible to the order of traversal during
+the compute stage. The storage order doesn't matter in itself; the only
+thing that matters is simple access patterns during the compute stage.
+
+This suggests the following approach to implementing packing: take the
+exact same hierarchy of nested loops of the compute stage, drop the loops
+that are not relevant to the side (Lhs or Rhs) being packed, and try to use
+mostly sequential access to the destination packed data.
+
+This hierarchy of nested loops can be seen in PackSideBlockImpl
+(PackL2, PackL1, PackRun), compare to the similar hierarchy of loops
+in internal/compute.h.
+
+In this way, the more intricate small-scale details of the packed data format
+never need to be made explicit (which would be complicated). We still use
+some "seeking" but only at larger scales, where the storage order is less
+complicated to describe.
+
+
+Sequential access to PackedSideBlock data
+-----------------------------------------
+
+See PackedSideBlock in internal/pack.h, specifically the following data
+members:
+
+  // Handle on the buffer backing this packed block. Owned.
+  Allocator::Handle data_handle_;
+
+and:
+
+  // pos_ is the current position in the buffer, which we access
+  // sequentially, like a file.
+  // The idea is that we pack data in the same order as it is
+  // going to be traversed during the computation, which for
+  // cache-friendliness reasons is complicated to random-access,
+  // as the offsets calculations would be intricate. So we
+  // give up random-access addressing, and instead content ourselves
+  // with sequential access.
+  //
+  // pos_ is mutable because during the computation we will want to
+  // be able to iterate on the data in a const PackedSideBlock.
+  mutable int pos_;
+
+The methods exposing sequential access are:
+
+  std::uint8_t* current_data() {
+    return allocator_->GetPointer<std::uint8_t>(data_handle_) + pos_;
+  }
+
+and:
+
+  void seek_next_cell() const { pos_ += KernelSideFormat::Cell::kSize; }
+
+  void seek_forward_n_cells(int n) const {
+    pos_ += n * KernelSideFormat::Cell::kSize;
+  }
+
+
+Random access to PackedSideBlock data at larger scales
+------------------------------------------------------
+
+We still need some random access at larger scales (with high granularity),
+which is unavoidable since GEMM is O(n^3) and has to traverse each of the
+O(n^2) inputs O(n) times.
+
+The watershed between sequential access and random access is at the level
+of a 'Run'. Throughout gemmlowp we consistently use the term 'Run' to refer
+to the innermost GEMM loop in the depth dimension. That's the critical
+inner loop that must be as fast as possible, thus for which we absolutely
+want sequential access to packed data so that the storage order is optimal
+by construction. At larger scales i.e. between runs, we accept that
+the storage order is less optimal and since it's also less intricate, it's
+not too hard to implement random access there.
+
+This is done by the seek_run method:
+
+  void seek_run(int start_width, int start_depth) const {
+    int kernel_run_depth =
+        std::min<int>(params_.l1_depth, params_.l2_depth - start_depth);
+    pos_ = params_.l2_width * start_depth + start_width * kernel_run_depth;
+  }
+
+We see that the formula involves the l1_depth parameter, which is how the
+packed storage order depends on L1 cache size. Again, the whole packed
+block is supposed to fit in L2 cache.
+
+
+The innermost loop of the packing stage, PackRun, and PackingRegisterBlock
+==========================================================================
+
+Keeping with our consistent usage of the term 'Run' throughout gemmlowp,
+the innermost loop is called PackRun().
+
+Here we recall a very important principle that was explained in
+doc/kernel.txt: the kernel is free to dictate the precise data format
+that it expects; the packing code has to honor it. So there's an asymmetry
+here: the kernel is the master, the packing is the slave. That's why the
+packing code is templatized in the KernelSideFormat. At larger scales,
+the packing is independent of kernel format details, but inside PackRun is
+where we take care of the small-scale details that do depend on the kernel
+format details. That's why it's a good thing that we only need sequential
+access here, as it would be very complicated to spell out random access
+at this scale.
+
+Anyway, PackRun.
+
+Since it is the critical inner loop, it is what we want to allow specializing
+for particular CPU architectures. To allow that, we handle blocks of fixed
+dimensions at a time, which is intended to be friendly enough to optimization.
+These blocks are PackingRegisterBlock's and their dimensions are:
+  width = KernelWidth
+  depth = kRegisterSize
+See doc/kernel.txt and internal/kernel.h for the former, and internal/common.h
+for the latter.
+
+See the comments around PackingRegisterBlock in internal/pack.h:
+
+// A PackingRegisterBlock is a small fixed-size block of a matrix being
+// packed. This class is the generic non-optimized implementation,
+// it is inherited by the generic implementation of PackingRegisterBlock,
+// which may be overriden by template specialization. Overriding it is how
+// one may provide optimized packing code paths.
+//
+// The packing of a block proceeds in two steps:
+//   1. Ensuring that we have a complete block of source data, i.e. a block of
+//      the compile-time prescribed size. This is where we handle unaligned
+//      boundaries: if we don't have a complete block of source data, then
+//      we copy and zero-extend it into a local temporary (complete_src_),
+//      see MakeCompleteSrc. In the generic case, we do have a complete block,
+//      so we just use it in-place, see UseCompleteSrcInPlace.
+//   2. Packing a complete block into the destination, see Pack. This is the
+//      most critical part, so it's convenient that unaligned boundaries have
+//      already been handled in step 1.
+
+
+Other things that the packing stage has to do
+=============================================
+
+Besides storing matrix entries in a suitable order, the packing stage also has
+two other things to do.
+
+First, packing has to compute the vectors of sums of entries along the depth
+dimension. If this sounds mysterious, read doc/low-precision.txt. These will
+only be used at the unpacking stage.
+
+Second, if the BitDepthSetting requires less than 8 bits of precision, then at
+the packing stage we have to requantize inputs accordingly.
+See doc/less-than-8-bit.txt for details. This is the Requantize() function.
+
+
+Specialized packing paths for specific formats on specific CPU architectures
+============================================================================
+
+Please refer to internal/pack_neon.h for examples of doing that. The piece
+of code to be specialized is PackingRegisterBlock. However, inside of it,
+only the Pack() method typically needs to be specialized (the rest is unlikely
+to be critical). So one typically specializes PackingRegisterBlock while still
+inheriting from PackingRegisterBlockBase to keep the generic parts, and
+overrides only the Pack() method.
+
+Template specialization for the right template parameters is how one specifies
+in which case a given path is to be used in place of the generic packing code.
+
+It is entirely possible to set the value of kRegisterSize differently based on
+the CPU architecture (for example, 32 on x86 with AVX) as long as all the
+specialized packing paths used on that CPU architecture are consistent with it.
diff --git a/doc/public.md b/doc/public.md
new file mode 100644
index 0000000..935f6db
--- /dev/null
+++ b/doc/public.md
@@ -0,0 +1,161 @@
+# Gemmlowp's public entry points
+
+gemmlowp's public interface is defined in
+[public/gemmlowp.h](../public/gemmlowp.h).
+
+## GemmWithOutputPipeline
+
+The primary public entry point is: `GemmWithOutputPipeline`.
+
+A usage example is given in
+[doc/quantization_example.cc](quantization_example.cc).
+
+The high-level overview of how this specifies a low-precision matrix
+multiplication is explained in [low-precision.md](low-precision.md). The
+rationale for a specific quantization paradigm is given in
+[quantization.md](quantization.md). That specific quantization paradigm is
+implemented at two different stages of the computation: as pre-processing on
+the operands and as post-processing on the result:
+
+*   Pre-processing on the LHS, RHS operands, in the form of adding constants
+    `lhs_offset`, `rhs_offset` to them, is explained in
+    [low-precision.md](low-precision.md).
+
+*   Post-processing on the result, in the form of a flexible "output pipeline",
+    is explained in [output.md](output.md).
+
+More details on this below as we discuss specific function parameters.
+
+The prototype is:
+
+```
+template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
+          MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
+          typename OutputPipelineType, typename GemmContextType>
+void GemmWithOutputPipeline(GemmContextType* context,
+                            const MatrixMap<const InputScalar, LhsOrder>& lhs,
+                            const MatrixMap<const InputScalar, RhsOrder>& rhs,
+                            MatrixMap<OutputScalar, ResultOrder>* result,
+                            int lhs_offset, int rhs_offset,
+                            const OutputPipelineType& output_pipeline);
+```
+
+A typical call looks like (from the [usage example](quantization_example.cc)):
+
+```
+gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
+                                 gemmlowp::DefaultL8R8BitDepthParams>(
+    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix,
+    &uint8_result_matrix, lhs_offset, rhs_offset, output_pipeline);
+```
+
+### Template parameters
+
+Typically only the first 3 template parameters need to be specified, the rest
+being automatically deduced from function parameters:
+
+*   `InputScalar`: The scalar type of the LHS and RHS operands. At the moment,
+    this must be `std::uint8_t`.
+*   `OutputScalar`: The scalar type of the result matrix. At the moment, this
+    must be `std::uint8_t`.
+*   `BitDepthParams`: Defines the bit format of the input and output matrices
+    and the required accuracy of the computation. At the moment, the only
+    non-deprecated valid value is `gemmlowp::DefaultL8R8BitDepthParams`. See
+    [less-than-8-bit.md](less-than-8-bit.md) for other values and the general
+    idea of this, and how it may become more useful in the future.
+
+The other template parameters, which typically do not need to be specified, are:
+
+*   `LhsOrder`, `RhsOrder`, `ResultOrder`: the storage orders (row-major or
+    column-major) of the LHS, RHS, result matrices. See
+    [public/map.h](../public/map.h). See the below performance note: we
+    recommend using respectively RowMajor, ColMajor, ColMajor for optimal
+    performance.
+*   `OutputPipelineType`: the actual `std::tuple` type of the output pipeline.
+    See below explanation of the `output_pipeline` parameter, and
+    [output.md](output.md).
+*   `GemmContextType`: the type of the `context` parameter. At the moment, this
+    must be `gemmlowp::GemmContext`.
+
+### Function parameters
+
+The function parameters taken by `GemmWithOutputPipeline` are:
+
+*   `context`: The `gemmlowp::GemmContext` object holding state and resources to
+    be used for this gemmlowp call.
+*   `lhs`, `rhs`: The LHS and RHS operand matrices. Note that these are
+    `MatrixMap` objects, mapping external buffers as matrices, not owning data.
+    See [public/map.h](../public/map.h).
+*   `result`: pointer to the destination `MatrixMap` object, which must be
+    already constructed, wrapping the external destination buffer with the
+    wanted destination matrix shape and storage layout. No memory allocation
+    will be performed by gemmlowp for the destination buffer. See
+    [public/map.h](../public/map.h).
+*   `lhs_offset`, `rhs_offset` are constants added to each matrix entry in the
+    LHS, RHS matrices respectively, as explained in
+    [low-precision.md](low-precision.md). This is only the part of the
+    quantization paradigm explained in [quantization.md](quantization.md) that
+    needs to be implemented as operations on the operands; everything else is
+    operations on the result, see `output_pipeline`.
+*   `output_pipeline` is a `std::tuple` of output stages (see
+    [public/output_stages.h](../public/output_stages.h)), specifying the output
+    pipeline (see [output.md](output.md)). This is the part of the quantization
+    paradigm explained in [quantization.md](quantization.md) that needs to be
+    implemented as operations on the result matrix.
+
+### Performance note on storage orders
+
+gemmlowp supports arbitrary combinations of storage orders for the LHS, RHS and
+result matrices. However, not all are equally optimized for.
+
+Because gemmlowp is primarily aimed at neural network inference workloads,
+optimization focus is on this particular combination of storage orders:
+
+*   `LhsOrder=RowMajor`
+*   `RhsOrder=ColMajor`
+*   `ResultOrder=ColMajor`
+
+The rationale is that the LHS is typically the constant weights of a neural
+network layer (e.g. the weights of a Convolutional layer implemented as a matrix
+multiplication), while the RHS and result are neural network activations,
+respectively the input and output activations of the layer.
+
+Because the RHS and result are activations, we want them to share the same
+storage order -- so that one layer's output activations can be readily used as
+the next layer's input activations. Thus, we focus on `RhsOrder=ResultOrder`.
+
+We also know from general considerations on matrix multiplication that it is
+slightly more efficient to have the direction of accumulation (the "depth"
+dimension) be the direction of contiguous storage in memory. That means that it
+is always going to be slightly easier and more efficient to have
+`LhsOrder=RowMajor` and `RhsOrder=ColMajor`.
+
+Putting this together, we arrive at gemmlowp's focus on the above-described
+combination of storage orders.
+
+Using other storage orders will typically mean taking less efficient paths in
+the packing and unpacking stages, see [packing.md](packing.md). The compute
+kernel stage ([kernel.md](kernel.md)) is unaffected.
+
+## GemmWithOutputPipelinePC
+
+This is a variant where `lhs_offset` and `rhs_offset` may be vectors instead of
+scalars. They are then broadcast against the LHS, RHS respectively.
+
+This is useful for some flavors of neural network inference with "per-channel
+quantization", whence the PC suffix. This has been useful in some settings where
+a neural network trained in float arithmetic was subsequently quantized. On the
+other hand, retraining neural networks for quantized inference tends to remove
+the need for per-channel quantization. For that reason, the long-term usefulness
+of this entry point is in question.
+
+## Gemm
+
+This is gemmlowp's original, now legacy and deprecated, entry point. See the
+section of [low-precision.md](low-precision.md) on the legacy quantization
+paradigm. Avoid in new code.
+
+## The eight_bit_int_gemm directory
+
+As explained in the top-level [README.md](../README.md#public-interfaces), this
+is entirely deprecated.
diff --git a/doc/quantization.md b/doc/quantization.md
new file mode 100644
index 0000000..3e0df16
--- /dev/null
+++ b/doc/quantization.md
@@ -0,0 +1,346 @@
+# Building a quantization paradigm from first principles
+
+**TLDR:** If you prefer example code over theory, look at
+[doc/quantization_example.cc](quantization_example.cc).
+
+## Overview
+
+gemmlowp allows performing calculations on matrices of uint8 values, but these
+matrices are only useful insofar as they somehow approximate matrices of real
+numbers. By a _quantization paradigm_ we mean a correspondence between matrices
+of quantized 8bit values and matrices of real numbers. The choice of a
+quantization paradigm affects the calculations that gemmlowp itself needs to
+perform; specifically, it affects how one goes from internal 32bit accumulators
+to final 8bit outputs.
+
+The part of gemmlowp transforming internal 32bit accumulators to final
+8bit outputs is the "output pipeline" described in [output.md](output.md).
+
+gemmlowp's `GemmWithOutputPipeline` entry point allows specifying an arbitrary
+output pipeline, allowing the user to implement their own preferred quantized
+arithmetic paradigm.
+
+In the present document, our purpose is to show how, reasoning from first
+principles and some domain-specific knowledge of neural networks, we can arrive
+naturally at some specific quantization paradigm, and how that can be
+implemented using a specific output pipeline.
+
+We also aim to show how that differs from the older, legacy quantization
+paradigm implemented by gemmlowp's legacy interfaces and why the change to the
+newer quantization paradigm described in this document was useful as far as some
+applications of gemmlowp were concerned.
+
+## Quantization as an affine map.
+
+In order for arithmetic on real values to map directly to arithmetic on
+quantized uint8 values, the mapping between real and quantized uint8 values must
+be affine, which means it must be of the form
+
+```
+real_value = A * quantized_value + B             (1)
+```
+
+for some constants A, B, or equivalently, of the form
+
+```
+real_value = C * (quantized_value + D)           (2)
+```
+
+for some constants C, D. Indeed, anything other than such an affine map would
+mean that the result of the quantized calculations would no longer readily
+provide an approximation to the result of the real-numbers calculation.
+
+## Domain-specific constraint: the real value 0 must be exactly representable.
+
+Here a domain-specific constraint from neural networks appears: for some neural
+network layers, it is very useful for optimized implementations that the
+real value 0 be exactly representable.
+
+For instance, in a Convolutional or Pooling layer with padding, it is useful to
+be able to implement the padding by zero-padding the input array, so that
+optimized loops do not need to become more complex to avoid overrunning the
+array bounds.
+
+In order for such zero-padding to be feasible in a quantized implementation of
+such layers, it is important that the real value '0' be exactly representable in
+quantized form, i.e. that it correspond exactly to some quantized value, which
+we call the _zero-point_.
+
+Indeed, if '0' were not exactly representable, then we would have to use some
+quantized value for padding, that does not exactly correspond to the real value
+'0'. That would typically introduce inaccuracy in the result. In fact, always
+using the same such value would be worse: it would introduce _bias_ in the
+result.
+
+## The final form of the quantization equation
+
+Now let us phrase what this constraint &mdash; that the real value 0 be exactly
+representable &mdash; means in either of the quantization equations (1) and (2).
+
+In equation (1), plugging `real_value = 0` and `quantized_value = zero_point`,
+we get:
+
+```
+0 = A * zero_point + B
+```
+
+equivalently:
+
+```
+zero_point = -B / A
+```
+
+We are thus left with a rather awkward constraint: the real number `-B / A` must
+somehow be guaranteed to be exactly integral, so that the special uint8 value
+`zero_point` can be exactly equal to it. Quite awkward!
+
+Now let us look at equation (2). Plugging `real_value = 0` and
+`quantized_value = zero_point`, we get:
+
+```
+0 = C * (zero_point + D)
+```
+
+Conveniently, the constant `C` plays no role anymore, so this equation
+simplifies to:
+
+```
+0 = zero_point + D
+```
+
+In other words, `D = -zero_point`. This suggests rewriting the quantization
+equation (2) into the following form (3), which will be the final form that we
+will consistently use:
+
+```
+real_value = scale * (quantized_value - zero_point)        (3)
+```
+
+To go from (2) to (3), we merely renamed `C` to `scale` and `D` to
+`-zero_point`.
+
+With this quantization equation (3), the condition that 0 be exactly
+representable is vacuously satisfied: `zero_point` is by definition one of the
+possible `quantized_value`'s, and equation (3) maps it to a `real_value` of
+exactly 0.
+
+Note that the final quantization equation (3) depends on two constants, one
+integral, the other an arbitrary positive real number:
+
+*   `zero_point` is integral; more specifically, it is one of the possible
+    quantized values (i.e. typically a uint8 value).
+*   `scale` is a positive real number. Thus at this stage we have not yet shown
+    how to eliminate all usage of floating-point arithmetic. That will come
+    below.
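+
+As a minimal sketch of what equation (3) means in code (see
+[doc/quantization_example.cc](quantization_example.cc) for a complete worked
+example; these helper functions are written here just for illustration):
+
+```
+#include <algorithm>
+#include <cmath>
+#include <cstdint>
+
+float Dequantize(std::uint8_t quantized_value, float scale,
+                 std::uint8_t zero_point) {
+  // Equation (3): real_value = scale * (quantized_value - zero_point).
+  return scale *
+         (static_cast<int>(quantized_value) - static_cast<int>(zero_point));
+}
+
+std::uint8_t Quantize(float real_value, float scale, std::uint8_t zero_point) {
+  // Inverse of (3), rounding to nearest and clamping to the uint8 range.
+  const float transformed = zero_point + real_value / scale;
+  const float clamped = std::min(255.f, std::max(0.f, transformed));
+  return static_cast<std::uint8_t>(std::round(clamped));
+}
+```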
+
+## Quantizing a matrix multiplication
+
+Now that we know &mdash; equation (3) &mdash; how real numbers are to correspond
+to quantized values (typically uint8), we turn to applying this knowledge to
+rewriting a multiplication of matrices of real numbers, by the equivalent
+multiplication of matrices of quantized values.
+
+Say that we have two matrices of real values `lhs_real_matrix`,
+`rhs_real_matrix`. Each entry of their product is the sum (accumulation) of many
+products of individual matrix entries, say `lhs_real_value * rhs_real_value`.
+
+Now suppose that we have already quantized these two matrices according to the
+above equation (3), with some already-known quantization parameters `lhs_scale`,
+`rhs_scale`, `lhs_zero_point`, `rhs_zero_point`, so that their matrix entries
+are quantized as
+
+```
+lhs_real_value[i] = lhs_scale * (lhs_quantized_value[i] - lhs_zero_point)
+rhs_real_value[i] = rhs_scale * (rhs_quantized_value[i] - rhs_zero_point)
+```
+
+We then rewrite the matrix product accumulator accordingly:
+
+```
+result_real_value
+  = Sum_over_i(lhs_real_value[i] * rhs_real_value[i])
+  = Sum_over_i(
+        lhs_scale * (lhs_quantized_value[i] - lhs_zero_point) *
+        rhs_scale * (rhs_quantized_value[i] - rhs_zero_point)
+    )
+  = lhs_scale * rhs_scale * Sum_over_i(
+        (lhs_quantized_value[i] - lhs_zero_point) *
+        (rhs_quantized_value[i] - rhs_zero_point)
+    )                                                      (4)
+```
+
+Now our goal is to represent this result itself as a quantized matrix, i.e.
+still according to equation (3), for some pre-established quantization
+parameters `result_scale` and `result_zero_point`, as
+
+```
+result_real_value = result_scale *
+    (result_quantized_value - result_zero_point)
+```
+
+Here we need to keep in mind that our goal is to specify what the quantized
+matrix multiplication should do, i.e. how to compute `result_quantized_value`.
+The last equation above is equivalent to
+
+```
+result_quantized_value = result_zero_point +
+    result_real_value / result_scale
+```
+
+Now we can use equation (4) above to plug into this the expression of
+result_real_value in terms of the quantized operands, and we obtain:
+
+```
+result_quantized_value = result_zero_point +
+    (lhs_scale * rhs_scale / result_scale) *
+        Sum_over_i(
+            (lhs_quantized_value[i] - lhs_zero_point) *
+            (rhs_quantized_value[i] - rhs_zero_point)
+        )                                                  (5)
+```
+
+Equation (5) is the conclusion of this general discussion of how to specify what
+"quantized matrix multiplication" should actually compute, in order to be able
+to replace real matrix multiplications.
+
+## Implementation of quantized matrix multiplication
+
+Having obtained the mathematical form (5) of quantized matrix multiplication, we
+now turn to its actual implementation.
+
+The inner-most part of (5),
+
+```
+int32_accumulator =
+    Sum_over_i(
+        (lhs_quantized_value[i] - lhs_zero_point) *
+        (rhs_quantized_value[i] - rhs_zero_point)
+    )
+```
+
+is the "kernel" accumulation loop. It is where the bulk of the computational
+cost goes. Luckily, it only involves integers: the quantized operands matrix
+entries, and their `zero_point` quantization parameters. Typically, all of these
+values are uint8. Typically, the above differences of uint8 values would be
+represented as signed int16; their products as signed int32.
+
+It is out of scope of the present doc to discuss how to avoid the overhead of
+having to subtract these `zero_point` constants in this inner loop; refer to
+[this section of
+low-precision.md](low-precision.md#efficient-handling-of-offsets) for that. The
+gist of it is that a mathematical trick allows us to take the handling of these
+`zero_point` constants out of this accumulation loop, so that it simplifies to
+
+```
+int32_accumulator =
+    Sum_over_i(
+      lhs_quantized_value[i] *
+      rhs_quantized_value[i]
+    )                                                      (6)
+```
+
+Anyway, the result is a `int32_accumulator` that we now plug back into the rest
+of (5):
+
+```
+result_quantized_value = result_zero_point +
+    (lhs_scale * rhs_scale / result_scale) * int32_accumulator       (7)
+```
+
+The difficulty here is of course that `(lhs_scale * rhs_scale / result_scale)`
+is a positive real number, not an integer in general. It is a constant, though.
+So what we have to implement here is the (approximate) scaling of an int32 value
+by some arbitrary positive constant multiplier.
+
+Moreover, it is safe to assume that this positive constant multiplier is smaller
+than one &mdash; each of the `scale` values here is typically smaller than one,
+as we are typically mapping the `[0..255]` quantized uint8 value range to an
+interval of real values that is much narrower than that, typically within
+`[-10,10]` in most neural networks. For example, a neural network using Relu6
+activation functions will typically have real activation values in the interval
+[0,6].
+
+So how do we implement the multiplication of an int32 value by a positive real
+constant that is smaller than one? Typically, by multiplying by a fixed-point
+constant multiplier in the normalized interval `[1/2,1)`, and then right-shifting
+the result by the appropriate amount to achieve the correct overall multiplier.
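+
+One way to precompute such a normalized fixed-point multiplier and right shift
+from the real multiplier is sketched below; see
+[doc/quantization_example.cc](quantization_example.cc) for a complete version
+of this kind of helper:
+
+```
+#include <cassert>
+#include <cmath>
+#include <cstdint>
+
+// Decomposes 'real_multiplier' in (0, 1) into a fixed-point multiplier in
+// [1/2, 1), stored as an int32 with 31 fractional bits, plus a right shift:
+//   real_multiplier ~= (quantized_multiplier / 2^31) * 2^(-right_shift)
+void QuantizeMultiplierSmallerThanOne(double real_multiplier,
+                                      std::int32_t* quantized_multiplier,
+                                      int* right_shift) {
+  assert(real_multiplier > 0. && real_multiplier < 1.);
+  int shift = 0;
+  // Bring real_multiplier into the normalized interval [1/2, 1).
+  while (real_multiplier < 0.5) {
+    real_multiplier *= 2.0;
+    shift++;
+  }
+  // Represent the normalized multiplier as a fixed-point int32.
+  std::int64_t q = static_cast<std::int64_t>(
+      std::round(real_multiplier * static_cast<double>(1ll << 31)));
+  if (q == (1ll << 31)) {
+    // Rounding can bump the value up to exactly 1.0; compensate.
+    q /= 2;
+    shift--;
+  }
+  *quantized_multiplier = static_cast<std::int32_t>(q);
+  *right_shift = shift;
+}
+```
+
+The int32 accumulator is then multiplied by `quantized_multiplier` in
+fixed-point arithmetic, and the product is right-shifted by `right_shift`.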
+
+At this point we have obtained the int32 value of the product
+
+```
+(lhs_scale * rhs_scale / result_scale) * int32_accumulator
+```
+
+Looking at (7), it only remains to add to it the integral value
+`result_zero_point`, and we are done.
+
+## How this is implemented in gemmlowp
+
+The different parts of gemmlowp implementing aspects of the above discussion
+are:
+
+*   The packing stage (see [packing.md](packing.md)) implements the special
+    mathematical trick to handle `lhs_offset`, `rhs_offset` that we alluded to
+    above, see [this section of
+    low-precision.md](low-precision.md#efficient-handling-of-offsets) for
+    details. Thanks to it, the rest of the calculation can proceed as if
+    `lhs_offset`, `rhs_offset` were 0.
+
+*   The compute/kernel stage (see [kernel.md](kernel.md)) performs the core
+    accumulation loop producing the `int32_accumulator`, see equation (6) above.
+
+*   The unpacking stage feeds into the output pipeline (see
+    [output.md](output.md)), which implements the rest of the evaluation of the
+    above equation (5), that we discussed in the previous section.
+
+The point of gemmlowp's flexible output-pipelines mechanism (see
+[output.md](output.md)) is to support different quantization paradigms, so we
+now have to specify which flavor of output pipeline corresponds to the
+quantization paradigm detailed above in this document.
+
+The specific output pipeline stage implementing the present quantization
+paradigm, i.e. implementing the precise computation detailed in the previous
+section (equation (5)), is
+`OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint`.
+
+Please refer to the comment explaining it in
+[public/output_stages.h](../public/output_stages.h).
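+
+For reference, here is how such an output pipeline is assembled, mirroring the
+construction in [doc/quantization_example.cc](quantization_example.cc) (the
+variable names follow the notation of this document; the offline-computed
+fixed-point multiplier, right shift and result zero point are plugged into the
+corresponding fields):
+
+```
+gemmlowp::OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint
+    quantize_down_stage;
+quantize_down_stage.result_offset_after_shift = result_zero_point;
+quantize_down_stage.result_fixedpoint_multiplier = quantized_multiplier;
+quantize_down_stage.result_shift = right_shift;
+gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
+const auto output_pipeline =
+    std::make_tuple(quantize_down_stage, saturating_cast_stage);
+```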
+
+## How this differs from the older legacy gemmlowp quantization paradigm
+
+The difference between the older legacy quantization paradigm described in
+[low-precision.md](low-precision.md) and the newer one described in this
+document boils down to the difference between the legacy output stage
+implementing it, `OutputStageQuantizeDownInt32ToUint8Scale`, and the new output
+stage implementing the new paradigm,
+`OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint`.
+
+Please refer to the comments in
+[public/output_stages.h](../public/output_stages.h) for details about these two
+output stages and how they differ.
+
+Issues with the old output stage `OutputStageQuantizeDownInt32ToUint8Scale` are:
+
+1.  The int32 accumulators (inputs to the output stage) undergo a plain int32
+    multiplication with an int32 multiplier, which may overflow. By contrast, in
+    the newer `OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint`, this
+    integer multiplication becomes a fixed-point multiplication and cannot
+    overflow.
+
+    *   In practice, to limit the risk of overflow, this pushes users to choose
+        smaller values for this integer multiplier, which limits multiplicative
+        accuracy and may introduce a multiplicative bias depending on how it is
+        used.
+
+2.  Note how the order of multiplying by the multiplier and adding the
+    `result_offset` is swapped. This reflects a quantization equation of the
+    form (1) above, as opposed to the form (2)/(3) that the new quantization
+    paradigm uses (see the schematic comparison below). As a result, it is
+    essentially impossible to guarantee that 0
+    is an exactly-representable value, which as discussed above is an issue at
+    least in some convolutional neural network applications.
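+
+Schematically, and glossing over rounding details, the two output stages
+compute the following (pseudocode; `RoundingRightShift` stands for a rounding
+arithmetic right shift and `FixedPointMul` for the saturating, rounding,
+doubling high multiplication; see
+[public/output_stages.h](../public/output_stages.h) for the authoritative
+definitions):
+
+```
+// Legacy OutputStageQuantizeDownInt32ToUint8Scale: the offset is added before
+// a plain int32 multiplication, which may overflow.
+result = RoundingRightShift(
+    (int32_accumulator + result_offset) * result_mult_int, result_shift)
+
+// New OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint: a non-overflowing
+// fixed-point multiplication, then a rounding right shift, then the offset.
+result = result_offset_after_shift + RoundingRightShift(
+    FixedPointMul(int32_accumulator, result_fixedpoint_multiplier),
+    result_shift)
+```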
+
+## Example code illustrating the new quantization paradigm
+
+Example code showing how to perform a quantized matrix multiplication in the
+quantization paradigm discussed here is in
+[doc/quantization_example.cc](quantization_example.cc).
diff --git a/doc/quantization_example.cc b/doc/quantization_example.cc
new file mode 100644
index 0000000..4368de2
--- /dev/null
+++ b/doc/quantization_example.cc
@@ -0,0 +1,391 @@
+// Example code illustrating the theory exposed in doc/quantization.md
+
+/* Command line to build and run on x86:
+
+c++ doc/quantization_example.cc -I . --std=c++11 -msse4.1 -lpthread \
+  -o /tmp/quantization_example && \
+/tmp/quantization_example
+
+*/
+
+#include <algorithm>
+#include <cassert>
+#include <cmath>
+#include <cstdint>
+#include <iostream>
+#include <limits>
+#include <random>
+#include <vector>
+#include "../public/gemmlowp.h"
+#include "../public/output_stages.h"
+
+// We will handle both float and quantized matrices, which we will
+// represent as gemmlowp::MatrixMap.
+// We will need to be able to print them.
+
+// Output a matrix to a std::ostream
+template <typename tScalar, gemmlowp::MapOrder tOrder>
+std::ostream& operator<<(std::ostream& s,
+                         const gemmlowp::MatrixMap<tScalar, tOrder>& m) {
+  for (int i = 0; i < m.rows(); i++) {
+    for (int j = 0; j < m.cols(); j++) {
+      if (j) {
+        s << '\t';
+      }
+      s << static_cast<float>(m(i, j));
+    }
+    s << '\n';
+  }
+  return s;
+}
+
+// Find the min and max value in a float matrix.
+template <gemmlowp::MapOrder tOrder>
+void FindMinMax(const gemmlowp::MatrixMap<float, tOrder>& m, float* min,
+                float* max) {
+  *min = *max = m(0, 0);
+  for (int i = 0; i < m.rows(); i++) {
+    for (int j = 0; j < m.cols(); j++) {
+      const float val = m(i, j);
+      *min = std::min(*min, val);
+      *max = std::max(*max, val);
+    }
+  }
+}
+
+// A structure to hold quantization parameters 'scale' and 'zero_point'
+// as discussed in doc/quantization.md. As explained there, the meaning
+// of these values is as the constants in the quantization equation
+//
+//   real_value = scale * (quantized_value - zero_point)
+//
+// In other words, 'zero_point' is the quantized value that corresponds
+// to the real value 0, and 'scale' is the difference of real values
+// corresponding to consecutive quantized values.
+struct QuantizationParams {
+  float scale;
+  std::uint8_t zero_point;
+};
+
+// Given the min and max values of a float array, return
+// reasonable quantization parameters to use for this array.
+QuantizationParams ChooseQuantizationParams(float min, float max) {
+  // We extend the [min, max] interval to ensure that it contains 0.
+  // Otherwise, we would not meet the requirement that 0 be an exactly
+  // representable value.
+  min = std::min(min, 0.f);
+  max = std::max(max, 0.f);
+
+  // the min and max quantized values, as floating-point values
+  const float qmin = 0;
+  const float qmax = 255;
+
+  // First determine the scale.
+  const double scale = (max - min) / (qmax - qmin);
+
+  // Zero-point computation.
+  // First the initial floating-point computation. The zero-point can be
+  // determined from solving an affine equation for any known pair
+  // (real value, corresponding quantized value).
+  // We know two such pairs: (rmin, qmin) and (rmax, qmax).
+  // Let's use the first one here.
+  const double initial_zero_point = qmin - min / scale;
+
+  // Now we need to nudge the zero point to be an integer
+  // (our zero points are integer, and this is motivated by the requirement
+  // to be able to represent the real value "0" exactly as a quantized value,
+  // which is required in multiple places, for example in Im2col with SAME
+  // padding).
+  std::uint8_t nudged_zero_point = 0;
+  if (initial_zero_point < qmin) {
+    nudged_zero_point = qmin;
+  } else if (initial_zero_point > qmax) {
+    nudged_zero_point = qmax;
+  } else {
+    nudged_zero_point =
+        static_cast<std::uint8_t>(std::round(initial_zero_point));
+  }
+
+  QuantizationParams result;
+  result.scale = scale;
+  result.zero_point = nudged_zero_point;
+  return result;
+}
+
+template <gemmlowp::MapOrder tLhsOrder, gemmlowp::MapOrder tRhsOrder,
+          gemmlowp::MapOrder tResultOrder>
+void FloatMatrixMultiplication(
+    const gemmlowp::MatrixMap<const float, tLhsOrder>& lhs,
+    const gemmlowp::MatrixMap<const float, tRhsOrder>& rhs,
+    gemmlowp::MatrixMap<float, tResultOrder>* result) {
+  assert(lhs.cols() == rhs.rows());
+  assert(lhs.rows() == result->rows());
+  assert(rhs.cols() == result->cols());
+  for (int i = 0; i < lhs.rows(); i++) {
+    for (int k = 0; k < rhs.cols(); k++) {
+      (*result)(i, k) = 0;
+      for (int j = 0; j < lhs.cols(); j++) {
+        (*result)(i, k) += lhs(i, j) * rhs(j, k);
+      }
+    }
+  }
+}
+
+void Quantize(const QuantizationParams& qparams, const std::vector<float>& src,
+              std::vector<std::uint8_t>* dst) {
+  assert(src.size() == dst->size());
+  for (std::size_t i = 0; i < src.size(); i++) {
+    const float real_val = src[i];
+    const float transformed_val = qparams.zero_point + real_val / qparams.scale;
+    const float clamped_val = std::max(0.f, std::min(255.f, transformed_val));
+    (*dst)[i] = static_cast<std::uint8_t>(std::round(clamped_val));
+  }
+}
+
+void Dequantize(const QuantizationParams& qparams,
+                const std::vector<std::uint8_t>& src, std::vector<float>* dst) {
+  assert(src.size() == dst->size());
+  for (std::size_t i = 0; i < src.size(); i++) {
+    const std::uint8_t quantized_val = src[i];
+    (*dst)[i] = qparams.scale * (quantized_val - qparams.zero_point);
+  }
+}
+
+template <typename tScalar, gemmlowp::MapOrder tOrder>
+class MatrixWithStorage {
+ public:
+  MatrixWithStorage(int rows, int cols)
+      : storage(rows * cols), matrix_map(storage.data(), rows, cols) {}
+  void MakeRandom() {
+    static std::mt19937 random_engine;
+    std::uniform_real_distribution<float> distribution(-1, 1);
+    for (auto& x : storage) {
+      x = static_cast<tScalar>(distribution(random_engine));
+    }
+  }
+  gemmlowp::MatrixMap<const tScalar, tOrder> ConstMap() const {
+    return gemmlowp::MatrixMap<const tScalar, tOrder>(
+        storage.data(), matrix_map.rows(), matrix_map.cols());
+  }
+  gemmlowp::MatrixMap<tScalar, tOrder> Map() {
+    return gemmlowp::MatrixMap<tScalar, tOrder>(
+        storage.data(), matrix_map.rows(), matrix_map.cols());
+  }
+  const std::vector<tScalar>& Storage() const { return storage; }
+  std::vector<tScalar>& Storage() { return storage; }
+
+ private:
+  std::vector<tScalar> storage;
+  gemmlowp::MatrixMap<tScalar, tOrder> matrix_map;
+};
+
+template <typename tScalar, gemmlowp::MapOrder tOrder>
+std::ostream& operator<<(std::ostream& s,
+                         const MatrixWithStorage<tScalar, tOrder>& m) {
+  return s << m.ConstMap();
+}
+
+// Given a real_multiplier in the interval (0, 1),
+// produces a pair (quantized_multiplier, right_shift) where
+// quantized_multiplier is an int32 representing a fixed-point value
+// in the interval [-1, 1)  (in practice we only produce positive values)
+// and right_shift is an amount to shift right by, so that the
+// floating-point multiplication of some int32 input value by real_multiplier,
+//
+//   return static_cast<int32>(int32_value * real_multiplier);
+//
+// is best approximated by the integer-arithmetic-only code
+//
+//   return RoundingRightShift(
+//       FixedPointMultiplication(int32_value, quantized_multiplier),
+//       right_shift);
+//
+// This is how to obtain the fixed-point multiplier and right shift
+// parameters to pass to
+// OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint.
+//
+// Note: all this code only needs to run offline to generate the quantized
+// neural network workload, not at runtime on the
+// device on which quantized neural networks need to run. So it's not
+// performance-critical at all.
+void QuantizeMultiplierSmallerThanOne(float real_multiplier,
+                                      std::int32_t* quantized_multiplier,
+                                      int* right_shift) {
+  assert(real_multiplier > 0.f);
+  assert(real_multiplier < 1.f);
+  int s = 0;
+  // We want to bring the real multiplier into the interval [1/2, 1).
+  // We can do so by multiplying it by two, and recording how many times
+  // we multiplied by two so that we can compensate that by a right
+  // shift by the same amount.
+  while (real_multiplier < 0.5f) {
+    real_multiplier *= 2.0f;
+    s++;
+  }
+  // Now that the real multiplier is in [1/2, 1), we convert it
+  // into a fixed-point number.
+  std::int64_t q =
+      static_cast<std::int64_t>(std::round(real_multiplier * (1ll << 31)));
+  assert(q <= (1ll << 31));
+  // Handle the special case when the real multiplier was so close to 1
+  // that its fixed-point approximation was indistinguishable from 1.
+  // We handle this by dividing it by two, and remembering to decrement
+  // the right shift amount.
+  if (q == (1ll << 31)) {
+    q /= 2;
+    s--;
+  }
+  assert(s >= 0);
+  assert(q <= std::numeric_limits<std::int32_t>::max());
+  *quantized_multiplier = static_cast<std::int32_t>(q);
+  *right_shift = s;
+}
+
+int main() {
+  std::cout.precision(3);
+
+  const int rows = 2;
+  const int depth = 4;
+  const int cols = 3;
+  const auto kOrder = gemmlowp::MapOrder::ColMajor;
+
+  std::cout << "First, let us make some float matrices LHS and RHS, "
+            << "and compute their product.\n"
+            << std::endl;
+  MatrixWithStorage<float, kOrder> float_lhs(rows, depth);
+  float_lhs.MakeRandom();
+  MatrixWithStorage<float, kOrder> float_rhs(depth, cols);
+  float_rhs.MakeRandom();
+  MatrixWithStorage<float, kOrder> reference_float_result(rows, cols);
+  auto reference_float_result_map = reference_float_result.Map();
+  FloatMatrixMultiplication(float_lhs.ConstMap(), float_rhs.ConstMap(),
+                            &reference_float_result_map);
+  std::cout << "Here is the float LHS matrix:\n" << float_lhs << std::endl;
+  std::cout << "Here is the float RHS matrix:\n" << float_rhs << std::endl;
+  std::cout << "Here is the float product (LHS * RHS) matrix obtained by "
+            << "ordinary float matrix multiplication, i.e. as far as we are "
+            << "concerned, the REFERENCE RESULT:\n"
+            << reference_float_result << std::endl;
+
+  std::cout
+      << "Now we embark on reproducing this result using "
+      << "quantized arithmetic. The code below splits into two parts: "
+      << "quantization code that only needs to run offline (e.g. to "
+      << "generate a quantized neural network workload), and actual "
+      << "runtime quantized code, which is typically performance-critical "
+      << "and where we typically do not want to use any floating-point "
+      << "arithmetic. We want to clearly distinguish between the two.\n"
+      << std::endl;
+
+  std::cout << "The below is OFFLINE QUANTIZATION CODE. We still use some "
+            << "floating-point arithmetic in the process of generating the "
+            << "quantized workload to be run on-device.\n"
+            << std::endl;
+
+  std::cout
+      << "Now, let us choose quantization parameters for these matrices. "
+      << "You might ask, what good is quantization if we need to pick "
+      << "quantization parameters for the result before we can run the "
+      << "quantized computation to obtain the result? The idea is that we "
+      << "target applications such as neural networks, where unknown results "
+      << "are only allowed to vary within preexisting bounds. In practice, the "
+      << "bounds for the results are typically learned during the neural "
+         "network "
+      << "training process. The min and max of the result do not have to be "
+      << "exact. If they are too broad, we just get lower quantization "
+         "accuracy. "
+      << "If they are too narrow, we just get clamping at the bounds.\n"
+      << std::endl;
+
+  float lhs_min, lhs_max, rhs_min, rhs_max, result_min, result_max;
+  FindMinMax(float_lhs.Map(), &lhs_min, &lhs_max);
+  FindMinMax(float_rhs.Map(), &rhs_min, &rhs_max);
+  FindMinMax(reference_float_result.Map(), &result_min, &result_max);
+  const auto lhs_qparams = ChooseQuantizationParams(lhs_min, lhs_max);
+  const auto rhs_qparams = ChooseQuantizationParams(rhs_min, rhs_max);
+  const auto result_qparams = ChooseQuantizationParams(result_min, result_max);
+
+  std::cout << "For LHS, we have min = " << lhs_min << ", max = " << lhs_max
+            << ", scale = " << lhs_qparams.scale
+            << ", zero_point = " << static_cast<float>(lhs_qparams.zero_point)
+            << std::endl;
+  std::cout << "For RHS, we have min = " << rhs_min << ", max = " << rhs_max
+            << ", scale = " << rhs_qparams.scale
+            << ", zero_point = " << static_cast<float>(rhs_qparams.zero_point)
+            << std::endl;
+  std::cout << "For the result, we have min = " << result_min
+            << ", max = " << result_max << ", scale = " << result_qparams.scale
+            << ", zero_point = "
+            << static_cast<float>(result_qparams.zero_point) << std::endl;
+
+  std::cout << std::endl;
+
+  MatrixWithStorage<std::uint8_t, kOrder> uint8_lhs(rows, depth);
+  MatrixWithStorage<std::uint8_t, kOrder> uint8_rhs(depth, cols);
+  MatrixWithStorage<std::uint8_t, kOrder> actual_uint8_result(rows, cols);
+
+  Quantize(lhs_qparams, float_lhs.Storage(), &uint8_lhs.Storage());
+  Quantize(rhs_qparams, float_rhs.Storage(), &uint8_rhs.Storage());
+
+  std::cout << "Quantized uint8 LHS matrix:\n" << uint8_lhs << std::endl;
+  std::cout << "Quantized uint8 RHS matrix:\n" << uint8_rhs << std::endl;
+
+  const int lhs_offset = -lhs_qparams.zero_point;
+  const int rhs_offset = -rhs_qparams.zero_point;
+  const int result_offset = result_qparams.zero_point;
+
+  const float real_multiplier =
+      lhs_qparams.scale * rhs_qparams.scale / result_qparams.scale;
+  std::int32_t quantized_multiplier;
+  int right_shift;
+  QuantizeMultiplierSmallerThanOne(real_multiplier, &quantized_multiplier,
+                                   &right_shift);
+
+  std::cout << "End of OFFLINE QUANTIZATION CODE.\n" << std::endl;
+
+  std::cout << "The below is ON-DEVICE RUNTIME QUANTIZED CODE. "
+            << "This is the part that is performance-critical and may only "
+            << "use quantized arithmetic.\n"
+            << std::endl;
+
+  gemmlowp::OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint
+      quantize_down_stage;
+  quantize_down_stage.result_offset_after_shift = result_offset;
+  quantize_down_stage.result_fixedpoint_multiplier = quantized_multiplier;
+  quantize_down_stage.result_shift = right_shift;
+  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
+  const auto& output_pipeline =
+      std::make_tuple(quantize_down_stage, saturating_cast_stage);
+
+  auto actual_uint8_result_map = actual_uint8_result.Map();
+  gemmlowp::GemmContext gemm_context;
+  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
+                                   gemmlowp::DefaultL8R8BitDepthParams>(
+      &gemm_context, uint8_lhs.ConstMap(), uint8_rhs.ConstMap(),
+      &actual_uint8_result_map, lhs_offset, rhs_offset, output_pipeline);
+
+  std::cout << "Quantized uint8 result matrix obtained by quantized "
+            << "multiplication:\n"
+            << actual_uint8_result << std::endl;
+
+  std::cout << "End of ON-DEVICE RUNTIME QUANTIZED CODE.\n" << std::endl;
+
+  MatrixWithStorage<float, kOrder> actual_float_result(rows, cols);
+  Dequantize(result_qparams, actual_uint8_result.Storage(),
+             &actual_float_result.Storage());
+  std::cout
+      << "Here is the actual float product (LHS * RHS) matrix obtained by "
+      << "dequantizing the above uint8 result, i.e. "
+      << "as far as we are concerned, the ACTUAL RESULT:\n"
+      << actual_float_result << std::endl;
+
+  MatrixWithStorage<float, kOrder> diff_float_result(rows, cols);
+  for (int i = 0; i < rows; i++) {
+    for (int j = 0; j < cols; j++) {
+      diff_float_result.Map()(i, j) =
+          actual_float_result.Map()(i, j) - reference_float_result.Map()(i, j);
+    }
+  }
+
+  std::cout << "Difference between ACTUAL and REFERENCE float results:\n"
+            << diff_float_result << std::endl;
+}
\ No newline at end of file
diff --git a/eight_bit_int_gemm/eight_bit_int_gemm.cc b/eight_bit_int_gemm/eight_bit_int_gemm.cc
index ecea180..8113bf3 100644
--- a/eight_bit_int_gemm/eight_bit_int_gemm.cc
+++ b/eight_bit_int_gemm/eight_bit_int_gemm.cc
@@ -12,6 +12,9 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 
+#ifndef GEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK
+#define GEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK
+#endif
 #include "eight_bit_int_gemm.h"
 
 #include <memory>
@@ -30,8 +33,14 @@
 // is quite significant (approx. 200kb) which might be prohibitive in
 // low-memory situations.
 
-#if defined(GEMMLOWP_USE_META_FASTPATH) && defined(GEMMLOWP_NEON_32)
-#include "../meta/multi_thread_gemm.h"
+#if defined(GEMMLOWP_USE_META_FASTPATH) && defined(GEMMLOWP_NEON)
+#include "../meta/legacy_multi_thread_gemm.h"
+#else
+
+#if defined(GEMMLOWP_USE_META_FASTPATH)
+#warning "META fast path turned on without NEON!"
+#endif
+
 #endif
 
 namespace gemmlowp {
@@ -136,25 +145,31 @@
 
 class Scratch {
  public:
-  Scratch() : buffer_(), size_(0) {}
+  Scratch() : buffer_(), buffer_32_(nullptr), size_(0) {}
 
   void AssureSize(std::int32_t required_size) {
     if (size_ >= required_size) {
       return;
     }
-    buffer_.reset(new std::uint8_t[required_size]);
+    buffer_.reset(new std::uint8_t[required_size + 32]);
+    buffer_32_ =
+        buffer_.get() +
+        ((32 - (reinterpret_cast<uintptr_t>(buffer_.get()) % 32)) % 32);
+    assert((reinterpret_cast<uintptr_t>(buffer_32_) % 32) == 0);
     size_ = required_size;
   }
 
   void Clear() {
     buffer_.reset(nullptr);
+    buffer_32_ = nullptr;
     size_ = 0;
   }
 
-  std::uint8_t* buffer() { return buffer_.get(); }
+  std::uint8_t* buffer() { return buffer_32_; }
 
  private:
   std::unique_ptr<std::uint8_t[]> buffer_;
+  std::uint8_t* buffer_32_;
   std::int32_t size_;
 };
 
@@ -172,7 +187,7 @@
   global_scratch = nullptr;
 }
 
-#if defined(GEMMLOWP_USE_META_FASTPATH) && defined(GEMMLOWP_NEON_32)
+#if defined(GEMMLOWP_USE_META_FASTPATH) && defined(GEMMLOWP_NEON)
 
 bool IsRowMajorOrVector(bool transpose, int stride, int rows, int cols) {
   // Is it row major and nicely packed?
@@ -205,8 +220,8 @@
 bool CanHandleMetaFastpath(bool transpose_a, bool transpose_b, bool transpose_c,
                            int m, int n, int k, int lda, int ldb, int ldc,
                            BitDepthSetting depth_setting) {
-  // Meta fastpath only supports 8bit x 8bit and k up to 2048.
-  if (depth_setting != BitDepthSetting::A8B8 || k > 2048) {
+  // Meta fastpath only supports 8bit x 8bit and k between 8 and 2048.
+  if (depth_setting != BitDepthSetting::A8B8 || k < 8 || k > 2048) {
     return false;
   }
 
@@ -242,20 +257,19 @@
                            std::int32_t shift, bool result_transpose,
                            std::int32_t result_stride, std::uint8_t* result) {
   Scratch* scratch = GetOrCreateGlobalScratch();
+  const std::int32_t max_num_threads = context->max_num_threads();
   if (IsRowMajorOrVector(result_transpose, result_stride, m, n)) {
-    scratch->AssureSize(
-        meta::gemm_q8_scratch(m, n, k, context->max_num_threads()));
-    meta::multi_thread_gemm_q8(
-        context->workers_pool(), context->max_num_threads(), scratch->buffer(),
-        lhs, rhs, m, n, k, lhs_offset, rhs_offset, sum_offset,
-        multiplicative_offset, shift, result);
+    scratch->AssureSize(meta::gemm_q8_scratch(m, n, k, max_num_threads));
+    meta::multi_thread_gemm_q8(context->workers_pool(), max_num_threads,
+                               scratch->buffer(), lhs, rhs, m, n, k, lhs_offset,
+                               rhs_offset, sum_offset, multiplicative_offset,
+                               shift, result);
   } else {
-    scratch->AssureSize(
-        meta::gemm_q8_scratch(n, m, k, context->max_num_threads()));
-    meta::multi_thread_gemm_q8(
-        context->workers_pool(), context->max_num_threads(), scratch->buffer(),
-        rhs, lhs, n, m, k, rhs_offset, lhs_offset, sum_offset,
-        multiplicative_offset, shift, result);
+    scratch->AssureSize(meta::gemm_q8_scratch(n, m, k, max_num_threads));
+    meta::multi_thread_gemm_q8(context->workers_pool(), max_num_threads,
+                               scratch->buffer(), rhs, lhs, n, m, k, rhs_offset,
+                               lhs_offset, sum_offset, multiplicative_offset,
+                               shift, result);
   }
 }
 
@@ -267,18 +281,17 @@
                    float result_offset, bool result_transpose,
                    std::int32_t result_stride, float* result) {
   Scratch* scratch = GetOrCreateGlobalScratch();
+  const std::int32_t max_num_threads = context->max_num_threads();
   if (IsRowMajorOrVector(result_transpose, result_stride, m, n)) {
-    scratch->AssureSize(
-        meta::gemm_f_scratch(m, n, k, context->max_num_threads()));
-    meta::multi_thread_gemm_f(
-        context->workers_pool(), context->max_num_threads(), scratch->buffer(),
-        lhs, rhs, m, n, k, lhs_offset, rhs_offset, result_offset, result);
+    scratch->AssureSize(meta::gemm_f_scratch(m, n, k, max_num_threads));
+    meta::multi_thread_gemm_f(context->workers_pool(), max_num_threads,
+                              scratch->buffer(), lhs, rhs, m, n, k, lhs_offset,
+                              rhs_offset, result_offset, result);
   } else {
-    scratch->AssureSize(
-        meta::gemm_f_scratch(n, m, k, context->max_num_threads()));
-    meta::multi_thread_gemm_f(
-        context->workers_pool(), context->max_num_threads(), scratch->buffer(),
-        rhs, lhs, n, m, k, rhs_offset, lhs_offset, result_offset, result);
+    scratch->AssureSize(meta::gemm_f_scratch(n, m, k, max_num_threads));
+    meta::multi_thread_gemm_f(context->workers_pool(), max_num_threads,
+                              scratch->buffer(), rhs, lhs, n, m, k, rhs_offset,
+                              lhs_offset, result_offset, result);
   }
 }
 
@@ -297,7 +310,7 @@
   AutoGlobalLock<EightBitIntGemmLockId> lock;
   GemmContext* context = GetOrCreateGlobalContext();
 
-#if defined(GEMMLOWP_USE_META_FASTPATH) && defined(GEMMLOWP_NEON_32)
+#if defined(GEMMLOWP_USE_META_FASTPATH) && defined(GEMMLOWP_NEON)
   if (CanHandleMetaFastpath(transpose_a, transpose_b, transpose_c, m, n, k, lda,
                             ldb, ldc, bit_depth)) {
     MetaGemmQuantized8Bit(context, a, b, m, n, k, a_offset, b_offset, c_offset,
@@ -334,7 +347,7 @@
   AutoGlobalLock<EightBitIntGemmLockId> lock;
   GemmContext* context = GetOrCreateGlobalContext();
 
-#if defined(GEMMLOWP_USE_META_FASTPATH) && defined(GEMMLOWP_NEON_32)
+#if defined(GEMMLOWP_USE_META_FASTPATH) && defined(GEMMLOWP_NEON)
   if (CanHandleMetaFastpath(transpose_a, transpose_b, transpose_c, m, n, k, lda,
                             ldb, ldc, bit_depth)) {
     MetaGemmFloat(context, a, b, m, n, k, a_offset, b_offset, c_offset,
diff --git a/eight_bit_int_gemm/eight_bit_int_gemm.h b/eight_bit_int_gemm/eight_bit_int_gemm.h
index 6bd9dfe..6bda427 100644
--- a/eight_bit_int_gemm/eight_bit_int_gemm.h
+++ b/eight_bit_int_gemm/eight_bit_int_gemm.h
@@ -24,8 +24,6 @@
 namespace std {
 using ::uint8_t;
 using ::int32_t;
-using ::int64_t;
-using ::uint64_t;
 }
 #endif
 
@@ -46,7 +44,7 @@
 // Users who prefer a state-less, singleton-less interface,
 // should use the main gemmlowp interface (public/gemmlowp.h) instead.
 
-// The main entry point to compute a Gemm. This is the standard
+// The BitDepthSetting enum lists supported a/b bit-depth combinations.
 enum class BitDepthSetting {
   A8B8,  // 8-bit a, 8-bit b
   A5B7   // 5-bit a, 7-bit b
diff --git a/fixedpoint/fixedpoint.h b/fixedpoint/fixedpoint.h
new file mode 100644
index 0000000..e21337f
--- /dev/null
+++ b/fixedpoint/fixedpoint.h
@@ -0,0 +1,779 @@
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// fixedpoint.h: fixed-point arithmetic, with basic operations and
+// a few math functions such as tanh.
+
+#ifndef GEMMLOWP_INTERNAL_FIXEDPOINT_H_
+#define GEMMLOWP_INTERNAL_FIXEDPOINT_H_
+
+#include <cassert>
+#include <limits>
+
+#include "../internal/common.h"
+
+namespace gemmlowp {
+
+// Part 1: Low-level integer-arithmetic primitives.
+// The implementations here are generic implementations valid for
+// scalar types (e.g. std::int32_t). Architecture-specific SIMD types
+// (e.g. NEON int32x4_t) may be supported by providing
+// specializations for them in separate files.
+//
+// The purpose of these primitives is two-fold:
+//  - They will be used to implement higher-level fixed-point
+//    abstractions, namely the FixedPoint class and its arithmetic
+//    operators.
+//  - They will be directly used to implement some more involved
+//    fixed-point computations, e.g. the fixed-point implementation
+//    of math functions such as tanh.
+
+// Some compile-time traits around raw types to handle SIMD aspects:
+// number of lanes, underlying scalar type.
+template <typename tIntegerType>
+struct FixedPointRawTypeTraits {};
+
+template <>
+struct FixedPointRawTypeTraits<std::int32_t> {
+  typedef std::int32_t ScalarRawType;
+  static const int kLanes = 1;
+};
+
+// Returns a SIMD value duplicating a scalar value across all lanes.
+template <typename tRawType>
+tRawType Dup(typename FixedPointRawTypeTraits<tRawType>::ScalarRawType x) {
+  return x;
+}
+
+// Plain bit-wise AND
+template <typename tIntegerType>
+tIntegerType BitAnd(tIntegerType a, tIntegerType b) {
+  return a & b;
+}
+
+// Plain bit-wise OR
+template <typename tIntegerType>
+tIntegerType BitOr(tIntegerType a, tIntegerType b) {
+  return a | b;
+}
+
+// Plain bit-wise XOR
+template <typename tIntegerType>
+tIntegerType BitXor(tIntegerType a, tIntegerType b) {
+  return a ^ b;
+}
+
+// Plain bit-wise NOT
+template <typename tIntegerType>
+tIntegerType BitNot(tIntegerType a) {
+  return ~a;
+}
+
+// Integer addition. Not saturating. Overflow is undefined behavior.
+template <typename tIntegerType>
+tIntegerType Add(tIntegerType a, tIntegerType b) {
+  return a + b;
+}
+
+// Integer multiplication. Not saturating. Overflow is undefined behavior.
+template <typename tIntegerType>
+tIntegerType Mul(tIntegerType a, tIntegerType b) {
+  return a * b;
+}
+
+// Integer subtraction. Not saturating. Overflow is undefined behavior.
+template <typename tIntegerType>
+tIntegerType Sub(tIntegerType a, tIntegerType b) {
+  return a - b;
+}
+
+// Integer unary negative. Not saturating. Overflow is undefined behavior.
+template <typename tIntegerType>
+tIntegerType Neg(tIntegerType a) {
+  return -a;
+}
+
+// Integer arithmetic left-shift, equivalent to multiplying with a
+// power of two. Not saturating. Overflow is undefined behavior.
+template <typename tIntegerType>
+tIntegerType ShiftLeft(tIntegerType a, int offset) {
+  return a << offset;
+}
+
+// Integer arithmetic right-shift. Not rounding.
+// Relying on implementation-defined, but in-practice-consistent,
+// C++ compiler behavior.
+template <typename tIntegerType>
+tIntegerType ShiftRight(tIntegerType a, int offset) {
+  return a >> offset;
+}
+
+// Each bit of the result is set to the corresponding bit of either then_val or
+// else_val depending on whether the corresponding bit of if_mask is set.
+// Equivalent to the VBSL instruction in ARM NEON.
+template <typename tIntegerType>
+tIntegerType SelectUsingMask(tIntegerType if_mask, tIntegerType then_val,
+                             tIntegerType else_val) {
+  return BitXor(BitAnd(if_mask, then_val), BitAnd(BitNot(if_mask), else_val));
+}
+
+// For each input scalar, the corresponding bits of the result are set if the
+// input scalar is non-zero.
+template <typename tIntegerType>
+tIntegerType MaskIfNonZero(tIntegerType a) {
+  static const tIntegerType zero = 0;
+  return a ? BitNot(zero) : zero;
+}
+
+// For each input scalar, the corresponding bits of the result are set if the
+// input scalar is zero.
+template <typename tIntegerType>
+tIntegerType MaskIfZero(tIntegerType a) {
+  return MaskIfNonZero<tIntegerType>(!a);
+}
+
+// For each pair of input scalars, the corresponding bits of the result are
+// set if the input scalars are equal.
+template <typename tIntegerType>
+tIntegerType MaskIfEqual(tIntegerType a, tIntegerType b) {
+  return MaskIfNonZero<tIntegerType>(a == b);
+}
+
+// For each pair of input scalars, the corresponding bits of the result are
+// set if the input scalars are not equal.
+template <typename tIntegerType>
+tIntegerType MaskIfNotEqual(tIntegerType a, tIntegerType b) {
+  return MaskIfNonZero<tIntegerType>(a != b);
+}
+
+// For each pair of input scalars, the corresponding bits of the result are
+// set if the input scalars a, b satisfy a > b.
+template <typename tIntegerType>
+tIntegerType MaskIfGreaterThan(tIntegerType a, tIntegerType b) {
+  return MaskIfNonZero<tIntegerType>(a > b);
+}
+
+// For each pair of input scalars, the corresponding bits of the result are
+// set if the input scalars a, b satisfy a >= b.
+template <typename tIntegerType>
+tIntegerType MaskIfGreaterThanOrEqual(tIntegerType a, tIntegerType b) {
+  return MaskIfNonZero<tIntegerType>(a >= b);
+}
+
+// For each pair of input scalars, the corresponding bits of the result are
+// set if the input scalars a, b satisfy a < b.
+template <typename tIntegerType>
+tIntegerType MaskIfLessThan(tIntegerType a, tIntegerType b) {
+  return MaskIfNonZero<tIntegerType>(a < b);
+}
+
+// For each pair of input scalars, the corresponding bits of the result are
+// set if the input scalars a, b satisfy a <= b.
+template <typename tIntegerType>
+tIntegerType MaskIfLessThanOrEqual(tIntegerType a, tIntegerType b) {
+  return MaskIfNonZero<tIntegerType>(a <= b);
+}
+
+// Returns true if all of the input scalars are nonzero.
+// This function may currently assume that each of the input scalars has either
+// all or none of its bits set. Otherwise, its behavior is currently undefined.
+template <typename tIntegerType>
+bool All(tIntegerType a) {
+  return a;
+}
+
+// Returns true if any of the input scalars are nonzero.
+// This function may currently assume that each of the input scalars has either
+// all or none of its bits set. Otherwise, its behavior is currently undefined.
+template <typename tIntegerType>
+bool Any(tIntegerType a) {
+  return a;
+}
+
+// Returns (a+b)/2, rounded to the nearest integer.
+// Equivalent to VRHADD in the ARM NEON instruction set.
+template <typename IntegerType>
+IntegerType RoundingHalfSum(IntegerType a, IntegerType b) {
+  static_assert(std::is_same<IntegerType, void>::value, "unimplemented");
+  return a;
+}
+
+template <>
+inline std::int32_t RoundingHalfSum(std::int32_t a, std::int32_t b) {
+  std::int64_t a64 = a;
+  std::int64_t b64 = b;
+  std::int64_t sum = a64 + b64;
+  std::int64_t sign = sum >= 0 ? 1 : -1;
+  return static_cast<std::int32_t>((sum + sign) / 2);
+}
+
+// Returns the integer that represents the product of two fixed-point
+// numbers, interpreting all integers as fixed-point values in the
+// interval [-1, 1), rounding to the nearest value, and saturating
+// -1 * -1 to the maximum value (since 1 is not in the half-open
+// interval [-1, 1)).
+//
+// [The explanation below specializes to std::int32_t for example purpose.]
+//
+// The mapping between IntegerType and the interval [-1, 1) is unique and
+// implied by IntegerType, which is assumed to be signed. For example,
+// for IntegerType==std::int32_t, the mapping is
+//   real_value = integer_value / 2^31.
+// So in this case, and leaving aside rounding and saturating, this
+// function computes ((a / 2^31) * (b / 2^31)) * 2^31, which simplifies to
+//   (a * b) / 2^31.
+//
+// The 'doubling' part in the name of this function comes from the fact that
+// this operation is very close to a "multiply-high" operation, keeping only
+// the top half bits, except that that would be effectively computing
+//   (a * b) / 2^32,
+// so here we are computing 2x that, since
+//   1/2^31 = 2 * 1/2^32.
+// The idea is to use all of the available 32 bits in the destination int32
+// value.
+//
+// [End of the explanation specializing to int32.]
+//
+// This is equivalent to the VQRDMULH instruction in ARM NEON.
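+//
+// For example, in the std::int32_t specialization below, with
+// a == b == 1 << 30 (each representing the real value 0.5), the 64-bit product
+// is 1 << 60 and the returned value is 1 << 29, which represents
+// 0.5 * 0.5 = 0.25 as expected.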
+template <typename IntegerType>
+IntegerType SaturatingRoundingDoublingHighMul(IntegerType a, IntegerType b) {
+  static_assert(std::is_same<IntegerType, void>::value, "unimplemented");
+  return a;
+}
+
+// This function implements the same computation as the ARMv7 NEON VQRDMULH
+// instruction.
+template <>
+inline std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a,
+                                                      std::int32_t b) {
+  bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
+  std::int64_t a_64(a);
+  std::int64_t b_64(b);
+  std::int64_t ab_64 = a_64 * b_64;
+  std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
+  std::int32_t ab_x2_high32 =
+      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
+  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
+}
+
+// Correctly-rounded-to-nearest division by a power-of-two.
+// Also known as a rounding arithmetic right shift.
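+// For example, with std::int32_t inputs: x = 12, exponent = 2 gives 3;
+// x = 3, exponent = 1 gives 2; and x = -3, exponent = 1 gives -2 (exact halves
+// are rounded away from zero).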
+template <typename IntegerType>
+inline IntegerType RoundingDivideByPOT(IntegerType x, int exponent) {
+  using ScalarIntegerType =
+      typename FixedPointRawTypeTraits<IntegerType>::ScalarRawType;
+  static_assert(std::is_same<ScalarIntegerType, std::int32_t>::value,
+                "Currently only supporting int32 scalar and SIMD types");
+  assert(exponent >= 0);
+  assert(exponent <= 31);
+  const IntegerType mask = Dup<IntegerType>((1ll << exponent) - 1);
+  const IntegerType zero = Dup<IntegerType>(0);
+  const IntegerType one = Dup<IntegerType>(1);
+  const IntegerType remainder = BitAnd(x, mask);
+  const IntegerType threshold =
+      Add(ShiftRight(mask, 1), BitAnd(MaskIfLessThan(x, zero), one));
+  return Add(ShiftRight(x, exponent),
+             BitAnd(MaskIfGreaterThan(remainder, threshold), one));
+}
+
+// Returns the product of a run-time integer value by a compile-time power
+// of two, with either a positive exponent (equivalent to an arithmetic
+// left shift, saturating) or a negative exponent (equivalent to an arithmetic
+// right shift, rounding to nearest).
+template <int Exponent, typename IntegerType,
+          int ExponentSign = (Exponent > 0 ? 1 : Exponent < 0 ? -1 : 0)>
+struct ImplSaturatingRoundingMultiplyByPOT {};
+
+template <int Exponent, typename IntegerType>
+struct ImplSaturatingRoundingMultiplyByPOT<Exponent, IntegerType, 0> {
+  static IntegerType eval(IntegerType x) { return x; }
+};
+
+template <int Exponent, typename IntegerType>
+struct ImplSaturatingRoundingMultiplyByPOT<Exponent, IntegerType, 1> {
+  static IntegerType eval(IntegerType x) {
+    using ScalarIntegerType =
+        typename FixedPointRawTypeTraits<IntegerType>::ScalarRawType;
+    static_assert(std::is_same<ScalarIntegerType, std::int32_t>::value,
+                  "Currently only supporting int32 scalar and SIMD types");
+    const IntegerType min =
+        Dup<IntegerType>(std::numeric_limits<std::int32_t>::min());
+    const IntegerType max =
+        Dup<IntegerType>(std::numeric_limits<std::int32_t>::max());
+
+    const std::int32_t threshold = ((1 << (31 - Exponent)) - 1);
+    const IntegerType positive_mask =
+        MaskIfGreaterThan(x, Dup<IntegerType>(threshold));
+    const IntegerType negative_mask =
+        MaskIfLessThan(x, Dup<IntegerType>(-threshold));
+
+    IntegerType result = ShiftLeft(x, Exponent);
+    result = SelectUsingMask(positive_mask, max, result);
+    result = SelectUsingMask(negative_mask, min, result);
+    return result;
+  }
+};
+
+template <int Exponent, typename IntegerType>
+struct ImplSaturatingRoundingMultiplyByPOT<Exponent, IntegerType, -1> {
+  static IntegerType eval(IntegerType x) {
+    return RoundingDivideByPOT<IntegerType>(x, -Exponent);
+  }
+};
+
+template <int Exponent, typename IntegerType>
+IntegerType SaturatingRoundingMultiplyByPOT(IntegerType x) {
+  return ImplSaturatingRoundingMultiplyByPOT<Exponent, IntegerType>::eval(x);
+}
+
+// Part 2: the FixedPoint class.
+
+// A FixedPoint object represents a fixed-point value stored in the underlying
+// integer type tRawType, if tRawType is a plain scalar integer type.
+// Alternatively, tRawType may be a SIMD type (e.g. NEON int32x4_t) in which
+// case a FixedPoint object represents a corresponding SIMD vector of fixed
+// point values.
+//
+// tIntegerBits describes the range of the fixed-point format: if
+// tIntegerBits == m then the range of representable values is the half-open
+// interval [-2^m; 2^m) where the open boundary on the right side means that
+// 2^m is not representable (how close the maximum representable value is to
+// it, depends on bit-depth of tRawType).
+//
+// In "Q format notation",
+//   https://en.wikipedia.org/wiki/Q_(number_format)
+// we are describing the format
+//   Qm.n
+// where
+//   m = tIntegerBits
+// and
+//   n = NumberOfBits(tRawType) - (m + 1)
+// Note that the (m + 1) in the above line is because we adopt the convention
+// that we count the integer bits exclusive of the sign bit; so (m + 1) is
+// the total number of integer bits inclusive of the sign bit.
+//
+// Accordingly, the number of integral representable values in our range
+//   [-2^m ; 2^m)
+// is equal to 2^(m+1).
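+//
+// For example, FixedPoint<std::int32_t, 0> is the Q0.31 format, representing
+// the interval [-1, 1): its raw value (1 << 30) represents the real value 0.5.
+// Likewise, FixedPoint<std::int32_t, 2> is Q2.29, representing [-4, 4), and
+// its One() has raw value (1 << 29).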
+template <typename tRawType, int tIntegerBits>
+class FixedPoint {
+ public:
+  typedef tRawType RawType;
+
+  typedef FixedPointRawTypeTraits<RawType> RawTypeTraits;
+  typedef typename RawTypeTraits::ScalarRawType ScalarRawType;
+
+  static const int kTotalBits = 8 * sizeof(ScalarRawType);
+  static const int kIntegerBits = tIntegerBits;
+  static const int kFractionalBits = kTotalBits - 1 - kIntegerBits;
+  static_assert(kIntegerBits >= 0 && kIntegerBits < kTotalBits,
+                "bad IntegerBits");
+
+  typedef FixedPoint<ScalarRawType, kIntegerBits> ScalarFixedPointType;
+
+  static const ScalarRawType ScalarRawMin() {
+    return std::numeric_limits<ScalarRawType>::min();
+  }
+
+  static const ScalarRawType ScalarRawMax() {
+    return std::numeric_limits<ScalarRawType>::max();
+  }
+
+  static const ScalarRawType RawMin() {
+    return VectorFromScalar(ScalarRawMin());
+  }
+
+  static const ScalarRawType RawMax() {
+    return VectorFromScalar(ScalarRawMax());
+  }
+
+  static FixedPoint FromRaw(RawType x) {
+    FixedPoint retval;
+    retval.raw() = x;
+    return retval;
+  }
+
+  static FixedPoint FromScalarRaw(ScalarRawType x) {
+    FixedPoint retval;
+    retval.raw() = Dup<RawType>(x);
+    return retval;
+  }
+
+  static FixedPoint FromScalarFixedPoint(ScalarFixedPointType x) {
+    return FromScalarRaw(x.raw());
+  }
+
+  template <int Exponent>
+  static FixedPoint ConstantPOT() {
+    static const int kOffset = kFractionalBits + Exponent;
+    static_assert(
+        kOffset < 31,
+        "Constant not exactly representable in this fixed-point format");
+    return FromScalarRaw(ScalarRawType(1) << kOffset);
+  }
+
+  static FixedPoint Zero() { return FromScalarRaw(0); }
+
+  static FixedPoint One() {
+    return FromScalarRaw(kIntegerBits == 0
+                             ? ScalarRawMax()
+                             : (ScalarRawType(1) << kFractionalBits));
+  }
+
+  static FixedPoint FromDouble(double x) {
+    const double min_bound = static_cast<double>(ScalarRawMin());
+    const double max_bound = static_cast<double>(ScalarRawMax());
+    return FromScalarRaw(static_cast<std::int32_t>(std::min(
+        std::max(round(x * static_cast<double>(1ll << kFractionalBits)),
+                 min_bound),
+        max_bound)));
+  }
+
+  RawType raw() const { return i_; }
+  RawType& raw() { return i_; }
+
+ private:
+  RawType i_;
+};
+
+// Part 3: implementation of arithmetic operators for the
+// FixedPoint class, and a few related functions.
+
+// A FixedPoint multiplication is just a
+// SaturatingRoundingDoublingHighMul operation on the underlying
+// raw integer values. The IntegerBits simply add up, as is obvious
+// from the fact that the range is [-2^IntegerBits, 2^IntegerBits).
+template <typename tRawType, int tIntegerBits_a, int tIntegerBits_b>
+FixedPoint<tRawType, tIntegerBits_a + tIntegerBits_b> operator*(
+    FixedPoint<tRawType, tIntegerBits_a> a,
+    FixedPoint<tRawType, tIntegerBits_b> b) {
+  FixedPoint<tRawType, tIntegerBits_a + tIntegerBits_b> c;
+  c.raw() = SaturatingRoundingDoublingHighMul(a.raw(), b.raw());
+  return c;
+}
+
+// Tweaking IntegerBits gives exact multiplication by a power of two.
+template <int tExponent, typename tRawType, int tIntegerBits>
+FixedPoint<tRawType, tExponent + tIntegerBits> ExactMulByPot(
+    FixedPoint<tRawType, tIntegerBits> a) {
+  FixedPoint<tRawType, tExponent + tIntegerBits> c;
+  c.raw() = a.raw();
+  return c;
+}
+
+// If we want to leave IntegerBits fixed, then multiplication
+// by a power of two has to be saturating/rounding, not exact anymore.
+template <int tExponent, typename tRawType, int tIntegerBits>
+FixedPoint<tRawType, tIntegerBits> SaturatingRoundingMultiplyByPOT(
+    FixedPoint<tRawType, tIntegerBits> a) {
+  return FixedPoint<tRawType, tIntegerBits>::FromRaw(
+      SaturatingRoundingMultiplyByPOT<tExponent>(a.raw()));
+}
+
+// Generic arithmetic operators.
+
+#define MAKE_FIXEDPOINT_UNARY_FUNC(FuncName, ImplFuncName)                     \
+  template <typename tRawType, int tIntegerBits>                               \
+  FixedPoint<tRawType, tIntegerBits> FuncName(                                 \
+      FixedPoint<tRawType, tIntegerBits> a) {                                  \
+    return FixedPoint<tRawType, tIntegerBits>::FromRaw(ImplFuncName(a.raw())); \
+  }
+
+#define MAKE_FIXEDPOINT_BINARY_FUNC(FuncName, ImplFuncName) \
+  template <typename tRawType, int tIntegerBits>            \
+  FixedPoint<tRawType, tIntegerBits> FuncName(              \
+      FixedPoint<tRawType, tIntegerBits> a,                 \
+      FixedPoint<tRawType, tIntegerBits> b) {               \
+    return FixedPoint<tRawType, tIntegerBits>::FromRaw(     \
+        ImplFuncName(a.raw(), b.raw()));                    \
+  }
+
+MAKE_FIXEDPOINT_UNARY_FUNC(operator-, Neg)
+MAKE_FIXEDPOINT_UNARY_FUNC(operator~, BitNot)
+MAKE_FIXEDPOINT_BINARY_FUNC(operator+, Add)
+MAKE_FIXEDPOINT_BINARY_FUNC(operator-, Sub)
+MAKE_FIXEDPOINT_BINARY_FUNC(operator&, BitAnd)
+MAKE_FIXEDPOINT_BINARY_FUNC(operator^, BitXor)
+MAKE_FIXEDPOINT_BINARY_FUNC(operator|, BitOr)
+MAKE_FIXEDPOINT_BINARY_FUNC(RoundingHalfSum, RoundingHalfSum)
+
+#undef MAKE_FIXEDPOINT_UNARY_FUNC
+#undef MAKE_FIXEDPOINT_BINARY_FUNC
+
+#define MAKE_FIXEDPOINT_UNARY_FUNC_RETURNING_RAW(FuncName)  \
+  template <typename tRawType, int tIntegerBits>            \
+  tRawType FuncName(FixedPoint<tRawType, tIntegerBits> a) { \
+    return FuncName(a.raw());                               \
+  }
+
+#define MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(FuncName) \
+  template <typename tRawType, int tIntegerBits>            \
+  tRawType FuncName(FixedPoint<tRawType, tIntegerBits> a,   \
+                    FixedPoint<tRawType, tIntegerBits> b) { \
+    return FuncName(a.raw(), b.raw());                      \
+  }
+
+MAKE_FIXEDPOINT_UNARY_FUNC_RETURNING_RAW(MaskIfZero)
+MAKE_FIXEDPOINT_UNARY_FUNC_RETURNING_RAW(MaskIfNonZero)
+MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(MaskIfEqual)
+MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(MaskIfNotEqual)
+MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(MaskIfGreaterThan)
+MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(MaskIfGreaterThanOrEqual)
+MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(MaskIfLessThan)
+MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW(MaskIfLessThanOrEqual)
+
+#undef MAKE_FIXEDPOINT_UNARY_FUNC_RETURNING_RAW
+#undef MAKE_FIXEDPOINT_BINARY_FUNC_RETURNING_RAW
+
+template <typename tRawType, int tIntegerBits>
+FixedPoint<tRawType, tIntegerBits> SelectUsingMask(
+    tRawType if_mask, FixedPoint<tRawType, tIntegerBits> then_val,
+    FixedPoint<tRawType, tIntegerBits> else_val) {
+  return FixedPoint<tRawType, tIntegerBits>::FromRaw(
+      SelectUsingMask(if_mask, then_val.raw(), else_val.raw()));
+}
+
+template <typename tRawType, int tIntegerBits>
+bool operator==(FixedPoint<tRawType, tIntegerBits> a,
+                FixedPoint<tRawType, tIntegerBits> b) {
+  return All(MaskIfEqual(a.raw(), b.raw()));
+}
+
+template <typename tRawType, int tIntegerBits>
+bool operator!=(FixedPoint<tRawType, tIntegerBits> a,
+                FixedPoint<tRawType, tIntegerBits> b) {
+  return !(a == b);
+}
+
+// Conversion to floating-point.
+template <typename tRawType, int tIntegerBits>
+double ToDouble(FixedPoint<tRawType, tIntegerBits> x) {
+  static_assert(FixedPointRawTypeTraits<tRawType>::kLanes == 1,
+                "not applicable to SIMD types");
+  typedef FixedPoint<tRawType, tIntegerBits> F;
+  return x.raw() / static_cast<double>(1ll << F::kFractionalBits);
+}
+
+// Rescale changes the number of IntegerBits and updates the underlying
+// raw integer value accordingly.
+template <int tIntegerBitsDst, typename tRawType, int tIntegerBitsSrc>
+FixedPoint<tRawType, tIntegerBitsDst> Rescale(
+    FixedPoint<tRawType, tIntegerBitsSrc> x) {
+  static const int kExponent = tIntegerBitsSrc - tIntegerBitsDst;
+  FixedPoint<tRawType, tIntegerBitsDst> result;
+  result.raw() = SaturatingRoundingMultiplyByPOT<kExponent>(x.raw());
+  return result;
+}
+
+// CheckedFixedPointConstant allows specifying fixed-point constants
+// initialized as real numbers, in a way that does not compile floating-point
+// arithmetic in production code, yet still checks agreement with the
+// floating-point expressions when asserts are enabled.
+#ifdef GEMMLOWP_ENABLE_FIXEDPOINT_CONSTANTS_CHECKS
+template <typename FixedPointType>
+FixedPointType CheckedFixedPointConstant(
+    typename FixedPointType::ScalarRawType raw_value, double double_value) {
+  typedef typename FixedPointType::RawType RawType;
+  const FixedPointType result = FixedPointType::FromScalarRaw(raw_value);
+  assert(result == FixedPointType::FromDouble(double_value));
+  return result;
+}
+#define GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(FixedPointType, ScalarRawValue, \
+                                             DoubleValue)                    \
+  (CheckedFixedPointConstant<FixedPointType>(ScalarRawValue, DoubleValue))
+
+#else
+#define GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(FixedPointType, ScalarRawValue, \
+                                             DoubleValue)                    \
+  (FixedPointType::FromScalarRaw(ScalarRawValue))
+#endif
+
+// Implementation of exponential function.
+
+// Returns exp(x) for x in [-1/4, 0).
+template <typename tRawType>
+FixedPoint<tRawType, 0> exp_on_interval_between_negative_one_quarter_and_0_excl(
+    FixedPoint<tRawType, 0> a) {
+  typedef FixedPoint<tRawType, 0> F;
+  const F constant_term =
+      GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(F, 1895147668, std::exp(-1.0 / 8.0));
+  const F constant_1_over_3 =
+      GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(F, 715827883, 1.0 / 3.0);
+  // We're evaluating a Taylor expansion around -1/8, so we do the change of
+  // variable: x = a + 1/8.
+  // In fixed-point with 0 integer bits, 1/8 is represented by 1 << 28.
+  F x = a + F::template ConstantPOT<-3>();
+  F x2 = x * x;
+  F x3 = x2 * x;
+  F x4 = x2 * x2;
+  F x4_over_4 = SaturatingRoundingMultiplyByPOT<-2>(x4);
+  F x4_over_24_plus_x3_over_6_plus_x2_over_2 =
+      SaturatingRoundingMultiplyByPOT<-1>(
+          ((x4_over_4 + x3) * constant_1_over_3) + x2);
+  return constant_term +
+         constant_term * (x + x4_over_24_plus_x3_over_6_plus_x2_over_2);
+}
+
+// Returns exp(x) for x < 0.
+template <typename tRawType, int tIntegerBits>
+FixedPoint<tRawType, 0> exp_on_negative_values(
+    FixedPoint<tRawType, tIntegerBits> a) {
+  typedef FixedPoint<tRawType, tIntegerBits> InputF;
+  typedef FixedPoint<tRawType, 0> ResultF;
+  static const int kFractionalBits = InputF::kFractionalBits;
+  static const int kIntegerBits = InputF::kIntegerBits;
+  static const InputF kOneQuarter = InputF::template ConstantPOT<-2>();
+  InputF mask = kOneQuarter - InputF::FromScalarRaw(1);
+  InputF a_mod_quarter_minus_one_quarter = (a & mask) - kOneQuarter;
+  ResultF result = exp_on_interval_between_negative_one_quarter_and_0_excl(
+      Rescale<0>(a_mod_quarter_minus_one_quarter));
+  tRawType remainder = (a_mod_quarter_minus_one_quarter - a).raw();
+
+#define GEMMLOWP_EXP_BARREL_SHIFTER(Exponent, FixedPointMultiplier)         \
+  if (kIntegerBits > Exponent) {                                            \
+    const ResultF kMultiplier = GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(       \
+        ResultF, FixedPointMultiplier, std::exp(-std::pow(2.0, Exponent))); \
+    static constexpr int kShiftAmount =                                     \
+        kIntegerBits > Exponent ? kFractionalBits + Exponent : 0;           \
+    result = SelectUsingMask(                                               \
+        MaskIfNonZero(BitAnd(remainder, Dup<tRawType>(1 << kShiftAmount))), \
+        result * kMultiplier, result);                                      \
+  }
+
+  GEMMLOWP_EXP_BARREL_SHIFTER(-2, 1672461947);
+  GEMMLOWP_EXP_BARREL_SHIFTER(-1, 1302514674);
+  GEMMLOWP_EXP_BARREL_SHIFTER(+0, 790015084);
+  GEMMLOWP_EXP_BARREL_SHIFTER(+1, 290630308);
+  GEMMLOWP_EXP_BARREL_SHIFTER(+2, 39332535);
+  GEMMLOWP_EXP_BARREL_SHIFTER(+3, 720401);
+  GEMMLOWP_EXP_BARREL_SHIFTER(+4, 242);
+
+#undef GEMMLOWP_EXP_BARREL_SHIFTER
+
+  if (kIntegerBits > 5) {
+    static const int b = kIntegerBits > 5 ? kFractionalBits + 5 : 0;
+    const InputF clamp =
+        GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(InputF, -(1 << b), -32.0);
+    result = SelectUsingMask(MaskIfLessThan(a, clamp), ResultF::Zero(), result);
+  }
+
+  result = SelectUsingMask(MaskIfZero(a), ResultF::One(), result);
+  return result;
+}
+
+// Implementation of tanh: (1 - exp(-2x)) / (1 + exp(-2x)).
+
+// Returns (1 - x) / (1 + x) for x in (0, 1).
+template <typename tRawType>
+FixedPoint<tRawType, 0> one_minus_x_over_one_plus_x_for_x_in_0_1(
+    FixedPoint<tRawType, 0> a) {
+  typedef FixedPoint<tRawType, 0> F0;
+  typedef FixedPoint<tRawType, 2> F2;
+  F0 half_denominator = RoundingHalfSum(a, F0::One());
+  // Newton-Raphson division
+  // https://en.wikipedia.org/wiki/Division_algorithm#Newton.E2.80.93Raphson_division
+  // Refer to that page for the logic behind the 48/17 and 32/17 constants.
+  const F2 constant_48_over_17 =
+      GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(F2, 1515870810, 48.0 / 17.0);
+  const F2 constant_neg_32_over_17 =
+      GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(F2, -1010580540, -32.0 / 17.0);
+  F2 x = constant_48_over_17 + half_denominator * constant_neg_32_over_17;
+  for (int i = 0; i < 3; i++) {
+    F2 half_denominator_times_x = half_denominator * x;
+    F2 one_minus_half_denominator_times_x =
+        F2::One() - half_denominator_times_x;
+    x = x + Rescale<2>(x * one_minus_half_denominator_times_x);
+  }
+  return Rescale<0>(x - F2::One());
+}
+
+// Returns -tanh(x) for x < 0.
+template <typename tRawType, int tIntegerBits>
+FixedPoint<tRawType, 0> neg_tanh_on_negative_values(
+    FixedPoint<tRawType, tIntegerBits> a) {
+  return one_minus_x_over_one_plus_x_for_x_in_0_1(
+      exp_on_negative_values(ExactMulByPot<1>(a)));
+}
+
+// Returns tanh(x) for any x.
+template <typename tRawType, int tIntegerBits>
+FixedPoint<tRawType, 0> tanh(FixedPoint<tRawType, tIntegerBits> a) {
+  typedef FixedPoint<tRawType, tIntegerBits> InputF;
+  typedef FixedPoint<tRawType, 0> ResultF;
+  tRawType mask_if_negative = MaskIfLessThan(a, InputF::Zero());
+  tRawType mask_if_zero = MaskIfZero(a);
+  InputF n = SelectUsingMask(mask_if_negative, a, -a);
+  ResultF t = neg_tanh_on_negative_values(n);
+  return SelectUsingMask(mask_if_zero, ResultF::Zero(),
+                         SelectUsingMask(mask_if_negative, -t, t));
+}
+
+// Implementation of logistic function.
+
+// Returns 1 / (1 + x) for x in (0, 1).
+template <typename tRawType>
+FixedPoint<tRawType, 0> one_over_one_plus_x_for_x_in_0_1(
+    FixedPoint<tRawType, 0> a) {
+  typedef FixedPoint<tRawType, 0> F0;
+  typedef FixedPoint<tRawType, 2> F2;
+  F0 half_denominator = RoundingHalfSum(a, F0::One());
+  // Newton-Raphson division
+  // https://en.wikipedia.org/wiki/Division_algorithm#Newton.E2.80.93Raphson_division
+  // Refer to that page for the logic behind the 48/17 and 32/17 constants.
+  const F2 constant_48_over_17 =
+      GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(F2, 1515870810, 48.0 / 17.0);
+  const F2 constant_neg_32_over_17 =
+      GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(F2, -1010580540, -32.0 / 17.0);
+  F2 x = constant_48_over_17 + half_denominator * constant_neg_32_over_17;
+  for (int i = 0; i < 3; i++) {
+    F2 half_denominator_times_x = half_denominator * x;
+    F2 one_minus_half_denominator_times_x =
+        F2::One() - half_denominator_times_x;
+    x = x + Rescale<2>(x * one_minus_half_denominator_times_x);
+  }
+  return Rescale<0>(ExactMulByPot<-1>(x));
+}
+
+// Returns logistic(x) = 1 / (1 + exp(-x)) for x > 0.
+template <typename tRawType, int tIntegerBits>
+FixedPoint<tRawType, 0> logistic_on_positive_values(
+    FixedPoint<tRawType, tIntegerBits> a) {
+  return one_over_one_plus_x_for_x_in_0_1(exp_on_negative_values(-a));
+}
+
+// Returns logistic(x) = 1 / (1 + exp(-x)) for any x.
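+// Uses the identity logistic(-x) = 1 - logistic(x) to reduce negative inputs
+// to the positive case handled above.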
+template <typename tRawType, int tIntegerBits>
+FixedPoint<tRawType, 0> logistic(FixedPoint<tRawType, tIntegerBits> a) {
+  typedef FixedPoint<tRawType, tIntegerBits> InputF;
+  typedef FixedPoint<tRawType, 0> ResultF;
+  tRawType mask_if_positive = MaskIfGreaterThan(a, InputF::Zero());
+  tRawType mask_if_zero = MaskIfZero(a);
+  InputF abs_input = SelectUsingMask(mask_if_positive, a, -a);
+  ResultF result_if_positive = logistic_on_positive_values(abs_input);
+  ResultF result_if_negative = ResultF::One() - result_if_positive;
+  const ResultF one_half =
+      GEMMLOWP_CHECKED_FIXEDPOINT_CONSTANT(ResultF, 1 << 30, 0.5);
+  return SelectUsingMask(mask_if_zero, one_half,
+                         SelectUsingMask(mask_if_positive, result_if_positive,
+                                         result_if_negative));
+}
+
+}  // end namespace gemmlowp
+
+#ifdef GEMMLOWP_NEON
+#include "./fixedpoint_neon.h"
+#elif defined(GEMMLOWP_SSE4)
+#include "./fixedpoint_sse.h"
+#endif
+
+#endif  // GEMMLOWP_INTERNAL_FIXEDPOINT_H_
diff --git a/fixedpoint/fixedpoint_neon.h b/fixedpoint/fixedpoint_neon.h
new file mode 100644
index 0000000..8b23de2
--- /dev/null
+++ b/fixedpoint/fixedpoint_neon.h
@@ -0,0 +1,175 @@
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// fixedpoint_neon.h: optimized NEON specializations of the templates
+// in fixedpoint.h.
+
+#ifndef GEMMLOWP_INTERNAL_FIXEDPOINT_NEON_H_
+#define GEMMLOWP_INTERNAL_FIXEDPOINT_NEON_H_
+
+#include <arm_neon.h>
+
+namespace gemmlowp {
+
+template <>
+struct FixedPointRawTypeTraits<int32x4_t> {
+  typedef std::int32_t ScalarRawType;
+  static const int kLanes = 4;
+};
+
+template <>
+inline int32x4_t BitAnd(int32x4_t a, int32x4_t b) {
+  return vandq_s32(a, b);
+}
+
+template <>
+inline int32x4_t BitOr(int32x4_t a, int32x4_t b) {
+  return vorrq_s32(a, b);
+}
+
+template <>
+inline int32x4_t BitXor(int32x4_t a, int32x4_t b) {
+  return veorq_s32(a, b);
+}
+
+template <>
+inline int32x4_t BitNot(int32x4_t a) {
+  return veorq_s32(a, vdupq_n_s32(-1));
+}
+
+template <>
+inline int32x4_t Add(int32x4_t a, int32x4_t b) {
+  return vaddq_s32(a, b);
+}
+
+template <>
+inline int32x4_t Sub(int32x4_t a, int32x4_t b) {
+  return vsubq_s32(a, b);
+}
+
+template <>
+inline int32x4_t Neg(int32x4_t a) {
+  return vnegq_s32(a);
+}
+
+template <>
+inline int32x4_t ShiftLeft(int32x4_t a, int offset) {
+  return vshlq_s32(a, vdupq_n_s32(offset));
+}
+
+template <>
+inline int32x4_t ShiftRight(int32x4_t a, int offset) {
+  return vshlq_s32(a, vdupq_n_s32(-offset));
+}
+
+template <>
+inline int32x4_t SelectUsingMask(int32x4_t if_mask, int32x4_t then_val,
+                                 int32x4_t else_val) {
+  return vbslq_s32(vreinterpretq_u32_s32(if_mask), then_val, else_val);
+}
+
+template <>
+inline int32x4_t MaskIfEqual(int32x4_t a, int32x4_t b) {
+  return vreinterpretq_s32_u32(vceqq_s32(a, b));
+}
+
+template <>
+inline int32x4_t MaskIfNotEqual(int32x4_t a, int32x4_t b) {
+  return BitNot(MaskIfEqual(a, b));
+}
+
+template <>
+inline int32x4_t MaskIfZero(int32x4_t a) {
+  return MaskIfEqual(a, vdupq_n_s32(0));
+}
+
+template <>
+inline int32x4_t MaskIfNonZero(int32x4_t a) {
+  return vreinterpretq_s32_u32(vtstq_s32(a, a));
+}
+
+template <>
+inline int32x4_t MaskIfGreaterThan(int32x4_t a, int32x4_t b) {
+  return vreinterpretq_s32_u32(vcgtq_s32(a, b));
+}
+
+template <>
+inline int32x4_t MaskIfGreaterThanOrEqual(int32x4_t a, int32x4_t b) {
+  return vreinterpretq_s32_u32(vcgeq_s32(a, b));
+}
+
+template <>
+inline int32x4_t MaskIfLessThan(int32x4_t a, int32x4_t b) {
+  return vreinterpretq_s32_u32(vcltq_s32(a, b));
+}
+
+template <>
+inline int32x4_t MaskIfLessThanOrEqual(int32x4_t a, int32x4_t b) {
+  return vreinterpretq_s32_u32(vcleq_s32(a, b));
+}
+
+template <>
+inline bool All(int32x4_t a) {
+  a = vandq_s32(a, vextq_s32(a, a, 1));
+  a = vandq_s32(a, vextq_s32(a, a, 2));
+  return vgetq_lane_s32(a, 0);
+}
+
+template <>
+inline bool Any(int32x4_t a) {
+  a = vorrq_s32(a, vextq_s32(a, a, 1));
+  a = vorrq_s32(a, vextq_s32(a, a, 2));
+  return vgetq_lane_s32(a, 0);
+}
+
+template <>
+inline int32x4_t RoundingHalfSum(int32x4_t a, int32x4_t b) {
+  return vrhaddq_s32(a, b);
+}
+
+template <>
+inline int32x4_t SaturatingRoundingDoublingHighMul(int32x4_t a, int32x4_t b) {
+  return vqrdmulhq_s32(a, b);
+}
+
+template <>
+inline int32x4_t RoundingDivideByPOT(int32x4_t x, int exponent) {
+  const int32x4_t shift_vec = vdupq_n_s32(-exponent);
+  const int32x4_t fixup = vshrq_n_s32(vandq_s32(x, shift_vec), 31);
+  const int32x4_t fixed_up_x = vqaddq_s32(x, fixup);
+  return vrshlq_s32(fixed_up_x, shift_vec);
+}
+
+template <int Exponent>
+struct ImplSaturatingRoundingMultiplyByPOT<Exponent, int32x4_t, 1> {
+  static int32x4_t eval(int32x4_t x) { return vqshlq_n_s32(x, Exponent); }
+};
+
+template <int Exponent>
+struct ImplSaturatingRoundingMultiplyByPOT<Exponent, int32x4_t, -1> {
+  static int32x4_t eval(int32x4_t x) {
+    const int32x4_t fixup = vshrq_n_s32(x, 31);
+    const int32x4_t fixed_up_x = vqaddq_s32(x, fixup);
+    return vrshrq_n_s32(fixed_up_x, -Exponent);
+  }
+};
+
+template <>
+inline int32x4_t Dup<int32x4_t>(std::int32_t x) {
+  return vdupq_n_s32(x);
+}
+
+}  // end namespace gemmlowp
+
+#endif  // GEMMLOWP_INTERNAL_FIXEDPOINT_NEON_H_
diff --git a/fixedpoint/fixedpoint_sse.h b/fixedpoint/fixedpoint_sse.h
new file mode 100644
index 0000000..3f2654d
--- /dev/null
+++ b/fixedpoint/fixedpoint_sse.h
@@ -0,0 +1,218 @@
+// Copyright 2015 Google Inc. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// fixedpoint_sse.h: optimized SSE specializations of the templates
+// in fixedpoint.h.
+
+#ifndef GEMMLOWP_INTERNAL_FIXEDPOINT_SSE_H_
+#define GEMMLOWP_INTERNAL_FIXEDPOINT_SSE_H_
+
+#include <smmintrin.h>
+#include "fixedpoint.h"
+
+namespace gemmlowp {
+
+template <>
+struct FixedPointRawTypeTraits<__m128i> {
+  typedef std::int32_t ScalarRawType;
+  static const int kLanes = 4;
+};
+
+template <>
+inline __m128i BitAnd(__m128i a, __m128i b) {
+  return _mm_and_si128(a, b);
+}
+
+template <>
+inline __m128i BitOr(__m128i a, __m128i b) {
+  return _mm_or_si128(a, b);
+}
+
+template <>
+inline __m128i BitXor(__m128i a, __m128i b) {
+  return _mm_xor_si128(a, b);
+}
+
+template <>
+inline __m128i BitNot(__m128i a) {
+  return _mm_andnot_si128(a, _mm_set1_epi32(-1));
+}
+
+template <>
+inline __m128i Add(__m128i a, __m128i b) {
+  return _mm_add_epi32(a, b);
+}
+
+template <>
+inline __m128i Mul(__m128i a, __m128i b) {
+  return _mm_mullo_epi32(a, b);
+}
+
+template <>
+inline __m128i Sub(__m128i a, __m128i b) {
+  return _mm_sub_epi32(a, b);
+}
+
+template <>
+inline __m128i Neg(__m128i a) {
+  return _mm_sign_epi32(a, _mm_set1_epi32(-1));
+}
+
+template <>
+inline __m128i ShiftLeft(__m128i a, int offset) {
+  return _mm_slli_epi32(a, offset);
+}
+
+template <>
+inline __m128i ShiftRight(__m128i a, int offset) {
+  return _mm_srai_epi32(a, offset);
+}
+
+template <>
+inline __m128i SelectUsingMask(__m128i if_mask, __m128i then_val,
+                               __m128i else_val) {
+  return _mm_castps_si128(_mm_blendv_ps(_mm_castsi128_ps(else_val),
+                                        _mm_castsi128_ps(then_val),
+                                        _mm_castsi128_ps(if_mask)));
+}
+
+template <>
+inline __m128i MaskIfEqual(__m128i a, __m128i b) {
+  return _mm_cmpeq_epi32(a, b);
+}
+
+template <>
+inline __m128i MaskIfNotEqual(__m128i a, __m128i b) {
+  return BitNot(MaskIfEqual(a, b));
+}
+
+template <>
+inline __m128i MaskIfZero(__m128i a) {
+  return MaskIfEqual(a, _mm_set1_epi32(0));
+}
+
+template <>
+inline __m128i MaskIfNonZero(__m128i a) {
+  return MaskIfNotEqual(a, _mm_set1_epi32(0));
+}
+
+template <>
+inline __m128i MaskIfGreaterThan(__m128i a, __m128i b) {
+  return _mm_cmpgt_epi32(a, b);
+}
+
+template <>
+inline __m128i MaskIfLessThan(__m128i a, __m128i b) {
+  return _mm_cmplt_epi32(a, b);
+}
+
+template <>
+inline __m128i MaskIfGreaterThanOrEqual(__m128i a, __m128i b) {
+  return BitNot(MaskIfLessThan(a, b));
+}
+
+template <>
+inline __m128i MaskIfLessThanOrEqual(__m128i a, __m128i b) {
+  return BitNot(MaskIfGreaterThan(a, b));
+}
+
+/* Assumptions:
+   - All and Any are only used on masks.
+   - Masks are all-ones for true lanes, all-zeroes otherwise.
+   Hence, All means all 128 bits are set, and Any means any bit is set.
+*/
+
+template <>
+inline bool All(__m128i a) {
+  // testc(a, b) sets CF iff (~a & b) == 0, so test against all-ones to check
+  // that every bit of a is set.
+  return _mm_testc_si128(a, _mm_set1_epi32(-1));
+}
+
+template <>
+inline bool Any(__m128i a) {
+  // testz(a, a) returns 1 iff a == 0, so Any is its logical negation.
+  return !_mm_testz_si128(a, a);
+}
+
+template <>
+inline __m128i RoundingHalfSum(__m128i a, __m128i b) {
+  /* __m128i round_bit_mask, a_over_2, b_over_2, round_bit, sum; */
+  /* We divide the inputs before the add to avoid the overflow, and to avoid
+     the costly test for whether the signed add overflowed: */
+  /* round_bit_mask = _mm_set1_epi32(1); */
+  /* a_over_2 = _mm_srai_epi32(a, 1); */
+  /* b_over_2 = _mm_srai_epi32(b, 1); */
+  /* sum = Add(a_over_2, b_over_2); */
+  /* round_bit = _mm_sign_epi32(BitAnd(BitOr(a,b), round_bit_mask), sum); */
+  /* return Add(sum, round_bit); */
+
+  /* The approach used here instead: detect overflow and flip the sign bit of
+     the result if an overflow happened. */
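+  /* Why this works: when a + b overflows, the wrapped 32-bit value of
+     (a + b + 1) >> 1 is off by exactly 2^31, so flipping its sign bit restores
+     the correct half-sum. Overflow is detected as the half-sum's sign
+     differing from the signs of both a and b, which only happens on overflow. */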
+  __m128i one, sign_bit_mask, sum, rounded_half_sum, overflow, result;
+  one = _mm_set1_epi32(1);
+  sign_bit_mask = _mm_set1_epi32(0x80000000);
+  sum = Add(a, b);
+  rounded_half_sum = _mm_srai_epi32(Add(sum, one), 1);
+  overflow =
+      BitAnd(BitAnd(BitXor(a, rounded_half_sum), BitXor(b, rounded_half_sum)),
+             sign_bit_mask);
+  result = BitXor(rounded_half_sum, overflow);
+  return result;
+}
+
+template <>
+inline __m128i SaturatingRoundingDoublingHighMul(__m128i a, __m128i b) {
+  __m128i min, saturation_mask, a0_a2, a1_a3, b0_b2, b1_b3;
+  __m128i a0b0_a2b2, a1b1_a3b3, a0b0_a2b2_rounded, a1b1_a3b3_rounded;
+  __m128i a0b0_a2b2_rounded_2x, a1b1_a3b3_rounded_2x, result;
+  __m128i nudge;
+
+  // Saturation only happens if a == b == INT_MIN.
+  min = _mm_set1_epi32(std::numeric_limits<std::int32_t>::min());
+  saturation_mask = BitAnd(MaskIfEqual(a, b), MaskIfEqual(a, min));
+
+  // a = a0 | a1 | a2 | a3
+  // b = b0 | b1 | b2 | b3
+  a0_a2 = a;
+  a1_a3 = _mm_srli_si128(a, 4);
+  b0_b2 = b;
+  b1_b3 = _mm_srli_si128(b, 4);
+
+  a0b0_a2b2 = _mm_mul_epi32(a0_a2, b0_b2);
+  a1b1_a3b3 = _mm_mul_epi32(a1_a3, b1_b3);
+
+  // do the rounding and take into account that it will be doubled
+  nudge = _mm_set1_epi64x(1 << 30);
+  a0b0_a2b2_rounded = _mm_add_epi64(a0b0_a2b2, nudge);
+  a1b1_a3b3_rounded = _mm_add_epi64(a1b1_a3b3, nudge);
+
+  // do the doubling
+  a0b0_a2b2_rounded_2x = _mm_slli_epi64(a0b0_a2b2_rounded, 1);
+  a1b1_a3b3_rounded_2x = _mm_slli_epi64(a1b1_a3b3_rounded, 1);
+
+  // get the high part of the products
+  result = _mm_blend_epi16(_mm_srli_si128(a0b0_a2b2_rounded_2x, 4),
+                           a1b1_a3b3_rounded_2x, 0xcc);
+
+  // saturate those which overflowed
+  return SelectUsingMask(saturation_mask, min, result);
+}
+
+template <>
+inline __m128i Dup<__m128i>(std::int32_t x) {
+  return _mm_set1_epi32(x);
+}
+
+}  // end namespace gemmlowp
+
+#endif  // GEMMLOWP_INTERNAL_FIXEDPOINT_SSE_H_
diff --git a/flags.bzl b/flags.bzl
index 46e521f..16dba2d 100644
--- a/flags.bzl
+++ b/flags.bzl
@@ -1,3 +1,12 @@
-LIB_LINKOPTS = ["-lpthread"]
+# Android builds do not need to link in a separate pthread library.
+LIB_COPTS = []
 
-BIN_LINKOPTS = ["-lpthread"]
+LIB_LINKOPTS = select({
+    ":android": [],
+    "//conditions:default": ["-lpthread"],
+})
+
+BIN_LINKOPTS = select({
+    ":android": [],
+    "//conditions:default": ["-lpthread"],
+})
diff --git a/internal/allocator.h b/internal/allocator.h
index b0d7781..0fe4a01 100644
--- a/internal/allocator.h
+++ b/internal/allocator.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -43,10 +43,19 @@
 
 #if defined ANDROID || defined __ANDROID__
 #include <android/api-level.h>
-#if __ANDROID_API__ < 16
+// The 18 here should be 16, but has to be 18 for now due
+// to a Google-internal issue.
+#if __ANDROID_API__ < 18
 #include <malloc.h>
 #define GEMMLOWP_USE_MEMALIGN
 #endif
+// posix_memalign is missing on some 4.1 x86 devices
+#if __ANDROID_API__ == 18
+#ifdef GEMMLOWP_X86_32
+#include <malloc.h>
+#define GEMMLOWP_USE_MEMALIGN
+#endif
+#endif
 #endif
 
 namespace gemmlowp {
diff --git a/internal/block_params.h b/internal/block_params.h
index 48f93df..b2fc3ff 100644
--- a/internal/block_params.h
+++ b/internal/block_params.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -43,15 +43,19 @@
   int l2_depth;
 
   template <typename KernelFormat>
-  void Init(int rows, int cols, int depth, int num_threads) {
-    FindL2BlockSizes<KernelFormat>(rows, cols, depth, num_threads, &l2_rows,
-                                   &l2_cols, &l2_depth);
-    FindL1BlockSizes<KernelFormat>(l2_rows, l2_cols, l2_depth, &l1_rows,
-                                   &l1_cols, &l1_depth);
+  void Init(int rows, int cols, int depth, int num_threads,
+            int l1_bytes_to_use, int l2_bytes_to_use, float l2_rhs_factor) {
+    FindL2BlockSizes<KernelFormat>(rows, cols, depth, num_threads,
+                                   l2_bytes_to_use, l2_rhs_factor,
+                                   &l2_rows, &l2_cols, &l2_depth);
+    FindL1BlockSizes<KernelFormat>(l2_rows, l2_cols, l2_depth,
+                                   l1_bytes_to_use,
+                                   &l1_rows, &l1_cols, &l1_depth);
   }
 
   template <typename KernelFormat>
   static void FindL2BlockSizes(int rows, int cols, int depth, int num_threads,
+                               int l2_bytes_to_use, float l2_rhs_factor,
                                int* out_l2_rows, int* out_l2_cols,
                                int* out_l2_depth) {
     int l2_rows = 0;
@@ -64,9 +68,6 @@
     // of register size, so as to avoid having to special-case unaligned depths.
     l2_depth = RoundUp<kRegisterSize>(depth);
 
-    const int l2_bytes_to_use = kDefaultL2CacheSize;
-    const float l2_rhs_factor = kDefaultL2RhsFactor;
-
     {
       int max_cache_friendly_l2_cols = std::max(
           1, static_cast<int>(l2_rhs_factor * (l2_bytes_to_use / l2_depth)));
@@ -97,7 +98,8 @@
   }
 
   template <typename KernelFormat>
-  static void FindL1BlockSizes(int rows, int cols, int depth, int* out_l1_rows,
+  static void FindL1BlockSizes(int rows, int cols, int depth,
+                               int l1_bytes_to_use, int* out_l1_rows,
                                int* out_l1_cols, int* out_l1_depth) {
     int l1_rows = 0;
     int l1_cols = 0;
@@ -112,8 +114,6 @@
     // Thought not to be needed. Similar to Eigen.
     l1_cols = cols;
 
-    const int l1_bytes_to_use = kDefaultL1CacheSize;
-
     {
       int max_cache_friendly_l1_depth = std::max(
           1, (l1_bytes_to_use - 4 * KernelFormat::kRows * KernelFormat::kCols) /
diff --git a/internal/common.h b/internal/common.h
index 3d94041..1d89b26 100644
--- a/internal/common.h
+++ b/internal/common.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -86,20 +86,32 @@
 #define GEMMLOWP_NEON_64
 #endif
 
-// Detect SSE4.
-#if defined __SSE4_1__
+// Detect SSE.
+#ifdef __SSE4_1__
 #define GEMMLOWP_SSE4
 #endif
 
+#ifdef __SSE3__
+#define GEMMLOWP_SSE3
+#endif
+
 // Convenience SSE4 tokens for 32-bit or 64-bit
 #if defined(GEMMLOWP_SSE4) && defined(GEMMLOWP_X86_32)
 #define GEMMLOWP_SSE4_32
 #endif
 
+#if defined(GEMMLOWP_SSE3) && defined(GEMMLOWP_X86_32)
+#define GEMMLOWP_SSE3_32
+#endif
+
 #if defined(GEMMLOWP_SSE4) && defined(GEMMLOWP_X86_64)
 #define GEMMLOWP_SSE4_64
 #endif
 
+#if defined(GEMMLOWP_SSE3) && defined(GEMMLOWP_X86_64)
+#define GEMMLOWP_SSE3_64
+#endif
+
 #endif  // GEMMLOWP_ALLOW_INLINE_ASM
 
 // Detect Android. Don't conflate with ARM - we care about tuning
@@ -134,8 +146,13 @@
 // Of course, these values are in principle too low for typical x86 CPUs
 // where we should set the L2 value to (L3 cache size / number of cores) at
 // least.
-#if defined(GEMMLOWP_ARM) || defined(GEMMLOWP_ANDROID)
-// ARM or ARM-like hardware (Android implies ARM-like) so here it's OK
+//
+#if defined(GEMMLOWP_ARM) && defined(__APPLE__)
+// iPhone/iPad
+const int kDefaultL1CacheSize = 48 * 1024;
+const int kDefaultL2CacheSize = 2 * 1024 * 1024;
+#elif defined(GEMMLOWP_ARM) || defined(GEMMLOWP_ANDROID)
+// Other ARM or ARM-like hardware (Android implies ARM-like) so here it's OK
 // to tune for ARM, although on x86 Atom we might be able to query
 // cache sizes at runtime, which would be better.
 const int kDefaultL1CacheSize = 16 * 1024;
@@ -180,13 +197,17 @@
 // are consistent with this value.
 const int kRegisterSize = 16;
 
-// Requantization to less-than-8-bit is costly, so it only worth
-// doing if the GEMM width is large enough
-const int kMinimumWidthForRequantization = 100;
-
 // Hints the CPU to prefetch the cache line containing ptr.
 inline void Prefetch(const void* ptr) {
-#ifdef __GNUC__  // Clang and GCC define __GNUC__ and have __builtin_prefetch.
+#if defined GEMMLOWP_ARM_64 && defined GEMMLOWP_ALLOW_INLINE_ASM
+  // AArch64 has very detailed prefetch instructions that compilers
+  // can't know how to map __builtin_prefetch onto, and as a result don't,
+  // leaving __builtin_prefetch a no-op on this architecture.
+  // For our purposes, "pldl1keep" is usually what we want, meaning:
+  // "prefetch for load, into L1 cache, using each value multiple times".
+  asm volatile("prfm pldl1keep, [%[ptr]]\n" ::[ptr] "r"(ptr) : );
+#elif defined \
+    __GNUC__  // Clang and GCC define __GNUC__ and have __builtin_prefetch.
   __builtin_prefetch(ptr);
 #else
   (void)ptr;
diff --git a/internal/compute.h b/internal/compute.h
index 4587df3..bbc9e2a 100644
--- a/internal/compute.h
+++ b/internal/compute.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -48,9 +48,11 @@
         packed_lhs_(_packed_lhs),
         packed_rhs_(_packed_rhs) {}
 
-  void Compute() {
-    for (int d = 0; d < block_params_.l2_depth; d += block_params_.l1_depth) {
-      int ds = std::min(block_params_.l1_depth, block_params_.l2_depth - d);
+  void Compute(int depth) {
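+    // Round depth up to the kernel's depth granularity (Format::kDepth) so the
+    // kernel never runs on a partial depth slice; the rounded value must still
+    // fit within the packed blocks, hence the assert against l2_depth.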
+    depth = RoundUp<Format::kDepth>(depth);
+    assert(depth <= block_params_.l2_depth);
+    for (int d = 0; d < depth; d += block_params_.l1_depth) {
+      int ds = std::min(block_params_.l1_depth, depth - d);
 
       for (int r = 0; r < block_params_.l2_rows; r += block_params_.l1_rows) {
         int rs = std::min(block_params_.l1_rows, block_params_.l2_rows - r);
@@ -89,12 +91,12 @@
 template <typename PackedLhs, typename PackedRhs, typename PackedResult>
 void Compute(const KernelBase& kernel, const BlockParams& block_params,
              PackedResult* packed_result, const PackedLhs& packed_lhs,
-             const PackedRhs& packed_rhs) {
+             const PackedRhs& packed_rhs, int depth) {
   ScopedProfilingLabel label("compute");
   ComputeImpl<PackedLhs, PackedRhs, PackedResult> impl(
       kernel, block_params, packed_result, packed_lhs, packed_rhs);
 
-  impl.Compute();
+  impl.Compute(depth);
 }
 
 }  // namespace gemmlowp
diff --git a/internal/dispatch_gemm_shape.h b/internal/dispatch_gemm_shape.h
new file mode 100644
index 0000000..0be0bf3
--- /dev/null
+++ b/internal/dispatch_gemm_shape.h
@@ -0,0 +1,189 @@
+// Copyright 2017 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// dispatch_gemm_shape.h: dispatch GEMM calls according to their shape
+
+#ifndef GEMMLOWP_INTERNAL_DISPATCH_GEMM_SHAPE_H_
+#define GEMMLOWP_INTERNAL_DISPATCH_GEMM_SHAPE_H_
+
+#include "../internal/kernel_default.h"
+#include "../public/map.h"
+#include "../public/output_stages.h"
+#include "multi_thread_gemm.h"
+
+namespace gemmlowp {
+
+template <typename T>
+struct TransposeImpl {
+  typedef T DstType;
+  static T Run(const T& t) { return t; }
+};
+
+template <typename T>
+using TransposeType = typename TransposeImpl<T>::DstType;
+
+template <typename T>
+TransposeType<T> Transpose(const T& t) {
+  return TransposeImpl<T>::Run(t);
+}
+
+template <MapOrder Order>
+struct TransposeMapOrder {
+  static constexpr MapOrder Value =
+      Order == MapOrder::RowMajor ? MapOrder::ColMajor : MapOrder::RowMajor;
+};
+
+template <VectorShape Shape>
+struct TransposeVectorShape {
+  static constexpr VectorShape Value =
+      Shape == VectorShape::Row ? VectorShape::Col : VectorShape::Row;
+};
+
+template <typename Scalar, VectorShape Shape>
+struct TransposeImpl<VectorMap<Scalar, Shape>> {
+  typedef VectorMap<Scalar, Shape> SrcType;
+  static constexpr VectorShape TransposedShape =
+      TransposeVectorShape<Shape>::Value;
+  typedef VectorMap<Scalar, TransposedShape> DstType;
+  static DstType Run(const SrcType& src) {
+    return DstType(src.data(), src.size());
+  }
+};
+
+template <typename Scalar, MapOrder Order>
+struct TransposeImpl<MatrixMap<Scalar, Order>> {
+  typedef MatrixMap<Scalar, Order> SrcType;
+  static constexpr MapOrder TransposedOrder = TransposeMapOrder<Order>::Value;
+  typedef MatrixMap<Scalar, TransposedOrder> DstType;
+  static DstType Run(const SrcType& src) {
+    return DstType(src.data(), src.cols(), src.rows(), src.stride());
+  }
+};
+
+template <VectorShape Shape>
+struct TransposeImpl<OutputStageQuantizeDownInt32ToUint8ScalePC<Shape>> {
+  typedef OutputStageQuantizeDownInt32ToUint8ScalePC<Shape> SrcType;
+  static const VectorShape TransposedShape = TransposeVectorShape<Shape>::Value;
+  typedef OutputStageQuantizeDownInt32ToUint8ScalePC<TransposedShape> DstType;
+  static DstType Run(const SrcType& src) {
+    DstType dst;
+    dst.result_shift = src.result_shift;
+    dst.result_offset = Transpose(src.result_offset);
+    dst.result_mult_int = Transpose(src.result_mult_int);
+    return dst;
+  }
+};
+
+template <typename VectorMapType>
+struct TransposeImpl<OutputStageBiasAddition<VectorMapType>> {
+  typedef OutputStageBiasAddition<VectorMapType> SrcType;
+  typedef TransposeType<VectorMapType> TransposedVectorMapType;
+  typedef OutputStageBiasAddition<TransposedVectorMapType> DstType;
+  static DstType Run(const SrcType& src) {
+    DstType dst;
+    dst.bias_vector = Transpose(src.bias_vector);
+    return dst;
+  }
+};
+
+// TODO(benoitjacob) - does anyone understand C++ variadic templates?
+// How to use them to implement TransposeTuple? Note: there are lots
+// of answers on StackOverflow but they seem to all involve either
+// C++14/C++17 (we can only use C++11) or lots of abstract nonsense.
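+// A possible C++11 answer, kept here only as a non-built, illustrative sketch:
+// a hand-rolled index sequence (std::index_sequence is C++14) would let a
+// single variadic overload replace the explicit ones below. Names such as
+// IndexSeq and MakeIndexSeq are not part of gemmlowp.
+//
+//   template <std::size_t...> struct IndexSeq {};
+//   template <std::size_t N, std::size_t... Is>
+//   struct MakeIndexSeq : MakeIndexSeq<N - 1, N - 1, Is...> {};
+//   template <std::size_t... Is>
+//   struct MakeIndexSeq<0, Is...> { typedef IndexSeq<Is...> Type; };
+//
+//   template <typename... Ts, std::size_t... Is>
+//   std::tuple<TransposeType<Ts>...> TransposeTupleImpl(
+//       const std::tuple<Ts...>& t, IndexSeq<Is...>) {
+//     return std::make_tuple(Transpose(std::get<Is>(t))...);
+//   }
+//
+//   template <typename... Ts>
+//   std::tuple<TransposeType<Ts>...> TransposeTuple(
+//       const std::tuple<Ts...>& t) {
+//     return TransposeTupleImpl(
+//         t, typename MakeIndexSeq<sizeof...(Ts)>::Type());
+//   }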
+inline std::tuple<> TransposeTuple(const std::tuple<>& t) { return t; }
+
+template <typename T0>
+std::tuple<TransposeType<T0>> TransposeTuple(const std::tuple<T0>& t) {
+  return std::make_tuple(Transpose(std::get<0>(t)));
+}
+
+template <typename T0, typename T1>
+std::tuple<TransposeType<T0>, TransposeType<T1>> TransposeTuple(
+    const std::tuple<T0, T1>& t) {
+  return std::make_tuple(Transpose(std::get<0>(t)), Transpose(std::get<1>(t)));
+}
+
+template <typename T0, typename T1, typename T2>
+std::tuple<TransposeType<T0>, TransposeType<T1>, TransposeType<T2>>
+TransposeTuple(const std::tuple<T0, T1, T2>& t) {
+  return std::make_tuple(Transpose(std::get<0>(t)), Transpose(std::get<1>(t)),
+                         Transpose(std::get<2>(t)));
+}
+
+template <typename T0, typename T1, typename T2, typename T3>
+std::tuple<TransposeType<T0>, TransposeType<T1>, TransposeType<T2>,
+           TransposeType<T3>>
+TransposeTuple(const std::tuple<T0, T1, T2, T3>& t) {
+  return std::make_tuple(Transpose(std::get<0>(t)), Transpose(std::get<1>(t)),
+                         Transpose(std::get<2>(t)), Transpose(std::get<3>(t)));
+}
+
+template <typename T0, typename T1, typename T2, typename T3, typename T4>
+std::tuple<TransposeType<T0>, TransposeType<T1>, TransposeType<T2>,
+           TransposeType<T3>, TransposeType<T4>>
+TransposeTuple(const std::tuple<T0, T1, T2, T3, T4>& t) {
+  return std::make_tuple(Transpose(std::get<0>(t)), Transpose(std::get<1>(t)),
+                         Transpose(std::get<2>(t)), Transpose(std::get<3>(t)),
+                         Transpose(std::get<4>(t)));
+}
+
+template <typename T0, typename T1, typename T2, typename T3, typename T4,
+          typename T5>
+std::tuple<TransposeType<T0>, TransposeType<T1>, TransposeType<T2>,
+           TransposeType<T3>, TransposeType<T4>, TransposeType<T5>>
+TransposeTuple(const std::tuple<T0, T1, T2, T3, T4, T5>& t) {
+  return std::make_tuple(Transpose(std::get<0>(t)), Transpose(std::get<1>(t)),
+                         Transpose(std::get<2>(t)), Transpose(std::get<3>(t)),
+                         Transpose(std::get<4>(t)), Transpose(std::get<5>(t)));
+}
+
+template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
+          MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
+          typename LhsOffset, typename RhsOffset, typename OutputPipelineType,
+          typename GemmContextType>
+void DispatchGemmShape(GemmContextType* context,
+                       const MatrixMap<const InputScalar, LhsOrder>& lhs,
+                       const MatrixMap<const InputScalar, RhsOrder>& rhs,
+                       MatrixMap<OutputScalar, ResultOrder>* result,
+                       const LhsOffset& lhs_offset, const RhsOffset& rhs_offset,
+                       const OutputPipelineType& output_pipeline) {
+  assert(lhs.cols() == rhs.rows());
+
+  int rows = result->rows();
+  int cols = result->cols();
+  int depth = lhs.cols();
+
+  if (rows == 0 || cols == 0 || depth == 0) {
+    // Vacuous GEMM, return early to avoid having to deal with
+    // zero sizes below.
+    return;
+  }
+
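+  // When the result has fewer rows than columns, transpose the whole problem
+  // using (A * B)^T = B^T * A^T: swap and transpose the operands, the offsets
+  // and the output pipeline, and compute into a transposed view of the result.
+  // The code below can then assume rows >= cols.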
+  if (rows < cols) {
+    auto transposed_result_map = Transpose(*result);
+    return DispatchGemmShape<InputScalar, OutputScalar, BitDepthParams>(
+        context, Transpose(rhs), Transpose(lhs), &transposed_result_map,
+        Transpose(rhs_offset), Transpose(lhs_offset),
+        TransposeTuple(output_pipeline));
+  }
+
+  typedef DefaultKernel<BitDepthParams> Kernel;
+  MultiThreadGemm<typename Kernel::Format, InputScalar, OutputScalar,
+                  BitDepthParams>(context, Kernel(), lhs, rhs, result,
+                                  lhs_offset, rhs_offset, output_pipeline);
+}
+
+}  // end namespace gemmlowp
+
+#endif  // GEMMLOWP_INTERNAL_DISPATCH_GEMM_SHAPE_H_
diff --git a/internal/iterator.h b/internal/iterator.h
index 917694d..524cb80 100644
--- a/internal/iterator.h
+++ b/internal/iterator.h
@@ -34,8 +34,9 @@
 class ConstIterator<VectorMap<tScalar, tShape>> {
  public:
   typedef tScalar Scalar;
-  ConstIterator(const VectorMap<tScalar, tShape>& vector_map)
-      : pointer_(vector_map.data()) {}
+  ConstIterator(const VectorMap<tScalar, tShape>& vector_map,
+                const int start_offset)
+      : pointer_(vector_map.data() + start_offset) {}
   const Scalar operator*() const { return *pointer_; }
   const Scalar* get() const { return pointer_; }
   ConstIterator& operator+=(int inc) { pointer_ += inc; return *this; }
@@ -45,8 +46,9 @@
 
 template <typename tScalar, VectorShape tShape>
 ConstIterator<VectorMap<tScalar, tShape>> const_iterator(
-    const VectorMap<tScalar, tShape>& vector_map) {
-  return ConstIterator<VectorMap<tScalar, tShape>>(vector_map);
+    const VectorMap<tScalar, tShape>& vector_map,
+    const int start_offset) {
+  return ConstIterator<VectorMap<tScalar, tShape>>(vector_map, start_offset);
 }
 
 template <typename tScalar, VectorShape tShape> class VectorDup;
@@ -66,7 +68,8 @@
 
 template <typename tScalar, VectorShape tShape>
 ConstIterator<VectorDup<tScalar, tShape>> const_iterator(
-    const VectorDup<tScalar, tShape>& vector_map) {
+    const VectorDup<tScalar, tShape>& vector_map,
+    const int start_offset) {
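+  // start_offset is ignored: a VectorDup yields the same value at every index.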
   return ConstIterator<VectorDup<tScalar, tShape>>(vector_map);
 }
 
diff --git a/internal/kernel.h b/internal/kernel.h
index 1aceec7..4d006af 100644
--- a/internal/kernel.h
+++ b/internal/kernel.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -145,6 +145,12 @@
   static const int kCells = tCells;
   static const int kWidth = kCells * Cell::kWidth;
   static const int kDepth = Cell::kDepth;
+  typedef std::uint8_t Scalar;
+};
+
+template <typename tCellFormat, int tCells>
+struct KernelSideFormatInt8 : KernelSideFormat<tCellFormat, tCells> {
+  typedef std::int8_t Scalar;
 };
 
 // KernelFormat describes fully the input data layout that a kernel expects.
@@ -210,6 +216,19 @@
   virtual ~KernelBase() {}
 };
 
+template <typename KernelScalarType>
+struct ZeroPointInputValue {};
+
+template <>
+struct ZeroPointInputValue<std::uint8_t> {
+  static constexpr std::uint8_t kValue = 0;
+};
+
+template <>
+struct ZeroPointInputValue<std::int8_t> {
+  static constexpr std::uint8_t kValue = 128;
+};
+
 }  // namespace gemmlowp
 
 #endif  // GEMMLOWP_INTERNAL_KERNEL_H_
diff --git a/internal/kernel_default.h b/internal/kernel_default.h
index 22bf4d0..bba0093 100644
--- a/internal/kernel_default.h
+++ b/internal/kernel_default.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -20,56 +20,86 @@
 
 #include "../public/bit_depth.h"
 #include "common.h"
+#include "kernel_reference.h"
 
 namespace gemmlowp {
 
-enum class KernelFamily { Gemm, Gemv };
+template <bool MaxProductIsLessThan4096,
+          bool LhsAlwaysNonzero>
+struct DefaultKernelImpl {};
 
-template <KernelFamily Family, int ProductBits>
-struct DefaultKernelImpl : DefaultKernelImpl<Family, ProductBits + 1> {
-  static_assert(ProductBits <= 16, "Bit depth too large");
-};
+// Partial specialization implementing the logic that if we want to use
+// a kernel for LhsAlwaysNonzero but do not have such a kernel, then we fall
+// back to a generic kernel not taking advantage of LhsAlwaysNonzero.
+template <bool LhsAlwaysNonzero>
+struct DefaultKernelImpl<true, LhsAlwaysNonzero>
+    : DefaultKernelImpl<false, LhsAlwaysNonzero> {};
 
-template <KernelFamily Family, typename BitDepthParams>
+// Partial specialization implementing the logic that if we want to use
+// a kernel for MaxProductIsLessThan4096 but do not have such a kernel, then we
+// fall back to a generic kernel not taking advantage of
+// MaxProductIsLessThan4096.
+template <bool MaxProductIsLessThan4096>
+struct DefaultKernelImpl<MaxProductIsLessThan4096, true>
+    : DefaultKernelImpl<MaxProductIsLessThan4096, false> {};
+
+template <typename BitDepthParams>
 struct DefaultKernel
-    : DefaultKernelImpl<Family, BitDepthParams::LhsBitDepth::kBits +
-                                    BitDepthParams::RhsBitDepth::kBits> {};
+    : DefaultKernelImpl<(BitDepthParams::LhsRange::kMaxValue *
+                             BitDepthParams::RhsRange::kMaxValue <
+                         4096),
+                        (BitDepthParams::LhsRange::kMinValue > 0)> {};
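+
+// For example, with DefaultL8R8BitDepthParams (both operands in [0, 255]) both
+// booleans are false and the plain uint8 kernel is selected, while bit-depth
+// params whose LhsRange excludes 0 select an int8 kernel exploiting
+// LhsAlwaysNonzero where one is available.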
 
 }  // end namespace gemmlowp
 
-#define GEMMLOWP_SET_DEFAULT_KERNEL(op, max_product_bits, kernel)           \
-  namespace gemmlowp {                                                      \
-  template <>                                                               \
-  struct DefaultKernelImpl<KernelFamily::op, max_product_bits> : kernel {}; \
+#define GEMMLOWP_SET_DEFAULT_KERNEL(MaxProductIsLessThan4096, \
+                                    LhsAlwaysNonzero, Kernel) \
+  namespace gemmlowp {                                        \
+  template <>                                                 \
+  struct DefaultKernelImpl<MaxProductIsLessThan4096,          \
+                           LhsAlwaysNonzero> : Kernel {};     \
   }
 
 #if defined GEMMLOWP_NEON_32
 #include "kernel_neon.h"
-GEMMLOWP_SET_DEFAULT_KERNEL(Gemm, 16, NEON_32_Kernel12x4Depth2)
-GEMMLOWP_SET_DEFAULT_KERNEL(Gemm, 12,
+GEMMLOWP_SET_DEFAULT_KERNEL(false, false, NEON_32_Kernel12x4Depth2)
+GEMMLOWP_SET_DEFAULT_KERNEL(true, false,
                             NEON_32_Kernel12x4Depth2Assuming12BitProducts)
-GEMMLOWP_SET_DEFAULT_KERNEL(Gemv, 16, NEONKernel4Nx1Depth2<3>)
+GEMMLOWP_SET_DEFAULT_KERNEL(false, true,
+                            NEON_32bit_GEMM_Int8Operands_LhsNonzero)
 #elif defined GEMMLOWP_NEON_64
 #include "kernel_neon.h"
-GEMMLOWP_SET_DEFAULT_KERNEL(Gemm, 16, NEON_64_Kernel12x8Depth2)
-GEMMLOWP_SET_DEFAULT_KERNEL(Gemv, 16, NEONKernel4Nx1Depth2<3>)
+GEMMLOWP_SET_DEFAULT_KERNEL(false, false, NEON_64_Kernel12x8Depth2)
+GEMMLOWP_SET_DEFAULT_KERNEL(false, true,
+                            NEON_64bit_GEMM_Int8Operands_LhsNonzero)
 #elif defined GEMMLOWP_SSE4_32
-#include "kernel_SSE.h"
-GEMMLOWP_SET_DEFAULT_KERNEL(Gemm, 16, SSE4_32_Kernel4x4Depth2)
-GEMMLOWP_SET_DEFAULT_KERNEL(Gemv, 16, SSE4_32_Kernel4x4Depth2)
+#include "kernel_sse.h"
+GEMMLOWP_SET_DEFAULT_KERNEL(false, false, SSE4_32_Kernel4x4Depth2)
 #elif defined GEMMLOWP_SSE4_64
-#include "kernel_SSE.h"
-GEMMLOWP_SET_DEFAULT_KERNEL(Gemm, 16, SSE4_64_Kernel12x4Depth2)
-GEMMLOWP_SET_DEFAULT_KERNEL(Gemv, 16, SSE4_64_Kernel12x4Depth2)
+#include "kernel_sse.h"
+GEMMLOWP_SET_DEFAULT_KERNEL(false, false, SSE4_64_Kernel12x4Depth2)
 #else
+#ifndef GEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK
+#if defined __ARM_ARCH_5TE__
+// SIMD is not available on this platform. The slow fallback will be used.
+// Don't require GEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK because there's nothing
+// the user can do about it.
+#else
+#error \
+    "SIMD not enabled, you'd be getting a slow software fallback. Consider \
+enabling SIMD extensions (for example using -msse4 if you're on modern x86). \
+If that's not an option, and you would like to continue with the \
+slow fallback, define GEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK."
+#endif
+#endif
 #include "kernel_reference.h"
 namespace gemmlowp {
-typedef ReferenceKernel<KernelFormat<KernelSideFormat<CellFormat<4, 4>, 2>,
-                                     KernelSideFormat<CellFormat<4, 4>, 2> > >
+typedef ReferenceKernel<KernelFormat<
+    KernelSideFormat<CellFormat<4, 16, CellOrder::WidthMajor>, 1>,
+    KernelSideFormat<CellFormat<4, 16, CellOrder::WidthMajor>, 1> > >
     DefaultReferenceKernel;
 }
-GEMMLOWP_SET_DEFAULT_KERNEL(Gemm, 16, DefaultReferenceKernel)
-GEMMLOWP_SET_DEFAULT_KERNEL(Gemv, 16, DefaultReferenceKernel)
+GEMMLOWP_SET_DEFAULT_KERNEL(false, false, DefaultReferenceKernel)
 #endif
 
 #endif  // GEMMLOWP_INTERNAL_KERNEL_DEFAULT_H_
diff --git a/internal/kernel_neon.h b/internal/kernel_neon.h
index 74b5fec..5c253ba 100644
--- a/internal/kernel_neon.h
+++ b/internal/kernel_neon.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -49,30 +49,13 @@
 //  so use numerical ones instead. See
 // http://stackoverflow.com/questions/3898435/labels-in-gcc-inline-assembly
 // If you add any labels, remember to undef them at the end.
-#define GEMMLOWP_LOOP_NEON_KERNEL_12X4_DEPTH2 "1"
-#define GEMMLOWP_STORE_RESULT_NEON_KERNEL_12X4_DEPTH2 "2"
+#define GEMMLOWP_LABEL_CLEAR_ACCUMULATORS "1"
+#define GEMMLOWP_LABEL_BEFORE_LOOP "2"
+#define GEMMLOWP_LABEL_LOOP "3"
+#define GEMMLOWP_LABEL_AFTER_LOOP "4"
 
     assert(dst_row_stride == 1);
     asm volatile(
-        // Clear accumulator registers (see layout below)
-        "vmov.s32 q4, #0\n"
-        "vmov.s32 q8, q4\n"
-        "vmov.s32 q12, q4\n"
-        "vmov.s32 q5, q4\n"
-        "vmov.s32 q9, q4\n"
-        "vmov.s32 q13, q4\n"
-        "vmov.s32 q6, q4\n"
-        "vmov.s32 q10, q4\n"
-        "vmov.s32 q14, q4\n"
-        "vmov.s32 q7, q4\n"
-        "vmov.s32 q11, q4\n"
-        "vmov.s32 q15, q4\n"
-
-        /* Main loop */
-
-        GEMMLOWP_LOOP_NEON_KERNEL_12X4_DEPTH2
-        ":\n"
-
         // Overview of register layout:
         //
         // A 2x4 cell of Rhs is stored in 16bit in d0--d1 (q0).
@@ -110,12 +93,125 @@
         //                            Accumulator
 
         // Load 1 Rhs cell of size 2x4
-        "vld1.8 {d0}, [%[rhs_ptr]:64]!\n"
-
+        "vld1.8 {d0}, [%[rhs_ptr]]!\n"
         // Load 3 Lhs cells of size 4x2 each
-        "vld1.8 {d2}, [%[lhs_ptr]:64]!\n"
-        "vld1.8 {d4}, [%[lhs_ptr]:64]!\n"
-        "vld1.8 {d6}, [%[lhs_ptr]:64]!\n"
+        "vld1.8 {d2}, [%[lhs_ptr]]!\n"
+        "vld1.8 {d4}, [%[lhs_ptr]]!\n"
+        "vld1.8 {d6}, [%[lhs_ptr]]!\n"
+
+        // Check if start_depth==0 to decide whether we will clear
+        // accumulators or load existing accumulators.
+        "cmp %[start_depth], #0\n"
+
+        // Multiply dst_col_stride by 4 == sizeof(int32) to use
+        // it as a byte offset below.
+        "lsl %[dst_col_stride], #2\n"
+
+        "beq " GEMMLOWP_LABEL_CLEAR_ACCUMULATORS
+        "f\n"
+
+        // Load accumulators (start_depth != 0)
+        "mov r1, %[dst_ptr]\n"
+        "subs %[run_depth], #2\n"
+        "mov r0, r1\n"
+        "vld1.32 {d8, d9},   [r0]!\n"
+        "add r1, %[dst_col_stride]\n"
+        "vld1.32 {d16, d17}, [r0]!\n"
+        "vld1.32 {d24, d25}, [r0]\n"
+        "mov r0, r1\n"
+        "vld1.32 {d10, d11}, [r0]!\n"
+        "add r1, %[dst_col_stride]\n"
+        "vld1.32 {d18, d19}, [r0]!\n"
+        "vld1.32 {d26, d27}, [r0]\n"
+        "mov r0, r1\n"
+        "vld1.32 {d12, d13}, [r0]!\n"
+        "add r1, %[dst_col_stride]\n"
+        "vld1.32 {d20, d21}, [r0]!\n"
+        "vld1.32 {d28, d29}, [r0]\n"
+        "mov r0, r1\n"
+        "vld1.32 {d14, d15}, [r0]!\n"
+        "vld1.32 {d22, d23}, [r0]!\n"
+        "vld1.32 {d30, d31}, [r0]\n"
+
+        "b " GEMMLOWP_LABEL_BEFORE_LOOP "f\n"
+
+        GEMMLOWP_LABEL_CLEAR_ACCUMULATORS
+        ":\n"
+
+        // Clear accumulators (start_depth == 0)
+        "vmov.s32 q4, #0\n"
+        "subs %[run_depth], #2\n"
+        "vmov.s32 q8, q4\n"
+        "vmov.s32 q12, q4\n"
+        "vmov.s32 q5, q4\n"
+        "vmov.s32 q9, q4\n"
+        "vmov.s32 q13, q4\n"
+        "vmov.s32 q6, q4\n"
+        "vmov.s32 q10, q4\n"
+        "vmov.s32 q14, q4\n"
+        "vmov.s32 q7, q4\n"
+        "vmov.s32 q11, q4\n"
+        "vmov.s32 q15, q4\n"
+
+        GEMMLOWP_LABEL_BEFORE_LOOP
+        ":\n"
+
+        // If there are only two levels of depth, skip the loop.
+        "beq " GEMMLOWP_LABEL_AFTER_LOOP "f\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+        // Expand Lhs/Rhs cells to 16 bit.
+        // Note: moving these vmovl's further down to allow for
+        // longer data pipelining helps a little on A57 but is
+        // harmful on A53; it looks as if A53 doesn't like
+        // interleaving vmovl's into the vmlal's.
+        "vmovl.u8 q0, d0\n"
+        "vmovl.u8 q1, d2\n"
+        "vmovl.u8 q2, d4\n"
+        "vmovl.u8 q3, d6\n"
+
+        // Multiply-accumulate, level of depth 0
+        "vmlal.u16 q4, d2, d0[0]\n"
+        "vmlal.u16 q5, d2, d0[1]\n"
+        "vmlal.u16 q6, d2, d0[2]\n"
+        "vmlal.u16 q7, d2, d0[3]\n"
+        "vldr d2, [%[lhs_ptr]]\n"
+        "vmlal.u16 q8, d4, d0[0]\n"
+        "vmlal.u16 q9, d4, d0[1]\n"
+        "vmlal.u16 q10, d4, d0[2]\n"
+        "vmlal.u16 q11, d4, d0[3]\n"
+        "vldr d4, [%[lhs_ptr], #8]\n"
+        "vmlal.u16 q12, d6, d0[0]\n"
+        "vmlal.u16 q13, d6, d0[1]\n"
+        "vmlal.u16 q14, d6, d0[2]\n"
+        "vmlal.u16 q15, d6, d0[3]\n"
+        "vldr d6, [%[lhs_ptr], #16]\n"
+        "vldr d0, [%[rhs_ptr]]\n"
+
+        // Multiply-accumulate, level of depth 1
+        "vmlal.u16 q4, d3, d1[0]\n"
+        "vmlal.u16 q5, d3, d1[1]\n"
+        "add %[lhs_ptr], #24\n"
+        "vmlal.u16 q6, d3, d1[2]\n"
+        "vmlal.u16 q7, d3, d1[3]\n"
+        "add %[rhs_ptr], #8\n"
+        "vmlal.u16 q8, d5, d1[0]\n"
+        "vmlal.u16 q9, d5, d1[1]\n"
+        "subs %[run_depth], #2\n"
+        "vmlal.u16 q10, d5, d1[2]\n"
+        "vmlal.u16 q11, d5, d1[3]\n"
+        "vmlal.u16 q12, d7, d1[0]\n"
+        "vmlal.u16 q13, d7, d1[1]\n"
+        "vmlal.u16 q14, d7, d1[2]\n"
+        "vmlal.u16 q15, d7, d1[3]\n"
+
+        "bne " GEMMLOWP_LABEL_LOOP "b\n"
+
+        GEMMLOWP_LABEL_AFTER_LOOP
+        ":\n"
+
+        // Do remaining arithmetic for the last 2 levels of depth.
 
         // Expand Lhs/Rhs cells to 16 bit.
         "vmovl.u8 q0, d0\n"
@@ -151,99 +247,27 @@
         "vmlal.u16 q14, d7, d1[2]\n"
         "vmlal.u16 q15, d7, d1[3]\n"
 
-        // Loop. Decrement loop index (depth) by 2, since we just handled 2
-        // levels of depth (Kernel::kDepth=2).
-        "subs %[run_depth], #2\n"
-        "bne " GEMMLOWP_LOOP_NEON_KERNEL_12X4_DEPTH2
-        "b\n"
-
-        /* end of main loop */
-
-        /* Accumulate our local accumulator registers into the destination block
-           */
-
-        // Compute stride between consecutive columns, in bytes
-        "mov r0, #4\n"  // multiply by 4 = sizeof(int32)
-        "mul %[dst_col_stride], r0\n"
-
-        // If start_depth == 0, then there is no preexisting accumulator
-        // to accumulate, so we can simply store our result.
-        "cmp %[start_depth], #0\n"
-        "beq " GEMMLOWP_STORE_RESULT_NEON_KERNEL_12X4_DEPTH2
-        "f\n"
-
-        "mov r0, %[dst_ptr]\n"
-
-        // Load a column
-        "mov r1, r0\n"
-        "vld1.32 {d0, d1}, [r1]!\n"
-        "vld1.32 {d2, d3}, [r1]!\n"
-        "vld1.32 {d4, d5}, [r1]!\n"
-        // Accumulate a column
-        "vadd.s32 q4, q4, q0\n"
-        "vadd.s32 q8, q8, q1\n"
-        "vadd.s32 q12, q12, q2\n"
-
-        "add r0, %[dst_col_stride]\n"
-        // Load a column
-        "mov r1, r0\n"
-        "vld1.32 {d0, d1}, [r1]!\n"
-        "vld1.32 {d2, d3}, [r1]!\n"
-        "vld1.32 {d4, d5}, [r1]!\n"
-        // Accumulate a column
-        "vadd.s32 q5, q5, q0\n"
-        "vadd.s32 q9, q9, q1\n"
-        "vadd.s32 q13, q13, q2\n"
-
-        "add r0, %[dst_col_stride]\n"
-        // Load a column
-        "mov r1, r0\n"
-        "vld1.32 {d0, d1}, [r1]!\n"
-        "vld1.32 {d2, d3}, [r1]!\n"
-        "vld1.32 {d4, d5}, [r1]!\n"
-        // Accumulate a column
-        "vadd.s32 q6, q6, q0\n"
-        "vadd.s32 q10, q10, q1\n"
-        "vadd.s32 q14, q14, q2\n"
-
-        "add r0, %[dst_col_stride]\n"
-        // Load a column
-        "mov r1, r0\n"
-        "vld1.32 {d0, d1}, [r1]!\n"
-        "vld1.32 {d2, d3}, [r1]!\n"
-        "vld1.32 {d4, d5}, [r1]!\n"
-        // Accumulate a column
-        "vadd.s32 q7, q7, q0\n"
-        "vadd.s32 q11, q11, q1\n"
-        "vadd.s32 q15, q15, q2\n"
-
-        GEMMLOWP_STORE_RESULT_NEON_KERNEL_12X4_DEPTH2
-        ":\n"
-
-        "mov r0, %[dst_ptr]\n"
-        // Store a column
-        "mov r1, r0\n"
-        "vst1.32 {d8, d9}, [r1]!\n"
-        "vst1.32 {d16, d17}, [r1]!\n"
-        "vst1.32 {d24, d25}, [r1]!\n"
-        // Store a column
-        "add r0, %[dst_col_stride]\n"
-        "mov r1, r0\n"
-        "vst1.32 {d10, d11}, [r1]!\n"
-        "vst1.32 {d18, d19}, [r1]!\n"
-        "vst1.32 {d26, d27}, [r1]!\n"
-        // Store a column
-        "add r0, %[dst_col_stride]\n"
-        "mov r1, r0\n"
-        "vst1.32 {d12, d13}, [r1]!\n"
-        "vst1.32 {d20, d21}, [r1]!\n"
-        "vst1.32 {d28, d29}, [r1]!\n"
-        // Store a column
-        "add r0, %[dst_col_stride]\n"
-        "mov r1, r0\n"
-        "vst1.32 {d14, d15}, [r1]!\n"
-        "vst1.32 {d22, d23}, [r1]!\n"
-        "vst1.32 {d30, d31}, [r1]!\n"
+        // Store accumulators
+        "mov r1, %[dst_ptr]\n"
+        "mov r0, r1\n"
+        "vst1.32 {d8, d9},   [r0]!\n"
+        "add r1, %[dst_col_stride]\n"
+        "vst1.32 {d16, d17}, [r0]!\n"
+        "vst1.32 {d24, d25}, [r0]\n"
+        "mov r0, r1\n"
+        "vst1.32 {d10, d11}, [r0]!\n"
+        "add r1, %[dst_col_stride]\n"
+        "vst1.32 {d18, d19}, [r0]!\n"
+        "vst1.32 {d26, d27}, [r0]\n"
+        "mov r0, r1\n"
+        "vst1.32 {d12, d13}, [r0]!\n"
+        "add r1, %[dst_col_stride]\n"
+        "vst1.32 {d20, d21}, [r0]!\n"
+        "vst1.32 {d28, d29}, [r0]\n"
+        "mov r0, r1\n"
+        "vst1.32 {d14, d15}, [r0]!\n"
+        "vst1.32 {d22, d23}, [r0]!\n"
+        "vst1.32 {d30, d31}, [r0]\n"
         :  // outputs
         [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
         [dst_ptr] "+r"(dst_ptr),
@@ -259,8 +283,10 @@
         "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
         "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30",
         "d31");
-#undef GEMMLOWP_LOOP_NEON_KERNEL_12X4_DEPTH2
-#undef GEMMLOWP_STORE_RESULT_NEON_KERNEL_12X4_DEPTH2
+#undef GEMMLOWP_LABEL_CLEAR_ACCUMULATORS
+#undef GEMMLOWP_LABEL_BEFORE_LOOP
+#undef GEMMLOWP_LABEL_LOOP
+#undef GEMMLOWP_LABEL_AFTER_LOOP
   }
 };
 
@@ -638,11 +664,604 @@
   }
 };
 
+struct NEON_32bit_GEMM_Int8Operands_LhsNonzero : KernelBase {
+  typedef KernelFormat<
+      KernelSideFormatInt8<CellFormat<4, 16, CellOrder::WidthMajor>, 1>,
+      KernelSideFormatInt8<CellFormat<2, 16, CellOrder::WidthMajor>, 1> >
+      Format;
+  const char* Name() const override {
+    return "NEON, 4x2, depth 16, accumulating two within signed int16";
+  }
+
+  // TODO(benoitjacob): reorder function arguments so dst comes last
+  void Run(std::int32_t* dst_ptr, std::size_t dst_row_stride,
+           std::size_t dst_col_stride, const std::uint8_t* lhs_ptr,
+           const std::uint8_t* rhs_ptr, std::size_t start_depth,
+           std::size_t run_depth) const override {
+#define GEMMLOWP_LABEL_AFTER_LOOP "1"
+#define GEMMLOWP_LABEL_LOOP "2"
+#define GEMMLOWP_LABEL_ACCUMULATE_EXISTING_DST_VALUES "3"
+#define GEMMLOWP_LABEL_STORE "4"
+    asm volatile(
+        // Multiply dst_col_stride by 4 == sizeof(int32) to use
+        // it as a byte offset below.
+        "lsl %[dst_col_stride], %[dst_col_stride], #2\n"
+
+        // Overview of register layout:
+        //
+        // A 2x16 block of Rhs is stored in 8 bit in d0--d3.
+        // A 4x16 block of Lhs is used, stored in 8 bit, but d4--d7 can hold
+        // only half of it, so we loop over these registers twice; only a
+        // 2x16 half-block is resident in d4--d7 at any given time.
+        //
+        // A 4x2 block of accumulators is stored in q8--q15 (as 4x32 bit
+        // components which need to be horizontally-added at the end)
+        //
+        // The Lhs vectors are multiplied by the Rhs vectors with a widening
+        // multiply over the 8 first levels of depth, producing int16x8
+        // vectors of products for each position in the accumulator matrix.
+        // Here comes the special trick: the operands are signed int8, and the
+        // original Lhs inputs are guaranteed nonzero, so after the -128
+        // offset applied during packing the Lhs values are never -2^7. Each
+        // product therefore has magnitude at most 127 * 128 = 16256, and two
+        // such products can be accumulated into an int16 without any risk of
+        // overflow.
+        // We thus proceed with the 8 next levels of depth, multiplying
+        // again Lhs by Rhs, accumulating into this existing int16x8 vector.
+        //
+        // Only then, having processed 16 levels of depth, do we need to
+        // horizontally add these int16x8 accumulators into the final
+        // int32x4 accumulators.
+        //
+        // As we do not have enough registers to store all 16 int16x8
+        // temporary-16bit-accumulators, we have them cycle through q4--q7.
+        //
+        //
+        // Register layout (ignoring the q4--q7 temporary 16bit accumulators):
+        //
+        //                               +----+----+
+        //                               | d0 | d2 |
+        //                               | .  | .  |
+        //                               | .  | .  |
+        //                               | .  | .  |
+        //                       Rhs     +----+----+
+        //                               | d1 | d3 |
+        //                               | .  | .  |
+        //                               | .  | .  |
+        //                               | .  | .  |
+        //                               +----+----+
+        //
+        //                               |    |    |
+        //
+        //    Lhs                        |    |    |
+        //
+        //  +--------+--------+ - - - -  +----+----+
+        //  | d4 ... | d5 ... |          | q8 | q9 |
+        //  | d6 ... | d7 ... |          | q10| q11|
+        //  | d4 ... | d5 ... |          | q12| q13|
+        //  | d6 ... | d7 ... |          | q14| q15|
+        //  +--------+--------+ - - - -  +----+----+
+        //
+        //                               Accumulator
+        //
+
+        // Clear accumulators, and, interleaved with it,
+        // initial loads of the first loop iteration,
+        // taken out of the loop so that in the loop itself we have
+        // optimal streaming of data from memory.
+        "vldr d0, [%[rhs_ptr], #0]\n"
+        "vmov.i32 q8, #0\n"
+        "vldr d4, [%[lhs_ptr], #0]\n"
+        "vmov.i32 q9, #0\n"
+        "vldr d2, [%[rhs_ptr], #16]\n"
+        "vmov.i32 q10, q8\n"
+        "vldr d6, [%[lhs_ptr], #16]\n"
+        "vmov.i32 q11, q8\n"
+        "vldr d1, [%[rhs_ptr], #8]\n"
+        "vmov.i32 q12, q8\n"
+        "vldr d5, [%[lhs_ptr], #8]\n"
+        "vmov.i32 q13, q8\n"
+        "vldr d3, [%[rhs_ptr], #24]\n"
+        "vmov.i32 q14, q8\n"
+        "vldr d7, [%[lhs_ptr], #24]\n"
+        "vmov.i32 q15, q8\n"
+
+        // General loop.
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Multiply 8 first levels of depth.
+        "vmull.s8    q4,  d0,  d4\n"
+        "add %[rhs_ptr], %[rhs_ptr], #32\n"
+        "vmull.s8    q5,  d2,  d4\n"
+        "vldr d4, [%[lhs_ptr], #32]\n"
+        "vmull.s8    q6,  d0,  d6\n"
+        "vmull.s8    q7,  d2,  d6\n"
+        "vldr d6, [%[lhs_ptr], #48]\n"
+
+        // Multiply-accumulate second-half, again into the same
+        // 16bit local accumulator registers. This is where we
+        // take advantage of having int8 instead of uint8 and therefore
+        // being able to accumulate two products into int16.
+        "vmlal.s8    q4,  d1,  d5\n"
+        "vmlal.s8    q5,  d3,  d5\n"
+        "vldr d5, [%[lhs_ptr], #40]\n"
+        "vmlal.s8    q6,  d1,  d7\n"
+        "vmlal.s8    q7,  d3,  d7\n"
+        "vldr d7, [%[lhs_ptr], #56]\n"
+
+        // Add pairwise, accumulate into 32-bit accumulators.
+        "vpadal.s16   q8,  q4\n"
+        "add %[lhs_ptr], %[lhs_ptr], #64\n"
+        "vpadal.s16   q9,  q5\n"
+        "subs %[run_depth], %[run_depth], #16\n"
+        "vpadal.s16   q10, q6\n"
+        "vpadal.s16   q11, q7\n"
+
+        "beq " GEMMLOWP_LABEL_AFTER_LOOP
+        "f\n"
+
+        // Multiply first half.
+        "vmull.s8    q4,  d0,  d4\n"
+        "vmull.s8    q5,  d2,  d4\n"
+        "vldr d4, [%[lhs_ptr], #0]\n"
+        "vmull.s8    q6,  d0,  d6\n"
+        "vldr d0, [%[rhs_ptr], #0]\n"
+        "vmull.s8    q7,  d2,  d6\n"
+        "vldr d2, [%[rhs_ptr], #16]\n"
+
+        // Multiply-accumulate second-half, again into the same
+        // 16bit local accumulator registers. This is where we
+        // take advantage of having int8 instead of uint8 and therefore
+        // being able to accumulate two products into int16.
+        "vmlal.s8    q4,  d1,  d5\n"
+        "vldr d6, [%[lhs_ptr], #16]\n"
+        "vmlal.s8    q5,  d3,  d5\n"
+        "vldr d5, [%[lhs_ptr], #8]\n"
+        "vmlal.s8    q6,  d1,  d7\n"
+        "vldr d1, [%[rhs_ptr], #8]\n"
+        "vmlal.s8    q7,  d3,  d7\n"
+        "vldr d3, [%[rhs_ptr], #24]\n"
+
+        // Add pairwise, accumulate into 32-bit accumulators.
+        "vpadal.s16   q12, q4\n"
+        "vldr d7, [%[lhs_ptr], #24]\n"
+        "vpadal.s16   q13, q5\n"
+        "vpadal.s16   q14, q6\n"
+        "vpadal.s16   q15, q7\n"
+
+        "b " GEMMLOWP_LABEL_LOOP "b\n"
+
+        GEMMLOWP_LABEL_AFTER_LOOP
+        ":\n"
+
+        // Multiply first half.
+        "vmull.s8    q4,  d0,  d4\n"
+        "vmull.s8    q5,  d2,  d4\n"
+        "vmull.s8    q6,  d0,  d6\n"
+        "vmull.s8    q7,  d2,  d6\n"
+
+        // Multiply-accumulate second-half, again into the same
+        // 16bit local accumulator registers. This is where we
+        // take advantage of having int8 instead of uint8 and therefore
+        // being able to accumulate two products into int16.
+        "vmlal.s8    q4,  d1,  d5\n"
+        "vmlal.s8    q5,  d3,  d5\n"
+        "vmlal.s8    q6,  d1,  d7\n"
+        "vmlal.s8    q7,  d3,  d7\n"
+
+        // Add pairwise, accumulate into 32-bit accumulators.
+        "vpadal.s16   q12, q4\n"
+        "vpadal.s16   q13, q5\n"
+        "vpadal.s16   q14, q6\n"
+        "vpadal.s16   q15, q7\n"
+        "cmp %[start_depth], #0\n"
+
+        // Reduce 32bit accumulators horizontally.
+        "vpadd.s32 d0, d16, d17\n"
+        "vpadd.s32 d1, d18, d19\n"
+        "vpadd.s32 d2, d20, d21\n"
+        "vpadd.s32 d3, d22, d23\n"
+        "vpadd.s32 d4, d24, d25\n"
+        "vpadd.s32 d5, d26, d27\n"
+        "vpadd.s32 d6, d28, d29\n"
+        "vpadd.s32 d7, d30, d31\n"
+
+        "bne " GEMMLOWP_LABEL_ACCUMULATE_EXISTING_DST_VALUES
+        "f\n"
+
+        // Reduce 32bit accumulators horizontally, second pass
+        // (each pass adds pairwise; we need to add 4-wise).
+        "vpadd.s32 d8, d0, d2\n"
+        "vpadd.s32 d9, d4, d6\n"
+        "vpadd.s32 d10, d1, d3\n"
+        "vpadd.s32 d11, d5, d7\n"
+
+        "b " GEMMLOWP_LABEL_STORE "f\n"
+
+        GEMMLOWP_LABEL_ACCUMULATE_EXISTING_DST_VALUES
+        ":\n"
+
+        // Reduce 32bit accumulators horizontally, second pass
+        // (each pass adds pairwise; we need to add 4-wise),
+        // and load destination values from memory.
+        "mov r0, %[dst_ptr]\n"
+        "vld1.32 {d16, d17}, [r0], %[dst_col_stride]\n"
+        "vpadd.s32 d8, d0, d2\n"
+        "vpadd.s32 d9, d4, d6\n"
+        "vld1.32 {d18, d19}, [r0]\n"
+        "vpadd.s32 d10, d1, d3\n"
+        "vpadd.s32 d11, d5, d7\n"
+
+        // Add horizontally-reduced accumulators into
+        // the values loaded from memory
+        "vadd.s32 q4, q8, q4\n"
+        "vadd.s32 q5, q9, q5\n"
+
+        GEMMLOWP_LABEL_STORE
+        ":\n"
+        // Store back into memory
+        "mov r0, %[dst_ptr]\n"
+        "vst1.32 {d8, d9}, [r0], %[dst_col_stride]\n"
+        "vst1.32 {d10, d11}, [r0]\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [dst_ptr] "+r"(dst_ptr), [run_depth] "+r"(run_depth)
+        :  // inputs
+        [start_depth] "r"(start_depth),
+        [dst_col_stride] "r"(dst_col_stride)
+        :  // clobbers
+        "cc", "memory", "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
+        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
+        "d28", "d29", "d30", "d31");
+#undef GEMMLOWP_LABEL_LOOP
+#undef GEMMLOWP_LABEL_AFTER_LOOP
+#undef GEMMLOWP_LABEL_ACCUMULATE_EXISTING_DST_VALUES
+#undef GEMMLOWP_LABEL_STORE
+  }
+};
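+
+// Illustrative-only scalar model (not part of this patch) of the int8 trick
+// used by the kernel above: each pair of int8*int8 products is summed into an
+// int16 temporary (vmull.s8 followed by vmlal.s8), and the int16 temporaries
+// are then pairwise-accumulated into int32 (vpadal.s16). The function name is
+// hypothetical; it only assumes <cstdint>, which this file already relies on.
+inline std::int32_t ScalarInt8DotProductDepth16(const std::int8_t* lhs,
+                                                const std::int8_t* rhs) {
+  std::int32_t acc32 = 0;
+  for (int d = 0; d < 16; d += 2) {
+    // Two int8*int8 products fit in int16 provided one operand never takes
+    // the value -128; as the kernel's name suggests, the Lhs requirement is
+    // meant to guarantee that.
+    std::int16_t acc16 = static_cast<std::int16_t>(
+        lhs[d] * rhs[d] + lhs[d + 1] * rhs[d + 1]);
+    // Widening pairwise accumulation into the 32-bit accumulator.
+    acc32 += acc16;
+  }
+  return acc32;
+}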
+
 #endif  // GEMMLOWP_NEON_32
 
 // The kernels here are specifically arm 64bit assembly, not arm 32bit.
 #ifdef GEMMLOWP_NEON_64
 
+struct NEON_64bit_GEMM_Int8Operands_LhsNonzero : KernelBase {
+  typedef KernelFormat<
+      KernelSideFormatInt8<CellFormat<4, 16, CellOrder::WidthMajor>, 1>,
+      KernelSideFormatInt8<CellFormat<4, 16, CellOrder::WidthMajor>, 1> >
+      Format;
+  const char* Name() const override {
+    return "NEON, 4x4, depth 16, accumulating two within signed int16";
+  }
+
+  // TODO(benoitjacob): reorder function arguments so dst comes last
+  void Run(std::int32_t* dst_ptr, std::size_t dst_row_stride,
+           std::size_t dst_col_stride, const std::uint8_t* lhs_ptr,
+           const std::uint8_t* rhs_ptr, std::size_t start_depth,
+           std::size_t run_depth) const override {
+#define GEMMLOWP_LABEL_AFTER_LOOP_LAST16 "1"
+#define GEMMLOWP_LABEL_LOOP "2"
+#define GEMMLOWP_LABEL_ACCUMULATE_EXISTING_DST_VALUES "3"
+#define GEMMLOWP_LABEL_STORE "4"
+    asm volatile(
+        // Clear accumulators, and, interleaved with it,
+        // initial loads of the first loop iteration,
+        // taken out of the loop so that in the loop itself we have
+        // optimal streaming of data from memory.
+        "ld1 {v0.16b}, [%[rhs_ptr]], #16\n"
+        "dup v16.4s, wzr\n"
+        "ld1 {v4.16b}, [%[lhs_ptr]], #16\n"
+        "dup v17.4s, wzr\n"
+        "ld1 {v1.16b}, [%[rhs_ptr]], #16\n"
+        "dup v18.4s, wzr\n"
+        "ld1 {v5.16b}, [%[lhs_ptr]], #16\n"
+        "dup v19.4s, wzr\n"
+        "ld1 {v2.16b}, [%[rhs_ptr]], #16\n"
+        "dup v20.4s, wzr\n"
+        "ld1 {v3.16b}, [%[rhs_ptr]], #16\n"
+        "dup v21.4s, wzr\n"
+        "ld1 {v6.16b}, [%[lhs_ptr]], #16\n"
+        "dup v22.4s, wzr\n"
+        "ld1 {v7.16b}, [%[lhs_ptr]], #16\n"
+        "dup v23.4s, wzr\n"
+        "dup v24.4s, wzr\n"
+        "dup v25.4s, wzr\n"
+        "dup v26.4s, wzr\n"
+        "dup v27.4s, wzr\n"
+        "dup v28.4s, wzr\n"
+        "dup v29.4s, wzr\n"
+        "dup v30.4s, wzr\n"
+        "dup v31.4s, wzr\n"
+
+        // Multiply dst_col_stride by 4 == sizeof(int32) to use
+        // it as a byte offset below.
+        "lsl %[dst_col_stride], %[dst_col_stride], #2\n"
+
+        // Initial arithmetic of the first loop iteration,
+        // taken out of the loop so that in the loop itself we have
+        // optimal streaming of data from memory.
+        "smull    v8.8h,  v0.8b,  v4.8b\n"
+        "smull    v9.8h,  v1.8b,  v4.8b\n"
+        "smull    v10.8h,  v2.8b,  v4.8b\n"
+        "smull    v11.8h,  v3.8b,  v4.8b\n"
+        "smull    v12.8h,  v0.8b,  v5.8b\n"
+        "smull    v13.8h,  v1.8b,  v5.8b\n"
+        "smull    v14.8h,  v2.8b,  v5.8b\n"
+        "smull    v15.8h,  v3.8b,  v5.8b\n"
+
+        // Multiply-accumulate second-half, again into the same
+        // 16bit local accumulator registers. This is where we
+        // take advantage of having int8 instead of uint8 and therefore
+        // being able to accumulate two products into int16.
+        "smlal2   v8.8h,  v0.16b,  v4.16b\n"
+        "smlal2   v9.8h,  v1.16b,  v4.16b\n"
+        "smlal2   v10.8h,  v2.16b,  v4.16b\n"
+        "smlal2   v11.8h,  v3.16b,  v4.16b\n"
+        "smlal2   v12.8h,  v0.16b,  v5.16b\n"
+        "smlal2   v13.8h,  v1.16b,  v5.16b\n"
+        "smlal2   v14.8h,  v2.16b,  v5.16b\n"
+        "smlal2   v15.8h,  v3.16b,  v5.16b\n"
+
+        "subs %[run_depth], %[run_depth], #16\n"
+
+        // If the loop depth is only 16, then we can skip the general loop
+        // and go straight to the final part of the code.
+        "beq " GEMMLOWP_LABEL_AFTER_LOOP_LAST16 "f\n"
+
+        // General loop.
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Overview of register layout:
+        //
+        // A 4x16 block of Rhs is stored in 8 bit in v0--v3.
+        // A 4x16 block of Lhs is stored in 8 bit in v4--v7.
+        //
+        // A 4x4 block of accumulators is stored in v16-v31 (as 4x32 bit
+        // components which need to be horizontally-added at the end)
+        //
+        // The Lhs vectors are multiplied by the Rhs vectors with a widening
+        // multiply over the 8 first levels of depth, producing int16x8
+        // vectors of products for each position in the accumulator matrix.
+        // Here comes the special trick: the operands are signed int8, and
+        // the LhsNonzero requirement ensures that Lhs values never take the
+        // value -2^7, so all products are in the range [ -2^14 , 2^14 - 1 ),
+        // meaning that we can add two such products without any risk of
+        // overflowing int16.
+        // We thus proceed with the 8 next levels of depth, multiplying
+        // again Lhs by Rhs, accumulating into this existing int16x8 vector.
+        //
+        // Only then, having processed 16 levels of depth, do we need to
+        // horizontally add these int16x8 accumulators into the final
+        // int32x4 accumulators.
+        //
+        // As we do not have enough registers to store all 16 int16x8
+        // temporary-16bit-accumulators, we have them cycle through v8--v15.
+        //
+        //
+        // Register layout (ignoring the v8--v15 temporary 16bit accumulators):
+        //
+        //                               +--------+--------+--------+--------+
+        //                               |v0.b[0] |v1.b[0] |v2.b[0] |v3.b[0] |
+        //                          Rhs  +--------+--------+--------+--------+
+        //                               |  ...   |  ...   |  ...   |  ...   |
+        //                               +--------+--------+--------+--------+
+        //                               |v0.b[15]|v1.b[15]|v2.b[15]|v3.b[15]|
+        //                               +--------+--------+--------+--------+
+        //
+        //                               |        |        |        |        |
+        //
+        //    Lhs                        |        |        |        |        |
+        //
+        //  +-------+-----+--------+ - - +--------+--------+--------+--------+
+        //  |v4.b[0]| ... |v4.b[15]|     | v16.4s | v17.4s | v18.4s | v19.4s |
+        //  |v5.b[0]| ... |v5.b[15]|     | v20.4s | v21.4s | v22.4s | v23.4s |
+        //  |v6.b[0]| ... |v6.b[15]|     | v24.4s | v25.4s | v26.4s | v27.4s |
+        //  |v7.b[0]| ... |v7.b[15]|     | v28.4s | v29.4s | v30.4s | v31.4s |
+        //  +-------+-----+--------+ - - +--------+--------+--------+--------+
+        //
+        //                                                Accumulator
+        //
+
+        // Some multiplications and 16-bit accumulation were already done above,
+        // so we start right away in the middle.
+        "sadalp  v16.4s, v8.8h\n"
+        "ld1 {v4.16b}, [%[lhs_ptr]], #16\n"
+        "smull    v8.8h,  v0.8b,  v6.8b\n"
+        "sadalp  v17.4s, v9.8h\n"
+        "ld1 {v5.16b}, [%[lhs_ptr]], #16\n"
+        "smull    v9.8h,  v1.8b,  v6.8b\n"
+        "sadalp  v18.4s, v10.8h\n"
+        "smull    v10.8h,  v2.8b,  v6.8b\n"
+        "sadalp  v19.4s, v11.8h\n"
+        "smull    v11.8h,  v3.8b,  v6.8b\n"
+        "sadalp  v20.4s, v12.8h\n"
+        "smull    v12.8h,  v0.8b,  v7.8b\n"
+        "sadalp  v21.4s, v13.8h\n"
+        "smull    v13.8h,  v1.8b,  v7.8b\n"
+        "sadalp  v22.4s, v14.8h\n"
+        "smull    v14.8h,  v2.8b,  v7.8b\n"
+        "sadalp  v23.4s, v15.8h\n"
+        "smull    v15.8h,  v3.8b,  v7.8b\n"
+
+        // Multiply-accumulate second-half, again into the same
+        // 16bit local accumulator registers. This is where we
+        // take advantage of having int8 instead of uint8 and therefore
+        // being able to accumulate two products into int16.
+        "smlal2   v8.8h,  v0.16b,  v6.16b\n"
+        "smlal2   v9.8h,  v1.16b,  v6.16b\n"
+        "smlal2   v10.8h,  v2.16b,  v6.16b\n"
+        "smlal2   v11.8h,  v3.16b,  v6.16b\n"
+
+        "ld1 {v6.16b}, [%[lhs_ptr]], #16\n"
+
+        "smlal2   v12.8h,  v0.16b,  v7.16b\n"
+        "ld1 {v0.16b}, [%[rhs_ptr]], #16\n"
+        "smlal2   v13.8h,  v1.16b,  v7.16b\n"
+        "ld1 {v1.16b}, [%[rhs_ptr]], #16\n"
+        "smlal2   v14.8h,  v2.16b,  v7.16b\n"
+        "ld1 {v2.16b}, [%[rhs_ptr]], #16\n"
+        "smlal2   v15.8h,  v3.16b,  v7.16b\n"
+        "ld1 {v3.16b}, [%[rhs_ptr]], #16\n"
+
+        "sadalp  v24.4s, v8.8h\n"
+        "smull    v8.8h,  v0.8b,  v4.8b\n"
+        "sadalp  v25.4s, v9.8h\n"
+        "ld1 {v7.16b}, [%[lhs_ptr]], #16\n"
+        "smull    v9.8h,  v1.8b,  v4.8b\n"
+        "sadalp  v26.4s, v10.8h\n"
+        "smull    v10.8h,  v2.8b,  v4.8b\n"
+        "sadalp  v27.4s, v11.8h\n"
+        "smull    v11.8h,  v3.8b,  v4.8b\n"
+        "sadalp  v28.4s, v12.8h\n"
+        "smull    v12.8h,  v0.8b,  v5.8b\n"
+        "sadalp  v29.4s, v13.8h\n"
+        "smull    v13.8h,  v1.8b,  v5.8b\n"
+        "sadalp  v30.4s, v14.8h\n"
+        "smull    v14.8h,  v2.8b,  v5.8b\n"
+        "sadalp  v31.4s, v15.8h\n"
+        "smull    v15.8h,  v3.8b,  v5.8b\n"
+
+        // Multiply-accumulate second-half, again into the same
+        // 16bit local accumulator registers. This is where we
+        // take advantage of having int8 instead of uint8 and therefore
+        // being able to accumulate two products into int16.
+        "smlal2   v8.8h,  v0.16b,  v4.16b\n"
+        "smlal2   v9.8h,  v1.16b,  v4.16b\n"
+        "smlal2   v10.8h,  v2.16b,  v4.16b\n"
+        "smlal2   v11.8h,  v3.16b,  v4.16b\n"
+
+        // Loop. Decrement loop index (depth) by 16, since we just handled
+        // 16 levels of depth.  Do this subs a bit before the end of the loop
+        // for better dispatch on A57.
+        "subs %[run_depth], %[run_depth], #16\n"
+
+        "smlal2   v12.8h,  v0.16b,  v5.16b\n"
+        "smlal2   v13.8h,  v1.16b,  v5.16b\n"
+        "smlal2   v14.8h,  v2.16b,  v5.16b\n"
+        "smlal2   v15.8h,  v3.16b,  v5.16b\n"
+
+        "bne " GEMMLOWP_LABEL_LOOP "b\n"
+
+        // Final code for the last 16 levels of depth.
+        // There is nothing to load anymore, only some arithmetic to finish.
+        GEMMLOWP_LABEL_AFTER_LOOP_LAST16
+        ":\n"
+
+        // Some multiplications and 16-bit accumulation were already done above,
+        // so we start right away in the middle.
+        "sadalp  v16.4s, v8.8h\n"
+        "smull    v8.8h,  v0.8b,  v6.8b\n"
+        "sadalp  v17.4s, v9.8h\n"
+        "smull    v9.8h,  v1.8b,  v6.8b\n"
+        "sadalp  v18.4s, v10.8h\n"
+        "smull    v10.8h,  v2.8b,  v6.8b\n"
+        "sadalp  v19.4s, v11.8h\n"
+        "smull    v11.8h,  v3.8b,  v6.8b\n"
+        "sadalp  v20.4s, v12.8h\n"
+        "smull    v12.8h,  v0.8b,  v7.8b\n"
+        "sadalp  v21.4s, v13.8h\n"
+        "smull    v13.8h,  v1.8b,  v7.8b\n"
+        "sadalp  v22.4s, v14.8h\n"
+        "smull    v14.8h,  v2.8b,  v7.8b\n"
+        "sadalp  v23.4s, v15.8h\n"
+        "smull    v15.8h,  v3.8b,  v7.8b\n"
+
+        // Multiply-accumulate second-half, again into the same
+        // 16bit local accumulator registers. This is where we
+        // take advantage of having int8 instead of uint8 and therefore
+        // being able to accumulate two products into int16.
+        "smlal2   v8.8h,  v0.16b,  v6.16b\n"
+        "smlal2   v9.8h,  v1.16b,  v6.16b\n"
+        "smlal2   v10.8h,  v2.16b,  v6.16b\n"
+        "smlal2   v11.8h,  v3.16b,  v6.16b\n"
+        "smlal2   v12.8h,  v0.16b,  v7.16b\n"
+        "smlal2   v13.8h,  v1.16b,  v7.16b\n"
+        "smlal2   v14.8h,  v2.16b,  v7.16b\n"
+        "smlal2   v15.8h,  v3.16b,  v7.16b\n"
+
+        "sadalp  v24.4s, v8.8h\n"
+        "sadalp  v25.4s, v9.8h\n"
+        "sadalp  v26.4s, v10.8h\n"
+        "sadalp  v27.4s, v11.8h\n"
+        "sadalp  v28.4s, v12.8h\n"
+        "sadalp  v29.4s, v13.8h\n"
+        "sadalp  v30.4s, v14.8h\n"
+        "sadalp  v31.4s, v15.8h\n"
+
+        // Reduce 32bit accumulators horizontally.
+        "addp v0.4s, v16.4s, v20.4s\n"
+        "addp v2.4s, v17.4s, v21.4s\n"
+        "addp v4.4s, v18.4s, v22.4s\n"
+        "addp v6.4s, v19.4s, v23.4s\n"
+        "addp v1.4s, v24.4s, v28.4s\n"
+        "addp v3.4s, v25.4s, v29.4s\n"
+        "addp v5.4s, v26.4s, v30.4s\n"
+        "addp v7.4s, v27.4s, v31.4s\n"
+
+        "cmp %[start_depth], #0\n"
+        "bne " GEMMLOWP_LABEL_ACCUMULATE_EXISTING_DST_VALUES
+        "f\n"
+
+        // Reduce 32bit accumulators horizontally, second pass
+        // (each pass adds pairwise; we need to add 4-wise).
+        "addp v12.4s, v0.4s, v1.4s\n"
+        "addp v13.4s, v2.4s, v3.4s\n"
+        "addp v14.4s, v4.4s, v5.4s\n"
+        "addp v15.4s, v6.4s, v7.4s\n"
+
+        "b " GEMMLOWP_LABEL_STORE "f\n"
+
+        GEMMLOWP_LABEL_ACCUMULATE_EXISTING_DST_VALUES
+        ":\n"
+
+        // Reduce 32bit accumulators horizontally, second pass
+        // (each pass adds pairwise; we need to add 4-wise),
+        // and load destination values from memory.
+        "mov x0, %[dst_ptr]\n"
+        "ld1 {v12.16b}, [x0], %[dst_col_stride]\n"
+        "addp v8.4s, v0.4s, v1.4s\n"
+        "ld1 {v13.16b}, [x0], %[dst_col_stride]\n"
+        "addp v9.4s, v2.4s, v3.4s\n"
+        "ld1 {v14.16b}, [x0], %[dst_col_stride]\n"
+        "addp v10.4s, v4.4s, v5.4s\n"
+        "ld1 {v15.16b}, [x0]\n"
+        "addp v11.4s, v6.4s, v7.4s\n"
+
+        // Add horizontally-reduced accumulators into
+        // the values loaded from memory
+        "add v12.4s, v12.4s, v8.4s\n"
+        "add v13.4s, v13.4s, v9.4s\n"
+        "add v14.4s, v14.4s, v10.4s\n"
+        "add v15.4s, v15.4s, v11.4s\n"
+
+        GEMMLOWP_LABEL_STORE
+        ":\n"
+        // Store back into memory
+        "mov x0, %[dst_ptr]\n"
+        "st1 {v12.16b}, [x0], %[dst_col_stride]\n"
+        "st1 {v13.16b}, [x0], %[dst_col_stride]\n"
+        "st1 {v14.16b}, [x0], %[dst_col_stride]\n"
+        "st1 {v15.16b}, [x0]\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [dst_ptr] "+r"(dst_ptr), [run_depth] "+r"(run_depth),
+        [dst_col_stride] "+r"(dst_col_stride)
+        :  // inputs
+        [start_depth] "r"(start_depth)
+        :  // clobbers
+        "cc", "memory", "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
+        "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",
+        "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25", "v26", "v27",
+        "v28", "v29", "v30", "v31");
+#undef GEMMLOWP_LABEL_LOOP
+#undef GEMMLOWP_LABEL_AFTER_LOOP_LAST16
+#undef GEMMLOWP_LABEL_ACCUMULATE_EXISTING_DST_VALUES
+#undef GEMMLOWP_LABEL_STORE
+  }
+};
+
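+// Illustrative-only scalar model (not part of this patch) of the two 'addp'
+// reduction passes above: each of the 16 int32x4 partial accumulators belongs
+// to a single entry of the 4x4 result, and its four lanes are folded with two
+// pairwise additions. The function name is hypothetical.
+inline std::int32_t ScalarHorizontalAdd4(const std::int32_t lanes[4]) {
+  // First pass adds pairwise; the second pass adds the two resulting pairs,
+  // which is what the two vector-wide 'addp' passes in the kernel achieve.
+  return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
+}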
+
 // Our main GEMM kernel.
 struct NEON_64_Kernel12x8Depth2 : KernelBase {
   typedef KernelFormat<KernelSideFormat<CellFormat<4, 2>, 3>,
@@ -658,13 +1277,81 @@
            std::size_t run_depth) const override {
     ScopedProfilingLabel label("optimized kernel (NEON 12x8)");
 // See comments above for why we need local numerical labels in our asm.
-#define GEMMLOWP_LOOP_NEON_64_KERNEL_12X8_DEPTH2 "1"
-#define GEMMLOWP_STORE_RESULT_NEON_64_KERNEL_12x8_DEPTH2 "2"
+#define GEMMLOWP_LABEL_CLEAR_ACCUMULATORS "1"
+#define GEMMLOWP_LABEL_BEFORE_LOOP "2"
+#define GEMMLOWP_LABEL_LOOP "3"
+#define GEMMLOWP_LABEL_AFTER_LOOP "4"
 
     assert(dst_row_stride == 1);
     asm volatile(
+        // Load 1 Rhs cell of size 2x8
+        "ld1 {v5.8b}, [%[rhs_ptr]], #8\n"
+        "ld1 {v6.8b}, [%[rhs_ptr]], #8\n"
+
+        // Load 3 Lhs cells of size 4x2 each
+        "ld1 {v2.8b}, [%[lhs_ptr]], #8\n"
+        "ld1 {v3.8b}, [%[lhs_ptr]], #8\n"
+        "ld1 {v4.8b}, [%[lhs_ptr]], #8\n"
+
+        // Multiply dst_col_stride by 4 == sizeof(int32) to use
+        // it as a byte offset below.
+        "lsl %[dst_col_stride], %[dst_col_stride], #2\n"
+
+        "cmp %[start_depth], #0\n"
+        "beq " GEMMLOWP_LABEL_CLEAR_ACCUMULATORS
+        "f\n"
+
+        // Load accumulators
+        "mov x1, %[dst_ptr]\n"
+        "mov x0, x1\n"
+        "ld1 {v8.16b}, [x0], #16\n"
+        "subs %[run_depth], %[run_depth], #2\n"
+        "ld1 {v16.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "ld1 {v24.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "ld1 {v9.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "ld1 {v17.16b}, [x0], #16\n"
+        "ld1 {v25.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "ld1 {v10.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "ld1 {v18.16b}, [x0], #16\n"
+        "ld1 {v26.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "ld1 {v11.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "ld1 {v19.16b}, [x0], #16\n"
+        "ld1 {v27.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "ld1 {v12.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "ld1 {v20.16b}, [x0], #16\n"
+        "ld1 {v28.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "ld1 {v13.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "ld1 {v21.16b}, [x0], #16\n"
+        "ld1 {v29.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "ld1 {v14.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "ld1 {v22.16b}, [x0], #16\n"
+        "ld1 {v30.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "ld1 {v15.16b}, [x0], #16\n"
+        "ld1 {v23.16b}, [x0], #16\n"
+        "ld1 {v31.16b}, [x0]\n"
+
+        "b " GEMMLOWP_LABEL_BEFORE_LOOP "f\n"
+
+        GEMMLOWP_LABEL_CLEAR_ACCUMULATORS
+        ":\n"
+
         // Clear accumulator registers (see layout below)
         "dup v8.4s, wzr\n"
+        "subs %[run_depth], %[run_depth], #2\n"
         "dup v9.4s, wzr\n"
         "dup v10.4s, wzr\n"
         "dup v11.4s, wzr\n"
@@ -689,9 +1376,12 @@
         "dup v30.4s, wzr\n"
         "dup v31.4s, wzr\n"
 
-        /* Main loop */
+        GEMMLOWP_LABEL_BEFORE_LOOP
+        ":\n"
 
-        GEMMLOWP_LOOP_NEON_64_KERNEL_12X8_DEPTH2
+        "beq " GEMMLOWP_LABEL_AFTER_LOOP "f\n"
+
+        GEMMLOWP_LABEL_LOOP
         ":\n"
 
         // Overview of register layout:
@@ -729,18 +1419,82 @@
         //
         //                            Accumulator
 
-        // Load 1 Rhs cell of size 2x8
-        "ld1 {v0.8b}, [%[rhs_ptr]], #8\n"
-        "ld1 {v1.8b}, [%[rhs_ptr]], #8\n"
+        // Expand Lhs/Rhs cells to 16 bit.
+        "uxtl v0.8h, v5.8b\n"
+        "ld1 {v5.8b}, [%[rhs_ptr]], #8\n"
+        "uxtl v1.8h, v6.8b\n"
+        "ld1 {v6.8b}, [%[rhs_ptr]], #8\n"
+        "uxtl v2.8h, v2.8b\n"
+        "uxtl v3.8h, v3.8b\n"
+        "uxtl v4.8h, v4.8b\n"
 
-        // Load 3 Lhs cells of size 4x2 each
+        // Multiply-accumulate, top third
+        "umlal v8.4s, v2.4h, v0.h[0]\n"
+        "umlal v9.4s, v2.4h, v0.h[1]\n"
+        "umlal v10.4s, v2.4h, v0.h[2]\n"
+        "umlal v11.4s, v2.4h, v0.h[3]\n"
+        "umlal v12.4s, v2.4h, v1.h[0]\n"
+        "umlal v13.4s, v2.4h, v1.h[1]\n"
+        "umlal v14.4s, v2.4h, v1.h[2]\n"
+        "umlal v15.4s, v2.4h, v1.h[3]\n"
+        "umlal2 v8.4s, v2.8h, v0.h[4]\n"
+        "umlal2 v9.4s, v2.8h, v0.h[5]\n"
+        "umlal2 v10.4s, v2.8h, v0.h[6]\n"
+        "umlal2 v11.4s, v2.8h, v0.h[7]\n"
+        "umlal2 v12.4s, v2.8h, v1.h[4]\n"
+        "umlal2 v13.4s, v2.8h, v1.h[5]\n"
+        "umlal2 v14.4s, v2.8h, v1.h[6]\n"
+        "umlal2 v15.4s, v2.8h, v1.h[7]\n"
         "ld1 {v2.8b}, [%[lhs_ptr]], #8\n"
+
+        // Multiply-accumulate, middle third
+        "umlal v16.4s, v3.4h, v0.h[0]\n"
+        "umlal v17.4s, v3.4h, v0.h[1]\n"
+        "umlal v18.4s, v3.4h, v0.h[2]\n"
+        "umlal v19.4s, v3.4h, v0.h[3]\n"
+        "umlal v20.4s, v3.4h, v1.h[0]\n"
+        "umlal v21.4s, v3.4h, v1.h[1]\n"
+        "umlal v22.4s, v3.4h, v1.h[2]\n"
+        "umlal v23.4s, v3.4h, v1.h[3]\n"
+        "umlal2 v16.4s, v3.8h, v0.h[4]\n"
+        "umlal2 v17.4s, v3.8h, v0.h[5]\n"
+        "umlal2 v18.4s, v3.8h, v0.h[6]\n"
+        "umlal2 v19.4s, v3.8h, v0.h[7]\n"
+        "umlal2 v20.4s, v3.8h, v1.h[4]\n"
+        "umlal2 v21.4s, v3.8h, v1.h[5]\n"
+        "umlal2 v22.4s, v3.8h, v1.h[6]\n"
+        "umlal2 v23.4s, v3.8h, v1.h[7]\n"
         "ld1 {v3.8b}, [%[lhs_ptr]], #8\n"
+
+        "subs %[run_depth], %[run_depth], #2\n"
+
+        // Multiply-accumulate, bottom third
+        "umlal v24.4s, v4.4h, v0.h[0]\n"
+        "umlal v25.4s, v4.4h, v0.h[1]\n"
+        "umlal v26.4s, v4.4h, v0.h[2]\n"
+        "umlal v27.4s, v4.4h, v0.h[3]\n"
+        "umlal v28.4s, v4.4h, v1.h[0]\n"
+        "umlal v29.4s, v4.4h, v1.h[1]\n"
+        "umlal v30.4s, v4.4h, v1.h[2]\n"
+        "umlal v31.4s, v4.4h, v1.h[3]\n"
+        "umlal2 v24.4s, v4.8h, v0.h[4]\n"
+        "umlal2 v25.4s, v4.8h, v0.h[5]\n"
+        "umlal2 v26.4s, v4.8h, v0.h[6]\n"
+        "umlal2 v27.4s, v4.8h, v0.h[7]\n"
+        "umlal2 v28.4s, v4.8h, v1.h[4]\n"
+        "umlal2 v29.4s, v4.8h, v1.h[5]\n"
+        "umlal2 v30.4s, v4.8h, v1.h[6]\n"
+        "umlal2 v31.4s, v4.8h, v1.h[7]\n"
         "ld1 {v4.8b}, [%[lhs_ptr]], #8\n"
 
+        "bne " GEMMLOWP_LABEL_LOOP "b\n"
+
+        GEMMLOWP_LABEL_AFTER_LOOP
+        ":\n"
+
         // Expand Lhs/Rhs cells to 16 bit.
-        "uxtl v0.8h, v0.8b\n"
-        "uxtl v1.8h, v1.8b\n"
+        "uxtl v0.8h, v5.8b\n"
+        "uxtl v1.8h, v6.8b\n"
         "uxtl v2.8h, v2.8b\n"
         "uxtl v3.8h, v3.8b\n"
         "uxtl v4.8h, v4.8b\n"
@@ -797,167 +1551,52 @@
         "umlal2 v30.4s, v4.8h, v1.h[6]\n"
         "umlal2 v31.4s, v4.8h, v1.h[7]\n"
 
-        // Loop. Decrement loop index (depth) by 2, since we just handled 2
-        // levels of depth (Kernel::kDepth=2).
+        // Store accumulators
+        "mov x1, %[dst_ptr]\n"
+        "mov x0, x1\n"
+        "st1 {v8.16b}, [x0], #16\n"
         "subs %[run_depth], %[run_depth], #2\n"
-        "bne " GEMMLOWP_LOOP_NEON_64_KERNEL_12X8_DEPTH2
-        "b\n"
-
-        /* end of main loop */
-
-        /* Accumulate our local accumulator registers into the destination block
-           */
-
-        // Compute stride between consecutive columns, in bytes
-        "mov x0, #4\n"  // multiply by 4 = sizeof(int32)
-        "mul %[dst_col_stride], %[dst_col_stride], x0\n"
-
-        // If start_depth == 0, then there is no preexisting accumulator
-        // to accumulate, so we can simply store our result.
-        "cmp %[start_depth], #0\n"
-        "beq " GEMMLOWP_STORE_RESULT_NEON_64_KERNEL_12x8_DEPTH2
-        "f\n"
-
-        "mov x0, %[dst_ptr]\n"
-
-        // Load a column
-        "mov x1, x0\n"
-        "ld1 {v0.4s}, [x1], #16\n"
-        "ld1 {v1.4s}, [x1], #16\n"
-        "ld1 {v2.4s}, [x1], #16\n"
-        // Accumulate a column
-        "add v8.4s, v8.4s, v0.4s\n"
-        "add v16.4s, v16.4s, v1.4s\n"
-        "add v24.4s, v24.4s, v2.4s\n"
-
-        "add x0, x0, %[dst_col_stride]\n"
-        // Load a column
-        "mov x1, x0\n"
-        "ld1 {v0.4s}, [x1], #16\n"
-        "ld1 {v1.4s}, [x1], #16\n"
-        "ld1 {v2.4s}, [x1], #16\n"
-        // Accumulate a column
-        "add v9.4s, v9.4s, v0.4s\n"
-        "add v17.4s, v17.4s, v1.4s\n"
-        "add v25.4s, v25.4s, v2.4s\n"
-
-        "add x0, x0, %[dst_col_stride]\n"
-        // Load a column
-        "mov x1, x0\n"
-        "ld1 {v0.4s}, [x1], #16\n"
-        "ld1 {v1.4s}, [x1], #16\n"
-        "ld1 {v2.4s}, [x1], #16\n"
-        // Accumulate a column
-        "add v10.4s, v10.4s, v0.4s\n"
-        "add v18.4s, v18.4s, v1.4s\n"
-        "add v26.4s, v26.4s, v2.4s\n"
-
-        "add x0, x0, %[dst_col_stride]\n"
-        // Load a column
-        "mov x1, x0\n"
-        "ld1 {v0.4s}, [x1], #16\n"
-        "ld1 {v1.4s}, [x1], #16\n"
-        "ld1 {v2.4s}, [x1], #16\n"
-        // Accumulate a column
-        "add v11.4s, v11.4s, v0.4s\n"
-        "add v19.4s, v19.4s, v1.4s\n"
-        "add v27.4s, v27.4s, v2.4s\n"
-
-        "add x0, x0, %[dst_col_stride]\n"
-        // Load a column
-        "mov x1, x0\n"
-        "ld1 {v0.4s}, [x1], #16\n"
-        "ld1 {v1.4s}, [x1], #16\n"
-        "ld1 {v2.4s}, [x1], #16\n"
-        // Accumulate a column
-        "add v12.4s, v12.4s, v0.4s\n"
-        "add v20.4s, v20.4s, v1.4s\n"
-        "add v28.4s, v28.4s, v2.4s\n"
-
-        "add x0, x0, %[dst_col_stride]\n"
-        // Load a column
-        "mov x1, x0\n"
-        "ld1 {v0.4s}, [x1], #16\n"
-        "ld1 {v1.4s}, [x1], #16\n"
-        "ld1 {v2.4s}, [x1], #16\n"
-        // Accumulate a column
-        "add v13.4s, v13.4s, v0.4s\n"
-        "add v21.4s, v21.4s, v1.4s\n"
-        "add v29.4s, v29.4s, v2.4s\n"
-
-        "add x0, x0, %[dst_col_stride]\n"
-        // Load a column
-        "mov x1, x0\n"
-        "ld1 {v0.4s}, [x1], #16\n"
-        "ld1 {v1.4s}, [x1], #16\n"
-        "ld1 {v2.4s}, [x1], #16\n"
-        // Accumulate a column
-        "add v14.4s, v14.4s, v0.4s\n"
-        "add v22.4s, v22.4s, v1.4s\n"
-        "add v30.4s, v30.4s, v2.4s\n"
-
-        "add x0, x0, %[dst_col_stride]\n"
-        // Load a column
-        "mov x1, x0\n"
-        "ld1 {v0.4s}, [x1], #16\n"
-        "ld1 {v1.4s}, [x1], #16\n"
-        "ld1 {v2.4s}, [x1], #16\n"
-        // Accumulate a column
-        "add v15.4s, v15.4s, v0.4s\n"
-        "add v23.4s, v23.4s, v1.4s\n"
-        "add v31.4s, v31.4s, v2.4s\n"
-
-        GEMMLOWP_STORE_RESULT_NEON_64_KERNEL_12x8_DEPTH2
-        ":\n"
-
-        "mov x0, %[dst_ptr]\n"
-        // Store a column
-        "mov x1, x0\n"
-        "st1 {v8.4s}, [x1], #16\n"
-        "st1 {v16.4s}, [x1], #16\n"
-        "st1 {v24.4s}, [x1], #16\n"
-        // Store a column
-        "add x0, x0, %[dst_col_stride]\n"
-        "mov x1, x0\n"
-        "st1 {v9.4s}, [x1], #16\n"
-        "st1 {v17.4s}, [x1], #16\n"
-        "st1 {v25.4s}, [x1], #16\n"
-        // Store a column
-        "add x0, x0, %[dst_col_stride]\n"
-        "mov x1, x0\n"
-        "st1 {v10.4s}, [x1], #16\n"
-        "st1 {v18.4s}, [x1], #16\n"
-        "st1 {v26.4s}, [x1], #16\n"
-        // Store a column
-        "add x0, x0, %[dst_col_stride]\n"
-        "mov x1, x0\n"
-        "st1 {v11.4s}, [x1], #16\n"
-        "st1 {v19.4s}, [x1], #16\n"
-        "st1 {v27.4s}, [x1], #16\n"
-        // Store a column
-        "add x0, x0, %[dst_col_stride]\n"
-        "mov x1, x0\n"
-        "st1 {v12.4s}, [x1], #16\n"
-        "st1 {v20.4s}, [x1], #16\n"
-        "st1 {v28.4s}, [x1], #16\n"
-        // Store a column
-        "add x0, x0, %[dst_col_stride]\n"
-        "mov x1, x0\n"
-        "st1 {v13.4s}, [x1], #16\n"
-        "st1 {v21.4s}, [x1], #16\n"
-        "st1 {v29.4s}, [x1], #16\n"
-        // Store a column
-        "add x0, x0, %[dst_col_stride]\n"
-        "mov x1, x0\n"
-        "st1 {v14.4s}, [x1], #16\n"
-        "st1 {v22.4s}, [x1], #16\n"
-        "st1 {v30.4s}, [x1], #16\n"
-        // Store a column
-        "add x0, x0, %[dst_col_stride]\n"
-        "mov x1, x0\n"
-        "st1 {v15.4s}, [x1], #16\n"
-        "st1 {v23.4s}, [x1], #16\n"
-        "st1 {v31.4s}, [x1], #16\n"
+        "st1 {v16.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "st1 {v24.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "st1 {v9.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "st1 {v17.16b}, [x0], #16\n"
+        "st1 {v25.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "st1 {v10.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "st1 {v18.16b}, [x0], #16\n"
+        "st1 {v26.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "st1 {v11.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "st1 {v19.16b}, [x0], #16\n"
+        "st1 {v27.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "st1 {v12.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "st1 {v20.16b}, [x0], #16\n"
+        "st1 {v28.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "st1 {v13.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "st1 {v21.16b}, [x0], #16\n"
+        "st1 {v29.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "st1 {v14.16b}, [x0], #16\n"
+        "add x1, x1, %[dst_col_stride]\n"
+        "st1 {v22.16b}, [x0], #16\n"
+        "st1 {v30.16b}, [x0]\n"
+        "mov x0, x1\n"
+        "st1 {v15.16b}, [x0], #16\n"
+        "st1 {v23.16b}, [x0], #16\n"
+        "st1 {v31.16b}, [x0]\n"
+#undef GEMMLOWP_LABEL_CLEAR_ACCUMULATORS
+#undef GEMMLOWP_LABEL_BEFORE_LOOP
+#undef GEMMLOWP_LABEL_LOOP
+#undef GEMMLOWP_LABEL_AFTER_LOOP
         :  // outputs
         [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
         [dst_ptr] "+r"(dst_ptr),
@@ -970,78 +1609,11 @@
         "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15", "v16",
         "v17", "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25", "v26",
         "v27", "v28", "v29", "v30", "v31");
-#undef GEMMLOWP_LOOP_NEON_64_KERNEL_12X8_DEPTH2
-#undef GEMMLOWP_STORE_RESULT_NEON_64_KERNEL_12x8_DEPTH2
   }
 };
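+
+// Illustrative-only scalar model (not part of this patch) of the
+// 'umlal v8.4s, v2.4h, v0.h[0]' pattern in the 12x8 kernel above: four Lhs
+// entries, widened from uint8 to uint16 by uxtl, are multiplied by a single
+// widened Rhs entry and accumulated into a column of four 32-bit
+// accumulators. The function name is hypothetical.
+inline void ScalarMulAddByElement(std::int32_t* acc4,
+                                  const std::uint16_t* lhs4,
+                                  std::uint16_t rhs_element) {
+  for (int i = 0; i < 4; i++) {
+    acc4[i] += static_cast<std::int32_t>(lhs4[i]) *
+               static_cast<std::int32_t>(rhs_element);
+  }
+}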
 
 #endif  // GEMMLOWP_NEON_64
 
-// Our main GEMV kernel.
-// Because our GEMV performance is low and not dominated by the kernel
-// at the moment, it's not worth optimizing too hard yet.
-// Using intrinsics allows us to write one implementation for both 32bit and
-// 64bit ARM, and should also perform OK here because the register pressure
-// is not so high in this GEMV kernel.
-// When/if we get serious about GEMV performance, we will want to
-// implement it to bypass packing altogether, and use source data in-place
-// with different GEMV kernels for row-major and column-major LHS.
-template <int Cells>
-struct NEONKernel4Nx1Depth2 : KernelBase {
-  typedef KernelFormat<KernelSideFormat<CellFormat<4, 2>, Cells>,
-                       KernelSideFormat<CellFormat<1, 2>, 1> >
-      Format;
-
-  const char* Name() const override { return "NEON intrinsics, 4Nx1, depth 2"; }
-
-  void Run(std::int32_t* dst_ptr, std::size_t dst_row_stride,
-           std::size_t dst_col_stride, const std::uint8_t* lhs_ptr,
-           const std::uint8_t* rhs_ptr, std::size_t start_depth,
-           std::size_t run_depth) const override {
-    ScopedProfilingLabel label("optimized kernel (NEON 4Nx1)");
-
-    assert(dst_row_stride == 1);
-
-    // Clear accumulators
-    uint32x4_t acc[Cells];
-    for (int cell = 0; cell < Cells; cell++) {
-      acc[cell] = vdupq_n_u32(0);
-    }
-    // Main loop
-    for (std::size_t d = 0; d < run_depth; d += 2) {
-      // Load LHS cells
-      uint16x8_t lhs[Cells];
-      for (int cell = 0; cell < Cells; cell++) {
-        lhs[cell] = vmovl_u8(vld1_u8(lhs_ptr));
-        lhs_ptr += 8;
-      }
-      // Load RHS cell
-      uint16_t rhs0 = rhs_ptr[0];
-      uint16_t rhs1 = rhs_ptr[1];
-      rhs_ptr += 2;
-      // Multiply-accumulate, level of depth 0
-      for (int cell = 0; cell < Cells; cell++) {
-        acc[cell] = vmlal_n_u16(acc[cell], vget_low_u16(lhs[cell]), rhs0);
-      }
-      // Multiply-accumulate, level of depth 1
-      for (int cell = 0; cell < Cells; cell++) {
-        acc[cell] = vmlal_n_u16(acc[cell], vget_high_u16(lhs[cell]), rhs1);
-      }
-    }
-    // If start_depth is nonzero, accumulate with the existing accumulator
-    if (start_depth) {
-      for (int cell = 0; cell < Cells; cell++) {
-        acc[cell] = vaddq_u32(
-            acc[cell], vreinterpretq_u32_s32(vld1q_s32(dst_ptr + 4 * cell)));
-      }
-    }
-    // Store the accumulators
-    for (int cell = 0; cell < Cells; cell++) {
-      vst1q_s32(dst_ptr + 4 * cell, vreinterpretq_s32_u32(acc[cell]));
-    }
-  }
-};
-
 }  // namespace gemmlowp
 
 #endif  // GEMMLOWP_INTERNAL_KERNEL_NEON_H_
diff --git a/internal/kernel_reference.h b/internal/kernel_reference.h
index 020b479..3458c6a 100644
--- a/internal/kernel_reference.h
+++ b/internal/kernel_reference.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -59,15 +59,13 @@
       // The next two loops are over cells of the Lhs (stacked vertically),
       // and over cells of the Rhs (stacked horizontally).
       for (int rc = 0; rc < Format::Lhs::kCells; rc++) {
-        const std::uint8_t* lhs_cell_ptr = lhs_ptr +
-                                           (dc * Format::Lhs::kCells + rc) *
-                                               Format::Lhs::Cell::kWidth *
-                                               Format::kDepth;
+        const std::uint8_t* lhs_cell_ptr =
+            lhs_ptr + (dc * Format::Lhs::kCells + rc) *
+                          Format::Lhs::Cell::kWidth * Format::kDepth;
         for (int cc = 0; cc < Format::Rhs::kCells; cc++) {
-          const std::uint8_t* rhs_cell_ptr = rhs_ptr +
-                                             (dc * Format::Rhs::kCells + cc) *
-                                                 Format::Rhs::Cell::kWidth *
-                                                 Format::kDepth;
+          const std::uint8_t* rhs_cell_ptr =
+              rhs_ptr + (dc * Format::Rhs::kCells + cc) *
+                            Format::Rhs::Cell::kWidth * Format::kDepth;
 
           // Now we are inside one cell of the Lhs and inside one cell
           // of the Rhs, so the remaining inner loops are just
diff --git a/internal/kernel_sse.h b/internal/kernel_sse.h
new file mode 100644
index 0000000..b879fd7
--- /dev/null
+++ b/internal/kernel_sse.h
@@ -0,0 +1,517 @@
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// kernel_sse.h: a collection of Intel SSE optimized kernels.
+// See kernel_default.h for which one(s) are actually used by default.
+// Others are mere experiments; they are still covered by tests
+// in case they might be useful some day.
+//
+
+#ifndef GEMMLOWP_INTERNAL_KERNEL_SSE_H_
+#define GEMMLOWP_INTERNAL_KERNEL_SSE_H_
+
+#include "kernel.h"
+
+#include <string.h>
+#include <cassert>
+
+namespace gemmlowp {
+
+#ifdef GEMMLOWP_SSE4_32
+struct SSE4_32_Kernel4x4Depth2 : KernelBase {
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 2, CellOrder::WidthMajor>, 1>,
+      KernelSideFormat<CellFormat<4, 2, CellOrder::WidthMajor>, 1> >
+      Format;
+
+  const char* Name() const override { return "SSE, 4x4, depth 2"; }
+
+  void Run(std::int32_t* dst_ptr, std::size_t dst_row_stride,
+           std::size_t dst_col_stride, const std::uint8_t* lhs_ptr,
+           const std::uint8_t* rhs_ptr, std::size_t start_depth,
+           std::size_t run_depth) const override {
+    ScopedProfilingLabel label("optimized kernel");
+    assert(dst_row_stride == 1);
+    std::int32_t run_depth_cells = run_depth / Format::kDepth;
+    /* Main loop */
+
+    // A 2x4 cell of Rhs is stored in 16bit in xmm1.
+    // A 4x2 block of Lhs is stored in 16bit in xmm0.
+    // A 4x4 block of accumulators is stored in 32bit in xmm4--xmm7.
+    //
+    //                   +-------+-------+-------+-------+
+    //                   |xmm1[0]|xmm1[2]|xmm1[4]|xmm1[6]|
+    //              Rhs  +-------+-------+-------+-------+
+    //                   |xmm1[1]|xmm1[3]|xmm1[5]|xmm1[7]|
+    //                   +-------+-------+-------+-------+
+    //
+    //                   |       |       |       |       |
+    //
+    //    Lhs            |       |       |       |       |
+    //
+    //  +--+--+ - - - -  +-------+-------+-------+-------+
+    //  |xmm0 |          | xmm4  | xmm5  | xmm6  | xmm7  |
+    //  |xmm0 | (Iter1)  | xmm4  | xmm5  | xmm6  | xmm7  |
+    //  |xmm0 |          | xmm4  | xmm5  | xmm6  | xmm7  |
+    //  |xmm0 |          | xmm4  | xmm5  | xmm6  | xmm7  |
+    //  +--+--+ - - - -  +-------+-------+-------+-------+
+    //
+    //                              Accumulator
+
+    asm volatile(
+
+        // set accumulators to zero.
+        "pxor %%xmm4  , %%xmm4 \n\t"
+        "pxor %%xmm5  , %%xmm5 \n\t"
+        "pxor %%xmm6  , %%xmm6 \n\t"
+        "pxor %%xmm7  , %%xmm7 \n\t"
+
+        "movl  %[run_depth_cells], %%eax\n\t"
+        "subl $2, %%eax\n\t"
+        "js outerLoop1%=\n\t"
+
+        // Loop for K unrolled by 4
+        "outerLoop2%=:\n\t"
+
+        // K = 1,2
+        // RHS cell to xmm1
+        "pmovzxbw (%[rhs_ptr]), %%xmm1\n\t"
+
+        // LHS cell
+        "pmovzxbw 0x00(%[lhs_ptr]), %%xmm0\n\t"
+        "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm4           \n\t"
+        "paddd %%xmm3, %%xmm5           \n\t"
+
+        "prefetcht0 0x80(%[lhs_ptr]) \n\t"
+
+        "pshufd $0xaa,%%xmm1,%%xmm2     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pshufd $0xff,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+
+        "prefetcht0 0x80(%[rhs_ptr]) \n\t"
+
+        // K = 3,4
+        // RHS cell to xmm1
+        "pmovzxbw 0x08(%[rhs_ptr]), %%xmm1\n\t"
+
+        "paddd %%xmm2, %%xmm6           \n\t"
+        "paddd %%xmm3, %%xmm7           \n\t"
+
+        // LHS cell
+        "pmovzxbw 0x08(%[lhs_ptr]), %%xmm0\n\t"
+        "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm4           \n\t"
+        "paddd %%xmm3, %%xmm5           \n\t"
+        "pshufd $0xaa,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0xff,%%xmm1,%%xmm3     \n\t"
+
+        "addl $0x10, %[lhs_ptr]         \n\t"
+        "addl $0x10, %[rhs_ptr]         \n\t"
+
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm3, %%xmm7           \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "paddd %%xmm2, %%xmm6           \n\t"
+
+        "subl $2, %[run_depth_cells]\n\t"
+        "ja outerLoop2%=\n\t"
+
+        "movl %[run_depth_cells], %%eax\n\t"
+        "decl %%eax\n\t"
+        "js finish%=\n\t"
+
+        // Loop for K unrolled by 2
+        "outerLoop1%=:\n\t"
+
+        // RHS cell to xmm1
+        "pmovzxbw (%[rhs_ptr]), %%xmm1\n\t"
+
+        // LHS cell
+        "pmovzxbw 0x00(%[lhs_ptr]), %%xmm0\n\t"
+        "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "paddd %%xmm2, %%xmm4           \n\t"
+        "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm3, %%xmm5           \n\t"
+
+        "pshufd $0xaa,%%xmm1,%%xmm2     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "paddd %%xmm2, %%xmm6           \n\t"
+        "pshufd $0xff,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm3, %%xmm7           \n\t"
+
+        "addl $0x08, %[lhs_ptr]\n\t"
+        "addl $0x08, %[rhs_ptr]\n\t"
+
+        "decl %[run_depth_cells]\n\t"
+        "jnz outerLoop1%=\n\t"
+
+        "finish%=:\n\t"
+
+        "movl  %[dst_col_stride], %%eax\n\t"
+        "shll $2, %%eax\n\t"
+
+        "movl  %[start_depth], %%ecx\n\t"
+        "test %%ecx, %%ecx\n\t"
+        "jz storeDst%=\n\t"
+
+        "leal (%%eax,%%eax,0x2), %%ecx\n\t"
+        "paddd 0x00(%[dst_ptr])           , %%xmm4 \n\t"
+        "paddd 0x00(%[dst_ptr], %%eax, 1) , %%xmm5 \n\t"
+        "paddd 0x00(%[dst_ptr], %%eax, 2) , %%xmm6 \n\t"
+        "paddd 0x00(%[dst_ptr], %%ecx, 1) , %%xmm7 \n\t"
+
+        "storeDst%=:\n\t"
+
+        "leal (%%eax,%%eax,0x2), %%ecx\n\t"
+        "movdqu %%xmm4  , 0x00(%[dst_ptr])          \n\t"
+        "movdqu %%xmm5  , 0x00(%[dst_ptr], %%eax, 1)\n\t"
+        "movdqu %%xmm6  , 0x00(%[dst_ptr], %%eax, 2)\n\t"
+        "movdqu %%xmm7  , 0x00(%[dst_ptr], %%ecx, 1)\n\t"
+
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [dst_ptr] "+r"(dst_ptr)
+        :  // inputs
+        [start_depth] "g"(start_depth), [dst_col_stride] "g"(dst_col_stride),
+        [run_depth_cells] "g"(run_depth_cells)
+        :  // clobbers
+        "cc", "memory", "%xmm0", "%xmm1", "%xmm3", "%xmm2", "%xmm4", "%xmm5",
+        "%xmm6", "%xmm7", "%eax", "%ecx");
+  }
+};
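+
+// Illustrative-only scalar model (not part of this patch) of one depth-2 step
+// of the pmovzxbw / pshufd / pmaddwd pattern above: pmovzxbw widens the uint8
+// cells to 16 bit, pshufd broadcasts one Rhs column (its two depth levels) to
+// every 32-bit lane, and pmaddwd multiplies adjacent 16-bit pairs and adds
+// them into int32, i.e. it handles both depth levels of one accumulator
+// column at once. Function and parameter names are hypothetical, and the
+// indexing assumes the packed cells interleave the two depth levels per row,
+// as the WidthMajor cell format above is expected to do.
+inline void ScalarPmaddwdStep(std::int32_t* acc_column4,
+                              const std::uint8_t* lhs_cell,    // 4 rows x 2 depth
+                              const std::uint8_t* rhs_column) {  // 2 depth levels
+  for (int row = 0; row < 4; row++) {
+    acc_column4[row] +=
+        static_cast<std::int32_t>(lhs_cell[2 * row + 0]) *
+            static_cast<std::int32_t>(rhs_column[0]) +
+        static_cast<std::int32_t>(lhs_cell[2 * row + 1]) *
+            static_cast<std::int32_t>(rhs_column[1]);
+  }
+}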
+#endif
+#ifdef GEMMLOWP_SSE4_64
+struct SSE4_64_Kernel12x4Depth2 : KernelBase {
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 2, CellOrder::WidthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 2, CellOrder::WidthMajor>, 1> >
+      Format;
+
+  const char* Name() const override { return "SSE, 12x4, depth 2"; }
+
+  void Run(std::int32_t* dst_ptr, std::size_t dst_row_stride,
+           std::size_t dst_col_stride, const std::uint8_t* lhs_ptr,
+           const std::uint8_t* rhs_ptr, std::size_t start_depth,
+           std::size_t run_depth) const override {
+    ScopedProfilingLabel label("optimized kernel");
+    assert(dst_row_stride == 1);
+    const std::int64_t run_depth_cells = run_depth / Format::kDepth;
+    const std::int64_t dst_col_stride_q = dst_col_stride;
+
+    /* Main loop */
+
+    // A 2x4 cell of Rhs is stored in 16bit in xmm1.
+    // A 12x2 block of Lhs (3 cells of size 4x2 each) is stored in 16bit in
+    // xmm0, replaced every iteration.
+    // A 12x4 block of accumulators is stored in 32bit in xmm4--xmm15.
+    //
+    //                   +-------+-------+-------+-------+
+    //                   |xmm1[0]|xmm1[2]|xmm1[4]|xmm1[6]|
+    //              Rhs  +-------+-------+-------+-------+
+    //                   |xmm1[1]|xmm1[3]|xmm1[5]|xmm1[7]|
+    //                   +-------+-------+-------+-------+
+    //
+    //                   |       |       |       |       |
+    //
+    //    Lhs            |       |       |       |       |
+    //
+    //  +--+--+ - - - -  +-------+-------+-------+-------+
+    //  |xmm0 |          | xmm4  | xmm5  | xmm6  | xmm7  |
+    //  |xmm0 | (Iter1)  | xmm4  | xmm5  | xmm6  | xmm7  |
+    //  |xmm0 |          | xmm4  | xmm5  | xmm6  | xmm7  |
+    //  |xmm0 |          | xmm4  | xmm5  | xmm6  | xmm7  |
+    //  +--+--+ - - - -  +-------+-------+-------+-------+
+    //  |xmm0 |          | xmm8  | xmm9  | xmm10 | xmm11 |
+    //  |xmm0 | (Iter2)  | xmm8  | xmm9  | xmm10 | xmm11 |
+    //  |xmm0 |          | xmm8  | xmm9  | xmm10 | xmm11 |
+    //  |xmm0 |          | xmm8  | xmm9  | xmm10 | xmm11 |
+    //  +--+--+ - - - -  +-------+-------+-------+-------+
+    //  |xmm0 |          | xmm12 | xmm13 | xmm14 | xmm15 |
+    //  |xmm0 | (Iter3)  | xmm12 | xmm13 | xmm14 | xmm15 |
+    //  |xmm0 |          | xmm12 | xmm13 | xmm14 | xmm15 |
+    //  |xmm0 |          | xmm12 | xmm13 | xmm14 | xmm15 |
+    //  +--+--+ - - - -  +-------+-------+-------+-------+
+    //
+    //                              Accumulator
+
+    asm volatile(
+
+        // Set registers for destination
+        "movq  %[dst_col_stride_q], %%r12\n\t"
+        "shlq $2, %%r12\n\t"
+        "leaq (%%r12,%%r12,0x2), %%r13\n\t"
+
+        // Set accumulators to zero.
+        "pxor %%xmm4  , %%xmm4 \n\t"
+        "pxor %%xmm5  , %%xmm5 \n\t"
+        "pxor %%xmm6  , %%xmm6 \n\t"
+        "pxor %%xmm7  , %%xmm7 \n\t"
+        "pxor %%xmm8  , %%xmm8 \n\t"
+        "pxor %%xmm9  , %%xmm9 \n\t"
+        "pxor %%xmm10 , %%xmm10\n\t"
+        "pxor %%xmm11 , %%xmm11\n\t"
+        "pxor %%xmm12 , %%xmm12\n\t"
+        "pxor %%xmm13 , %%xmm13\n\t"
+        "pxor %%xmm14 , %%xmm14\n\t"
+        "pxor %%xmm15 , %%xmm15\n\t"
+
+        "movq  %[run_depth_cells], %%r14\n\t"
+        "subq $2, %%r14\n\t"
+        "js outerLoop1%=\n\t"
+
+        // Loop for K unrolled by 4
+        "outerLoop2%=:\n\t"
+
+        // K = 1,2
+        // RHS cell to xmm1
+
+        "pmovzxbw (%[rhs_ptr]), %%xmm1\n\t"
+
+        // LHS cell
+        "pmovzxbw 0x00(%[lhs_ptr]), %%xmm0\n\t"
+        "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm4           \n\t"
+        "paddd %%xmm3, %%xmm5           \n\t"
+
+        "prefetcht0 0x80(%[lhs_ptr]) \n\t"
+
+        "pshufd $0xaa,%%xmm1,%%xmm2     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pshufd $0xff,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+
+        // next LHS cell
+        "pmovzxbw 0x08(%[lhs_ptr]), %%xmm0\n\t"
+
+        "paddd %%xmm2, %%xmm6           \n\t"
+        "paddd %%xmm3, %%xmm7           \n\t"
+
+        "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm8           \n\t"
+        "paddd %%xmm3, %%xmm9           \n\t"
+
+        "prefetcht0 0x80(%[rhs_ptr]) \n\t"
+
+        "pshufd $0xaa,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0xff,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm10          \n\t"
+        "paddd %%xmm3, %%xmm11          \n\t"
+
+        // next LHS cell
+        "pmovzxbw 0x10(%[lhs_ptr]), %%xmm0\n\t"
+        "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm12          \n\t"
+        "paddd %%xmm3, %%xmm13          \n\t"
+
+        "pshufd $0xaa,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0xff,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm14          \n\t"
+        "paddd %%xmm3, %%xmm15          \n\t"
+
+        // K = 3,4
+        // RHS cell to xmm1
+        "pmovzxbw 0x08(%[rhs_ptr]), %%xmm1\n\t"
+
+        // LHS cell
+        "pmovzxbw 0x18(%[lhs_ptr]), %%xmm0\n\t"
+        "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm4           \n\t"
+        "paddd %%xmm3, %%xmm5           \n\t"
+
+        "pshufd $0xaa,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0xff,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm6           \n\t"
+        "paddd %%xmm3, %%xmm7           \n\t"
+
+        // next LHS cell
+        "pmovzxbw 0x20(%[lhs_ptr]), %%xmm0\n\t"
+        "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm8           \n\t"
+        "paddd %%xmm3, %%xmm9           \n\t"
+
+        "pshufd $0xaa,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0xff,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm10          \n\t"
+        "paddd %%xmm3, %%xmm11          \n\t"
+
+        // next LHS cell
+        "pmovzxbw 0x28(%[lhs_ptr]), %%xmm0\n\t"
+
+        "addq $0x30, %[lhs_ptr]         \n\t"
+        "addq $0x10, %[rhs_ptr]         \n\t"
+
+        "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm12          \n\t"
+        "paddd %%xmm3, %%xmm13          \n\t"
+
+        "pshufd $0xaa,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0xff,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm14          \n\t"
+        "paddd %%xmm3, %%xmm15          \n\t"
+
+        "subq $2, %[run_depth_cells]\n\t"
+        "ja outerLoop2%=\n\t"
+
+        "movq %[run_depth_cells], %%r14\n\t"
+        "decq %%r14\n\t"
+        "js finish%=\n\t"
+
+        // Loop for K unrolled by 2
+        "outerLoop1%=:\n\t"
+
+        // RHS cell to xmm1
+        "pmovzxbw (%[rhs_ptr]), %%xmm1\n\t"
+
+        // LHS cell
+        "pmovzxbw 0x00(%[lhs_ptr]), %%xmm0\n\t"
+        "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm4           \n\t"
+        "paddd %%xmm3, %%xmm5           \n\t"
+        "pshufd $0xaa,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0xff,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm6           \n\t"
+        "paddd %%xmm3, %%xmm7           \n\t"
+
+        // next LHS cell
+        "pmovzxbw 0x08(%[lhs_ptr]), %%xmm0\n\t"
+        "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm8           \n\t"
+        "paddd %%xmm3, %%xmm9           \n\t"
+        "pshufd $0xaa,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0xff,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm10          \n\t"
+        "paddd %%xmm3, %%xmm11          \n\t"
+
+        // next LHS cell
+        "pmovzxbw 0x10(%[lhs_ptr]), %%xmm0\n\t"
+
+        "addq $0x18, %[lhs_ptr]         \n\t"
+        "addq $0x08, %[rhs_ptr]         \n\t"
+
+        "pshufd $0x00,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0x55,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm12          \n\t"
+        "paddd %%xmm3, %%xmm13          \n\t"
+        "pshufd $0xaa,%%xmm1,%%xmm2     \n\t"
+        "pshufd $0xff,%%xmm1,%%xmm3     \n\t"
+        "pmaddwd %%xmm0, %%xmm2         \n\t"
+        "pmaddwd %%xmm0, %%xmm3         \n\t"
+        "paddd %%xmm2, %%xmm14          \n\t"
+        "paddd %%xmm3, %%xmm15          \n\t"
+
+        "decq %[run_depth_cells]\n\t"
+        "jnz outerLoop1%=\n\t"
+
+        "finish%=:\n\t"
+
+        "test %[start_depth], %[start_depth]\n\t"
+        "jz storeDst%=\n\t"
+
+        "paddd 0x00(%[dst_ptr])           , %%xmm4 \n\t"
+        "paddd 0x10(%[dst_ptr])           , %%xmm8 \n\t"
+        "paddd 0x20(%[dst_ptr])           , %%xmm12\n\t"
+        "paddd 0x00(%[dst_ptr], %%r12, 1) , %%xmm5 \n\t"
+        "paddd 0x10(%[dst_ptr], %%r12, 1) , %%xmm9 \n\t"
+        "paddd 0x20(%[dst_ptr], %%r12, 1) , %%xmm13\n\t"
+        "paddd 0x00(%[dst_ptr], %%r12, 2) , %%xmm6 \n\t"
+        "paddd 0x10(%[dst_ptr], %%r12, 2) , %%xmm10\n\t"
+        "paddd 0x20(%[dst_ptr], %%r12, 2) , %%xmm14\n\t"
+        "paddd 0x00(%[dst_ptr], %%r13, 1) , %%xmm7 \n\t"
+        "paddd 0x10(%[dst_ptr], %%r13, 1) , %%xmm11\n\t"
+        "paddd 0x20(%[dst_ptr], %%r13, 1) , %%xmm15\n\t"
+
+        "storeDst%=:\n\t"
+
+        "movdqu %%xmm4  , 0x00(%[dst_ptr])          \n\t"
+        "movdqu %%xmm8  , 0x10(%[dst_ptr])          \n\t"
+        "movdqu %%xmm12 , 0x20(%[dst_ptr])          \n\t"
+        "movdqu %%xmm5  , 0x00(%[dst_ptr], %%r12, 1)\n\t"
+        "movdqu %%xmm9  , 0x10(%[dst_ptr], %%r12, 1)\n\t"
+        "movdqu %%xmm13 , 0x20(%[dst_ptr], %%r12, 1)\n\t"
+        "movdqu %%xmm6  , 0x00(%[dst_ptr], %%r12, 2)\n\t"
+        "movdqu %%xmm10 , 0x10(%[dst_ptr], %%r12, 2)\n\t"
+        "movdqu %%xmm14 , 0x20(%[dst_ptr], %%r12, 2)\n\t"
+        "movdqu %%xmm7  , 0x00(%[dst_ptr], %%r13, 1)\n\t"
+        "movdqu %%xmm11 , 0x10(%[dst_ptr], %%r13, 1)\n\t"
+        "movdqu %%xmm15 , 0x20(%[dst_ptr], %%r13, 1)\n\t"
+
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [dst_ptr] "+r"(dst_ptr)
+        :  // inputs
+        [start_depth] "r"(start_depth),
+        [dst_col_stride_q] "r"(dst_col_stride_q),
+        [run_depth_cells] "r"(run_depth_cells)
+        :  // clobbers
+        "cc", "memory", "%xmm0", "%xmm1", "%xmm3", "%xmm2", "%xmm4", "%xmm5",
+        "%xmm6", "%xmm7", "%xmm8", "%xmm9", "%xmm10", "%r12", "%r13", "%r14",
+        "%xmm11", "%xmm12", "%xmm13", "%xmm14", "%xmm15");
+  }
+};
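+
+// Illustrative-only helper (not part of this patch) restating the destination
+// addressing above: r12 holds the column stride scaled to bytes
+// (dst_col_stride_q * sizeof(int32)) and 'leaq (%%r12,%%r12,0x2)' computes
+// r13 = 3 * r12, so the four destination columns start at byte offsets
+// 0, r12, 2*r12 and r13 from dst_ptr. The function name is hypothetical.
+inline std::size_t DstColumnByteOffset(std::size_t dst_col_stride_q,
+                                       std::size_t column) {
+  return column * dst_col_stride_q * sizeof(std::int32_t);
+}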
+#endif
+
+}  // namespace gemmlowp
+
+#endif  // GEMMLOWP_INTERNAL_KERNEL_SSE_H_
diff --git a/internal/multi_thread_gemm.h b/internal/multi_thread_gemm.h
index 0aacddb..0234b26 100644
--- a/internal/multi_thread_gemm.h
+++ b/internal/multi_thread_gemm.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -27,9 +27,15 @@
 
 namespace gemmlowp {
 
-#ifdef GEMMLOWP_ALLOW_INLINE_ASM
-// Where inline asm is allowed, we use some busy-waiting,
-// preferably implemented using NOP instructions.
+// On x86 and ARM platforms we spin in a busy-wait loop before waiting on a
+// pthread condition variable. To implement that correctly we need explicit
+// memory load/store barriers.
+
+#if defined(GEMMLOWP_ALLOW_INLINE_ASM) && !defined(GEMMLOWP_NO_BUSYWAIT) && \
+    (defined(GEMMLOWP_ARM) || defined(GEMMLOWP_X86))
+
+#define GEMMLOWP_USE_BUSYWAIT
+
 const int kMaxBusyWaitNOPs = 32 * 1000 * 1000;
 
 #define GEMMLOWP_NOP "nop\n"
@@ -38,11 +44,10 @@
 #define GEMMLOWP_NOP4 GEMMLOWP_STRING_CONCAT_4(GEMMLOWP_NOP)
 #define GEMMLOWP_NOP16 GEMMLOWP_STRING_CONCAT_4(GEMMLOWP_NOP4)
 #define GEMMLOWP_NOP64 GEMMLOWP_STRING_CONCAT_4(GEMMLOWP_NOP16)
-#define GEMMLOWP_NOP256 GEMMLOWP_STRING_CONCAT_4(GEMMLOWP_NOP64)
 
 inline int Do256NOPs() {
-  asm volatile(GEMMLOWP_NOP256);
-  return 256;
+  asm volatile(GEMMLOWP_NOP64);
+  return 64;
 }
 
 #undef GEMMLOWP_STRING_CONCAT_4
@@ -52,20 +57,6 @@
 #undef GEMMLOWP_NOP4
 #undef GEMMLOWP_NOP
 
-#else  // not GEMMLOWP_ALLOW_INLINE_ASM
-
-// It is nontrivial to implement a good busy-waiting without
-// using asm; NOP instructions have the least side effects
-// and the lowest power usage; and since the whole busy-waiting
-// story is an optimization, it's not very interesting anyway
-// in places where we're slow anyway due to not being able to
-// use our inline asm kernels.
-
-const int kMaxBusyWaitNOPs = 0;
-inline int Do256NOPs() { return 0; }
-
-#endif  // not GEMMLOWP_ALLOW_INLINE_ASM
-
 inline void WriteBarrier() {
 #ifdef GEMMLOWP_ARM_32
   MemoryBarrier();
@@ -73,8 +64,6 @@
   asm volatile("dmb ishst" ::: "memory");
 #elif defined(GEMMLOWP_X86)
   asm volatile("sfence" ::: "memory");
-#elif defined(__mips__)
-  MemoryBarrier();
 #else
 #error "Unsupported architecture for WriteBarrier."
 #endif
@@ -87,13 +76,13 @@
   asm volatile("dmb ishld" ::: "memory");
 #elif defined(GEMMLOWP_X86)
   asm volatile("lfence" ::: "memory");
-#elif defined(__mips__)
-  MemoryBarrier();
 #else
 #error "Unsupported architecture for ReadBarrier."
 #endif
 }
 
+#endif
+
 // Waits until *var != initial_value.
 //
 // Returns the new value of *var. The guarantee here is that
@@ -119,23 +108,31 @@
 template <typename T>
 T WaitForVariableChange(volatile T* var, T initial_value, pthread_cond_t* cond,
                         pthread_mutex_t* mutex) {
-  int nops = 0;
-  // First, trivial case where the variable already changed value.
-  T new_value = *var;
-  if (new_value != initial_value) {
-    return new_value;
-  }
-  // Then try busy-waiting.
-  while (nops < kMaxBusyWaitNOPs) {
-    nops += Do256NOPs();
-    new_value = *var;
+#ifdef GEMMLOWP_USE_BUSYWAIT
+  // If we are on a platform that supports it, spin for some time.
+  {
+    int nops = 0;
+    // First, trivial case where the variable already changed value.
+    T new_value = *var;
     if (new_value != initial_value) {
+      ReadBarrier();
       return new_value;
     }
+    // Then try busy-waiting.
+    while (nops < kMaxBusyWaitNOPs) {
+      nops += Do256NOPs();
+      new_value = *var;
+      if (new_value != initial_value) {
+        ReadBarrier();
+        return new_value;
+      }
+    }
   }
+#endif
+
   // Finally, do real passive waiting.
   pthread_mutex_lock(mutex);
-  new_value = *var;
+  T new_value = *var;
   if (new_value == initial_value) {
     pthread_cond_wait(cond, mutex);
     new_value = *var;
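
For illustration, here is a minimal, self-contained sketch (not part of this patch) of the spin-then-block pattern that WaitForVariableChange implements above. The NOP-based pause, the memory barriers, and the template parameter are omitted, and the spin budget and function name are invented for the example.

```
#include <pthread.h>

// Sketch only: spin briefly on the flag, then fall back to a pthread
// condition variable for a passive wait.
int WaitUntilChanged(volatile int* var, int initial_value,
                     pthread_cond_t* cond, pthread_mutex_t* mutex) {
  // Bounded busy-wait: cheap wake-up latency if the change arrives soon.
  for (int i = 0; i < 1000 * 1000; i++) {
    const int spin_value = *var;
    if (spin_value != initial_value) {
      return spin_value;
    }
  }
  // Passive wait: sleep on the condition variable until notified.
  pthread_mutex_lock(mutex);
  int new_value = *var;
  while (new_value == initial_value) {
    pthread_cond_wait(cond, mutex);
    new_value = *var;
  }
  pthread_mutex_unlock(mutex);
  return new_value;
}
```
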
@@ -174,6 +171,9 @@
     pthread_mutex_lock(&mutex_);
     assert(count_ > 0);
     count_--;
+#ifdef GEMMLOWP_USE_BUSYWAIT
+    WriteBarrier();
+#endif
     if (count_ == 0) {
       pthread_cond_signal(&cond_);
     }
@@ -206,7 +206,7 @@
 struct Task {
   Task() : local_allocator(nullptr) {}
   virtual ~Task() {}
-  virtual void Run() const = 0;
+  virtual void Run() = 0;
   Allocator* local_allocator;
 };
 
@@ -283,10 +283,8 @@
       switch (state_to_act_upon) {
         case State::HasWork:
           // Got work to do! So do it, and then revert to 'Ready' state.
-          ReadBarrier();
           assert(task_);
           task_->Run();
-          delete task_;
           task_ = nullptr;
           ChangeState(State::Ready);
           break;
@@ -309,7 +307,9 @@
     assert(!task_);
     task->local_allocator = &local_allocator_;
     task_ = task;
+#ifdef GEMMLOWP_USE_BUSYWAIT
     WriteBarrier();
+#endif
     assert(state_ == State::Ready);
     ChangeState(State::HasWork);
   }
@@ -319,7 +319,7 @@
   pthread_t thread_;
 
   // The task to be worked on.
-  const Task* task_;
+  Task* task_;
 
   // The condition variable and mutex guarding state changes.
   pthread_cond_t state_cond_;
@@ -341,6 +341,11 @@
 // specific parallelization pattern that we use here:
 // a fixed number of workers can be given work, and one then
 // waits for all of them to finish.
+//
+// See MultiThreadGemmContextBase for how other WorkersPool implementations can
+// be used. Note that such implementations are free to schedule the tasks
+// passed to Execute in any order; callers of WorkersPool do not rely on the
+// order in which tasks are executed.
 class WorkersPool {
  public:
   WorkersPool() {}
@@ -351,16 +356,31 @@
     }
   }
 
-  BlockingCounter& counter_to_decrement_when_ready() {
-    return counter_to_decrement_when_ready_;
+  void Execute(const std::vector<Task*>& tasks) {
+    assert(tasks.size() >= 1);
+    // One of the tasks will be run on the current thread.
+    int workers_count = tasks.size() - 1;
+    CreateWorkers(workers_count);
+    assert(workers_count <= workers_.size());
+    counter_to_decrement_when_ready_.Reset(workers_count);
+    int n = 0;
+    std::for_each(tasks.begin(), tasks.end() - 1, [this, &n](Task* task) {
+      workers_[n++]->StartWork(task);
+    });
+    // Execute the remaining workload immediately on the current thread.
+    Task* task = tasks.back();
+    task->local_allocator = &main_thread_task_allocator_;
+    task->Run();
+    // Wait for the workers submitted above to finish.
+    counter_to_decrement_when_ready_.Wait();
+    // Clean up the tasks (best done from the same thread that allocated
+    // the memory).
+    std::for_each(tasks.begin(), tasks.end(),
+                  [](Task* task) { delete task; });
   }
 
-  // Give work to a specific worker.
-  void StartWorker(int index, Task* task_) {
-    assert(static_cast<std::size_t>(index) < workers_.size());
-    workers_[index]->StartWork(task_);
-  }
-
+ private:
   // Ensures that the pool has at least the given count of workers.
   // If any new worker has to be created, this function waits for it to
   // be ready.
@@ -375,7 +395,6 @@
     counter_to_decrement_when_ready_.Wait();
   }
 
- private:
   // copy construction disallowed
   WorkersPool(const WorkersPool&) = delete;
 
@@ -385,6 +404,14 @@
 
   // The BlockingCounter used to wait for the workers.
   BlockingCounter counter_to_decrement_when_ready_;
+
+  // For N-threaded operations, we use only N-1 worker threads, and the
+  // last task runs directly on the main thread. That task then uses this
+  // main_thread_task_allocator_; having a dedicated allocator for it
+  // (separate from the base allocator_) lets the same code be used for all
+  // tasks regardless of which thread they run on.
+  Allocator main_thread_task_allocator_;
 };
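
As a usage sketch (not part of this patch; PrintTask and RunSomeTasks are invented names), this is how a caller is expected to drive the new Execute() interface: Execute() blocks until all tasks have run, executes the last task on the calling thread, and deletes the tasks itself.

```
#include <cstdio>
#include <vector>

// Invented example task; any Task subclass works the same way.
struct PrintTask : gemmlowp::Task {
  explicit PrintTask(int i) : i_(i) {}
  void Run() override { std::printf("running task %d\n", i_); }
  int i_;
};

void RunSomeTasks(gemmlowp::WorkersPool* pool) {
  std::vector<gemmlowp::Task*> tasks;
  for (int i = 0; i < 4; i++) {
    tasks.push_back(new PrintTask(i));
  }
  // Three tasks go to worker threads; the fourth runs on this thread.
  // The pool deletes the tasks, so the caller must not reuse them.
  pool->Execute(tasks);
}
```
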
 
 // The task we use to implement a multi-threaded Gemm: a block of the
@@ -394,34 +421,41 @@
 template <typename KernelFormat, typename InputScalar, typename OutputScalar,
           typename BitDepthParams, MapOrder LhsOrder, MapOrder RhsOrder,
           MapOrder ResultOrder, typename LhsOffset, typename RhsOffset,
-          typename OutputPipelineType>
+  typename OutputPipelineType, typename GemmContextType>
 struct GemmWithPackedRhsTask : Task {
   typedef PackedSideBlock<typename KernelFormat::Lhs> PackedLhs;
   typedef PackedSideBlock<typename KernelFormat::Rhs> PackedRhs;
-  GemmWithPackedRhsTask(const KernelBase& _kernel,
+  GemmWithPackedRhsTask(GemmContextType* _context,
+                        const KernelBase& _kernel,
                         const MatrixMap<const InputScalar, LhsOrder>& _lhs,
                         const PackedRhs& _packed_rhs,
                         MatrixMap<OutputScalar, ResultOrder>* _result,
+                        const MatrixBlockBounds& _result_block,
                         const LhsOffset& _lhs_offset,
                         const RhsOffset& _rhs_offset,
                         const OutputPipelineType& _output_pipeline)
-      : kernel(_kernel),
+      : context(_context),
+        kernel(_kernel),
         lhs(_lhs),
         packed_rhs(_packed_rhs),
         result(*_result),
+        result_block(_result_block),
         lhs_offset(_lhs_offset),
         rhs_offset(_rhs_offset),
         output_pipeline(_output_pipeline) {}
 
-  void Run() const override {
+  void Run() override {
     ScopedProfilingLabel label("GemmWithPackedRhsTask");
 
-    const int rows = result.rows();
-    const int cols = result.cols();
+    const int rows = result_block.rows;
+    const int cols = result_block.cols;
     const int depth = lhs.cols();
 
     BlockParams block_params;
-    block_params.Init<KernelFormat>(rows, cols, depth, 1);
+    block_params.Init<KernelFormat>(rows, cols, depth, 1,
+                                    context->l1_bytes_to_use(),
+                                    context->l2_bytes_to_use(),
+                                    context->l2_rhs_factor());
 
     PackedLhs packed_lhs(Side::Lhs, local_allocator, block_params);
 
@@ -435,74 +469,92 @@
       for (int r = 0; r < rows; r += block_params.l2_rows) {
         int rs = std::min(block_params.l2_rows, rows - r);
 
-        PackLhs<BitDepthParams>(&packed_lhs, lhs.block(r, 0, rs, depth));
+        PackLhs(&packed_lhs, lhs.block(r, 0, rs, depth));
 
-        Compute(kernel, block_params, &packed_result, packed_lhs, packed_rhs);
+        Compute(kernel, block_params, &packed_result, packed_lhs, packed_rhs,
+                depth);
 
-        auto result_block = result.block(r, c, rs, cs);
-        UnpackResult<BitDepthParams>(&result_block, packed_result, depth,
-                                     packed_lhs.sums_of_each_slice(),
-                                     packed_rhs.sums_of_each_slice(),
-                                     lhs_offset, rhs_offset, output_pipeline);
+        auto curr_result_block = MatrixBlockBounds(
+            result_block.start_row + r, result_block.start_col + c, rs, cs);
+        UnpackResult<KernelFormat>(
+            &result, curr_result_block, packed_result, depth,
+            packed_lhs.sums_of_each_slice(), packed_rhs.sums_of_each_slice(),
+            lhs_offset.block(curr_result_block.start_row, rs),
+            rhs_offset.block(curr_result_block.start_col, cs), output_pipeline);
       }
     }
 
     local_allocator->Decommit();
   }
 
+  const GemmContextType* context;
   const KernelBase& kernel;
   const MatrixMap<const InputScalar, LhsOrder> lhs;
   const PackedRhs packed_rhs;
   MatrixMap<OutputScalar, ResultOrder> result;
+  const MatrixBlockBounds result_block;
   const LhsOffset& lhs_offset;
   const RhsOffset& rhs_offset;
   const OutputPipelineType& output_pipeline;
 };
 
-class MultiThreadGemmContext : public SingleThreadGemmContext {
+// This base class for multi-threading allows subclasses to implement their own
+// workers_pool() method. See MultiThreadGemmContext below for an example;
+// any other implementation of workers_pool() must return an object with the
+// same public methods as WorkersPool.
+class MultiThreadGemmContextBase : public SingleThreadGemmContext {
  public:
-  MultiThreadGemmContext() : max_num_threads_(0) {}
-
   void set_max_num_threads(int n) { max_num_threads_ = n; }
 
   int max_num_threads() const { return max_num_threads_; }
 
+ protected:
+  // The maximum number of threads to use (including the master thread).
+  // The default value 1 means single-threading. That is the default
+  // because gemmlowp's primary target is mobile hardware, where thermal
+  // constraints usually mean that it may not be realistic to use more
+  // than 1 CPU core even if multiple cores are present.
+  // The special value 0 means try to detect the number of hardware threads.
+  // Note: this assumes that all CPU cores are equivalent. That assumption
+  // is defeated on big.LITTLE ARM devices, where we have no API to query
+  // the number of big cores (which is typically what we would want to use,
+  // leaving aside the above-mentioned thermal issues). That is another
+  // reason why the best compromise is to let max_num_threads_ default to 1,
+  // so that users who want multi-threading must decide for themselves how
+  // many threads to use.
+  int max_num_threads_ = 1;
+};
+
+class MultiThreadGemmContext : public MultiThreadGemmContextBase {
+ public:
   WorkersPool* workers_pool() { return &workers_pool_; }
 
-  Allocator* main_thread_task_allocator() {
-    return &main_thread_task_allocator_;
-  }
-
- protected:
+ private:
   // The workers pool used by MultiThreadGemm. Making
   // this part of the context allows it to be persistent,
   // avoiding recreating threads on every Gemm.
   WorkersPool workers_pool_;
-
-  // The maximum number of worker threads to use (in addition
-  // to the master thread).
-  // The default value 0 means the default behavior of
-  // detecting the number of hardware threads. Nonzero values mean
-  // skipping and overriding hardware detection.
-  int max_num_threads_;
-
-  // For N-threaded operations, we will use only N-1 worker threads
-  // while the last task will be run directly on the main thread.
-  // It will then use this main_thread_task_allocator_; having a
-  // dedicated allocator for that (separate from the base allocator_)
-  // allows to use the same code for all tasks regardless of which
-  // thread they run on.
-  Allocator main_thread_task_allocator_;
 };
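
A short usage sketch (not part of this patch; ConfigureContext is an invented name) of the threading knob described above. The configured context would then be passed to gemmlowp's Gemm entry points, which are not shown here.

```
#include "multi_thread_gemm.h"  // adjust the include path to the build setup

void ConfigureContext() {
  gemmlowp::MultiThreadGemmContext context;

  // Default is 1: single-threaded, the conservative choice on mobile.
  context.set_max_num_threads(1);

  // Explicitly allow up to 4 threads (1 master thread + 3 workers).
  context.set_max_num_threads(4);

  // Special value 0: let gemmlowp detect the number of hardware threads.
  context.set_max_num_threads(0);
}
```
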
 
+// Needed by Chrome native builds.
+#ifndef _SC_NPROCESSORS_CONF
+#define _SC_NPROCESSORS_CONF _SC_NPROCESSORS_ONLN
+#endif
+
 // Determines how many threads should be used for a given Gemm
 // operation.
 template <int KernelRows>
-inline int HowManyThreads(MultiThreadGemmContext* context, int rows, int cols,
-                          int depth) {
-  // First check if the user set an explicit maximum number of threads.
-  int max_count = context->max_num_threads();
-  if (!max_count) {
+inline int HowManyThreads(int max_num_threads, int rows, int cols, int depth) {
+  // Early-exit in the default case where multi-threading is disabled.
+  if (max_num_threads == 1) {
+    return 1;
+  }
+
+  // Determine the maximum number of threads.
+  int max_count = max_num_threads;
+  // The special value 0 means try to determine the total number of cores.
+  if (max_count == 0) {
     // No user-set maximum number of threads, so we need to
     // do some hardware detection.
     // This is expensive to query so we do it only once.
@@ -553,15 +605,15 @@
 }
 
 // The main multi-threaded Gemm function.
-// To understand it, first read the code of SingleThreadedGemm().
+// To understand it, first read the code of SingleThreadGemm().
 // The parallelization scheme used here is to have this master function
 // pack a block of RHS and then start worker threads to pack a block of LHS
 // each, and accumulate the corresponding products.
 template <typename KernelFormat, typename InputScalar, typename OutputScalar,
           typename BitDepthParams, MapOrder LhsOrder, MapOrder RhsOrder,
           MapOrder ResultOrder, typename LhsOffset, typename RhsOffset,
-          typename OutputPipelineType>
-void MultiThreadGemm(MultiThreadGemmContext* context, const KernelBase& kernel,
+          typename OutputPipelineType, typename GemmContextType>
+void MultiThreadGemm(GemmContextType* context, const KernelBase& kernel,
                      const MatrixMap<const InputScalar, LhsOrder>& lhs,
                      const MatrixMap<const InputScalar, RhsOrder>& rhs,
                      MatrixMap<OutputScalar, ResultOrder>* result,
@@ -575,12 +627,16 @@
   int cols = result->cols();
   int depth = lhs.cols();
 
+  // Zero sizes should have been caught earlier, with an early return.
   assert(rows > 0);
   assert(cols > 0);
   assert(depth > 0);
 
-  const int thread_count =
-      HowManyThreads<KernelFormat::kRows>(context, rows, cols, depth);
+  // The case of rows < cols should have been caught earlier and transposed.
+  assert(rows >= cols);
+
+  const int thread_count = HowManyThreads<KernelFormat::kRows>(
+      context->max_num_threads(), rows, cols, depth);
   if (thread_count == 1) {
     return SingleThreadGemm<KernelFormat, InputScalar, OutputScalar,
                             BitDepthParams>(context, kernel, lhs, rhs, result,
@@ -589,26 +645,22 @@
   }
   assert(thread_count > 1);
 
-  // We choose to use a worker thread for all but one
-  // of the thread workloads. The remaining thread workload will be
-  // executed immediately on the current thread.
-  // In this way, the total number of threads (1 master, N-1 workers)
-  // equals the value returned by HowManyThread. This simple
-  // 1:1 mapping of threads to physical cores, is very important
-  // to getting good multithreaded performance especially for
-  // not-very-large GEMMs, and especially on Android.
-  const int workers_count = thread_count - 1;
+  // Simple 1:1 mapping of tasks to physical cores, which is very important
+  // for getting good multithreaded performance, especially for not-very-large
+  // GEMMs, and especially on Android.
+  const int task_count = thread_count;
 
   Allocator* allocator = context->allocator();
-  WorkersPool* workers_pool = context->workers_pool();
-
-  workers_pool->CreateWorkers(workers_count);
+  auto* workers_pool = context->workers_pool();
 
   BlockParams block_params;
-  block_params.Init<KernelFormat>(rows, cols, depth, workers_count);
+  block_params.Init<KernelFormat>(rows, cols, depth, task_count,
+                                  context->l1_bytes_to_use(),
+                                  context->l2_bytes_to_use(),
+                                  context->l2_rhs_factor());
 
-  PackedSideBlock<typename KernelFormat::Rhs> packed_rhs(
-      Side::Rhs, allocator, block_params);
+  PackedSideBlock<typename KernelFormat::Rhs> packed_rhs(Side::Rhs, allocator,
+                                                         block_params);
   allocator->Commit();
 
   // We loop over large blocks of the RHS.
@@ -616,37 +668,29 @@
     int cs = std::min(block_params.l2_cols, cols - c);
 
     // Pack a large block of the RHS.
-    PackRhs<BitDepthParams>(&packed_rhs, rhs.block(0, c, depth, cs));
+    PackRhs(&packed_rhs, rhs.block(0, c, depth, cs));
 
     // Give work to each worker.
+    std::vector<Task*> tasks;
     int next_start_row = 0;
-    workers_pool->counter_to_decrement_when_ready().Reset(workers_count);
-    for (int thread = 0; thread < thread_count; thread++) {
+    for (int n = 0; n < task_count; ++n) {
       int start_row = next_start_row;
       next_start_row = std::min(rows, RoundUp<KernelFormat::kRows>(
-                                          rows * (thread + 1) / thread_count));
+                                          rows * (n + 1) / task_count));
 
       int block_rows = next_start_row - start_row;
       auto lhs_block = lhs.block(start_row, 0, block_rows, depth);
-      auto result_block = result->block(start_row, c, block_rows, cs);
-      typedef GemmWithPackedRhsTask<KernelFormat, InputScalar, OutputScalar,
-                                    BitDepthParams, LhsOrder, RhsOrder,
-                                    ResultOrder, LhsOffset, RhsOffset,
-                                    OutputPipelineType>
+      typedef GemmWithPackedRhsTask<
+          KernelFormat, InputScalar, OutputScalar, BitDepthParams, LhsOrder,
+          RhsOrder, ResultOrder, LhsOffset, RhsOffset, OutputPipelineType,
+          GemmContextType>
           TaskType;
-      auto task = new TaskType(kernel, lhs_block, packed_rhs, &result_block,
-                               lhs_offset, rhs_offset, output_pipeline);
-      if (thread < workers_count) {
-        workers_pool->StartWorker(thread, task);
-      } else {
-        // Execute the remaining workload immediately on the current thread.
-        task->local_allocator = context->main_thread_task_allocator();
-        task->Run();
-        delete task;
-      }
+      tasks.push_back(new TaskType(
+          context, kernel, lhs_block, packed_rhs, result,
+          MatrixBlockBounds(start_row, c, block_rows, cs), lhs_offset,
+          rhs_offset, output_pipeline));
     }
-    // Wait for the workers.
-    workers_pool->counter_to_decrement_when_ready().Wait();
+    // Execute the work on the workers (and partially on this thread).
+    workers_pool->Execute(tasks);
   }
 
   allocator->Decommit();
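
For reference, the row partitioning above splits [0, rows) into task_count contiguous ranges whose boundaries are rounded up to a multiple of KernelFormat::kRows. A standalone sketch with invented values (rows = 100, task_count = 3, kRows = 4), mirroring the arithmetic rather than gemmlowp's actual RoundUp helper:

```
#include <algorithm>
#include <cstdio>

// Round n up to the next multiple of Modulus.
template <int Modulus>
int RoundUp(int n) {
  return ((n + Modulus - 1) / Modulus) * Modulus;
}

int main() {
  const int rows = 100;
  const int task_count = 3;
  int next_start_row = 0;
  for (int n = 0; n < task_count; ++n) {
    const int start_row = next_start_row;
    next_start_row =
        std::min(rows, RoundUp<4>(rows * (n + 1) / task_count));
    // Prints row ranges [0, 36), [36, 68), [68, 100): each boundary is a
    // multiple of 4 except possibly the last one, which is clamped to rows.
    std::printf("task %d: rows [%d, %d)\n", n, start_row, next_start_row);
  }
  return 0;
}
```
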
diff --git a/internal/output.h b/internal/output.h
index 28c881a..8ccb8ee 100644
--- a/internal/output.h
+++ b/internal/output.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -23,216 +23,204 @@
 #include <tuple>
 #include <type_traits>
 
+#include "../fixedpoint/fixedpoint.h"
 #include "../public/output_stages.h"
-#include "fixedpoint.h"
+#include "simd_wrappers.h"
 
 namespace gemmlowp {
 
-// A Fragment is a small fixed-size matrix typically stored in one or
-// a few architecture-specific SIMD vectors. Besides plain old scalar types
-// such as int32_t, Fragment types are what can be used as input/output data
-// types for output pipeline stages.
-//
-// More details:
-//
-// In the generic scalar code in this file, we have only implemented
-// evaluation of output stages for scalar inputs (e.g. plain int32_t values).
-// Other files (e.g. output_neon.h) are to provide SIMD paths by implementing
-// evaluation of output stages for SIMD vector types. However, this raises
-// the question of how the different values ("lanes") in a SIMD vector
-// correspond to different values in the whole matrices. For simple entry-wise
-// output stages, this doesn't matter, but for other output stages depending
-// on position within the whole matrix, this does matter. To solve this
-// problem, rather than implementing evaluation of output stages for raw
-// SIMD vector types, we wrap SIMD vector types in "fragment" structs that
-// bring the additional structure of "shape" i.e. mapping SIMD lanes to
-// matrix entries, and we specialize evaluation of output stage for such
-// fragment types. The Fragment template struct here is how we generate
-// all fragment structs. For example, in output_neon.h, it may be specialized
-// with DataType=int32x4_t, Rows=4, Cols=1. MapOrder doesn't matter for
-// vector shapes. While Fragment is only used for SIMD paths, we leave it
-// here in this platform-generic file because this same template should
-// cover the needs of any SIMD architectures.
-template <typename tDataType, int tRows, int tCols, MapOrder tOrder>
-struct Fragment {
-  typedef tDataType DataType;
-  static const int kRows = tRows;
-  static const int kCols = tCols;
-  static const MapOrder kOrder = tOrder;
-
-  Fragment() {}
-  Fragment(const DataType& d) : data(d) {}
-  operator DataType() const { return data; }
-
-  DataType data;
-};
-
-typedef Fragment<std::int32_t, 1, 1, MapOrder::ColMajor> FragmentInt32x1x1;
-typedef Fragment<std::uint8_t, 1, 1, MapOrder::ColMajor> FragmentUint8x1x1;
-
-// OutputStageEvalImpl is the template that we specialize to provide
-// implementations of each output stage for each type of input data.
-//
-// Each specialization provides a OutputType typedef and an Eval function
-// returning OutputType. The OutputType typically depends on the InputType.
-//
-// There are two dimensions in which input data types can vary:
-//   1. Different output stages may expect different data types. The
-//      only hard constraint is that the first stage accepts int32, as
-//      the unpack stage produces int32 accumulators.
-//   2. For a given scalar data type such as int32, there is still the
-//      possibility of having SIMD vector types such as NEON int32x4_t,
-//      typically wrapped as "fragment" types, see struct Fragment.
-//      Thus, there can be several OutputStageEvalImpl
-//      specializations for a single OutputStageType, for different
-//      InputType's.
-template <typename OutputStageType, typename InputType>
-struct OutputStageEvalImpl {
+template <typename OutputStage, typename InputBufferType>
+struct OutputStageEvalBufferImpl {
   // This generic template body should never be hit.
   static_assert(
-      std::is_same<InputType, void>::value,
+      std::is_same<InputBufferType, void>::value,
       "Unimplemented: missing implementation of this output pipeline stage "
       "for this data type. This would happen if some architecture-specific "
       "SIMD back-end (output_$arch.h) were incomplete.");
-
-  OutputStageEvalImpl(const OutputStageType&) {}
 };
 
-// Implementation of OutputStageQuantizeDownInt32ToUint8Scale for scalar data
-template <>
-struct OutputStageEvalImpl<OutputStageQuantizeDownInt32ToUint8Scale,
-                           FragmentInt32x1x1> {
-  typedef FragmentInt32x1x1 InputType;
-  typedef FragmentInt32x1x1 OutputType;
-  typedef OutputStageQuantizeDownInt32ToUint8Scale OutputStage;
+template <typename OutputStage, typename InputType>
+struct OutputStageEvalImpl {
+  static constexpr int kRows = InputType::kRows;
+  static constexpr int kCols = InputType::kCols;
+  using InputBufferType = typename InputType::BufferType;
+  using BufferEvalImplType =
+      OutputStageEvalBufferImpl<OutputStage, InputBufferType>;
+  using OutputBufferType = typename BufferEvalImplType::OutputType;
+  using OutputScalarType = typename OutputBufferType::ScalarType;
+  using OutputType = RegisterBlock<OutputScalarType, kRows, kCols>;
 
-  OutputStageEvalImpl(const OutputStage& s) : output_stage(s) {}
+  OutputStageEvalImpl(const OutputStage& s) : buffer_eval_impl(s) {}
 
   OutputType Eval(InputType input, int, int) const {
-    const std::int32_t result_shift = output_stage.result_shift;
+    OutputType output;
+    output.buf = buffer_eval_impl.Eval(input.buf);
+    return output;
+  }
+
+  const BufferEvalImplType buffer_eval_impl;
+};
+
+template <int Size>
+struct OutputStageEvalBufferImpl<OutputStageQuantizeDownInt32ToUint8Scale,
+                                 RegisterBuffer<std::int32_t, Size>> {
+  using InputType = RegisterBuffer<std::int32_t, Size>;
+  using OutputType = RegisterBuffer<std::int32_t, Size>;
+
+  typedef OutputStageQuantizeDownInt32ToUint8Scale OutputStage;
+
+  OutputStageEvalBufferImpl(const OutputStage& s) : output_stage(s) {}
+
+  OutputType Eval(InputType input) const {
+    const int result_shift = output_stage.result_shift;
     const std::int32_t result_mult_int = output_stage.result_mult_int;
-    const std::int32_t result_offset = output_stage.result_offset;
-    const std::int32_t kRoundingTerm =
-        (result_shift < 1) ? 0 : (1 << (result_shift - 1));
-    return ((input + result_offset) * result_mult_int + kRoundingTerm) >>
-           result_shift;
+    using RegisterType = typename InputType::RegisterType;
+    const RegisterType result_offset =
+        Dup<RegisterType>(output_stage.result_offset);
+    OutputType output;
+    for (int i = 0; i < InputType::kRegisterCount; i++) {
+      output.reg[i] = RoundingDivideByPOT(
+          Mul(Add(input.reg[i], result_offset), result_mult_int), result_shift);
+    }
+    return output;
   }
 
   const OutputStage& output_stage;
 };
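
For reference, the per-register arithmetic above computes the same quantize-down formula that the previous scalar path spelled out explicitly; a plain-int32 sketch (QuantizeDownScalar is an invented name, using the add-then-shift rounding of the old scalar code):

```
#include <cstdint>

// Sketch of OutputStageQuantizeDownInt32ToUint8Scale on a single value:
//   result = ((input + result_offset) * result_mult_int + rounding) >> shift
// where rounding = 1 << (shift - 1) makes the right shift round to nearest.
std::int32_t QuantizeDownScalar(std::int32_t input, std::int32_t result_offset,
                                std::int32_t result_mult_int,
                                int result_shift) {
  const std::int32_t rounding =
      (result_shift < 1) ? 0 : (1 << (result_shift - 1));
  return ((input + result_offset) * result_mult_int + rounding) >>
         result_shift;
}
```
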
 
-template <>
-struct OutputStageEvalImpl<
-    OutputStageQuantizeDownInt32ToUint8ScalePC<VectorShape::Col>,
-    FragmentInt32x1x1> {
-  typedef FragmentInt32x1x1 InputType;
-  typedef FragmentInt32x1x1 OutputType;
-  typedef OutputStageQuantizeDownInt32ToUint8ScalePC<VectorShape::Col>
-      OutputStage;
+template <int Rows, int Cols, VectorShape Shape>
+struct OutputStageEvalImpl<OutputStageQuantizeDownInt32ToUint8ScalePC<Shape>,
+                           RegisterBlock<std::int32_t, Rows, Cols>> {
+  typedef RegisterBlock<std::int32_t, Rows, Cols> InputType;
+  typedef RegisterBlock<std::int32_t, Rows, Cols> OutputType;
+  typedef OutputStageQuantizeDownInt32ToUint8ScalePC<Shape> OutputStage;
 
   OutputStageEvalImpl(const OutputStage& s) : output_stage(s) {}
 
   OutputType Eval(InputType input, int row, int col) const {
-    const std::int32_t result_shift = output_stage.result_shift;
-    const std::int32_t result_mult_int = output_stage.result_mult_int(row);
-    const std::int32_t result_offset = output_stage.result_offset(row);
-    const std::int32_t kRoundingTerm =
-        (result_shift < 1) ? 0 : (1 << (result_shift - 1));
-    return ((input + result_offset) * result_mult_int + kRoundingTerm) >>
-           result_shift;
+    OutputType output;
+    const int result_shift = output_stage.result_shift;
+    const int pos = Shape == VectorShape::Col ? row : col;
+    const auto result_mult_int =
+        LoadForBroadcasting<InputType>(output_stage.result_mult_int, pos);
+    const auto result_offset =
+        LoadForBroadcasting<InputType>(output_stage.result_offset, pos);
+    const auto dividend = BroadcastMul<InputType>(
+        BroadcastAdd<InputType>(input, result_offset), result_mult_int);
+    for (int i = 0; i < InputType::kRegisterCount; i++) {
+      output.buf.reg[i] =
+          RoundingDivideByPOT(dividend.buf.reg[i], result_shift);
+    }
+    return output;
   }
 
   const OutputStage& output_stage;
 };
 
-template <>
-struct OutputStageEvalImpl<
-    OutputStageQuantizeDownInt32ToUint8ScalePC<VectorShape::Row>,
-    FragmentInt32x1x1> {
-  typedef FragmentInt32x1x1 InputType;
-  typedef FragmentInt32x1x1 OutputType;
-  typedef OutputStageQuantizeDownInt32ToUint8ScalePC<VectorShape::Row>
-      OutputStage;
+template <int Size>
+struct OutputStageEvalBufferImpl<
+    OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint,
+    RegisterBuffer<std::int32_t, Size>> {
+  typedef RegisterBuffer<std::int32_t, Size> InputType;
+  typedef RegisterBuffer<std::int32_t, Size> OutputType;
 
-  OutputStageEvalImpl(const OutputStage& s) : output_stage(s) {}
+  typedef OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint OutputStage;
 
-  OutputType Eval(InputType input, int row, int col) const {
-    const std::int32_t result_shift = output_stage.result_shift;
-    const std::int32_t result_mult_int = output_stage.result_mult_int(col);
-    const std::int32_t result_offset = output_stage.result_offset(col);
-    const std::int32_t kRoundingTerm =
-        (result_shift < 1) ? 0 : (1 << (result_shift - 1));
-    return ((input + result_offset) * result_mult_int + kRoundingTerm) >>
-           result_shift;
+  OutputStageEvalBufferImpl(const OutputStage& s) : output_stage(s) {}
+
+  OutputType Eval(InputType input) const {
+    OutputType output;
+    using RegisterType = typename InputType::RegisterType;
+    const RegisterType result_offset_after_shift =
+        Dup<RegisterType>(output_stage.result_offset_after_shift);
+    for (int i = 0; i < InputType::kRegisterCount; i++) {
+      const RegisterType mulhigh_val = SaturatingRoundingDoublingHighMul(
+          input.reg[i], output_stage.result_fixedpoint_multiplier);
+      output.reg[i] =
+          Add(RoundingDivideByPOT(mulhigh_val, output_stage.result_shift),
+              result_offset_after_shift);
+    }
+    return output;
   }
 
   const OutputStage& output_stage;
 };
 
 // Implementation of OutputStageSaturatingCastToUint8 for scalar data
-template <>
-struct OutputStageEvalImpl<OutputStageSaturatingCastToUint8,
-                           FragmentInt32x1x1> {
-  typedef FragmentInt32x1x1 InputType;
-  typedef FragmentUint8x1x1 OutputType;
+template <int Size>
+struct OutputStageEvalBufferImpl<OutputStageSaturatingCastToUint8,
+                                 RegisterBuffer<std::int32_t, Size>> {
+  typedef RegisterBuffer<std::int32_t, Size> InputType;
+  typedef RegisterBuffer<std::uint8_t, Size> OutputType;
+  static_assert(InputType::kRegisterLanes == 1,
+                "This path is only for scalar values");
+
   typedef OutputStageSaturatingCastToUint8 OutputStage;
 
-  OutputStageEvalImpl(const OutputStage&) {}
+  OutputStageEvalBufferImpl(const OutputStage&) {}
 
-  OutputType Eval(InputType input, int, int) const {
-    std::int32_t data = input.data;
-    return data > 255 ? 255 : data < 0 ? 0 : data;
+  OutputType Eval(InputType input) const {
+    OutputType output;
+    for (int i = 0; i < InputType::kRegisterCount; i++) {
+      std::int32_t data = input.reg[i];
+      output.reg[i] = data > 255 ? 255 : data < 0 ? 0 : data;
+    }
+    return output;
   }
 };
 
-// Implementation of OutputStageBiasAddition for scalar data
-template <typename VectorType>
+template <int Rows, int Cols, typename VectorType>
 struct OutputStageEvalImpl<OutputStageBiasAddition<VectorType>,
-                           FragmentInt32x1x1> {
-  typedef FragmentInt32x1x1 InputType;
-  typedef FragmentInt32x1x1 OutputType;
+                           RegisterBlock<std::int32_t, Rows, Cols>> {
+  typedef RegisterBlock<std::int32_t, Rows, Cols> InputType;
+  typedef RegisterBlock<std::int32_t, Rows, Cols> OutputType;
   typedef OutputStageBiasAddition<VectorType> OutputStage;
 
   OutputStageEvalImpl(const OutputStage& s) : output_stage(s) {}
 
   OutputType Eval(InputType input, int row, int col) const {
-    if (VectorType::kShape == VectorShape::Row) {
-      return input + output_stage.bias_vector(col);
-    } else {
-      return input + output_stage.bias_vector(row);
-    }
+    const int pos = VectorType::kShape == VectorShape::Row ? col : row;
+    return BroadcastAdd<InputType>(
+        input, LoadForBroadcasting<InputType>(output_stage.bias_vector, pos));
   }
 
   const OutputStage& output_stage;
 };
 
-// Implementation of OutputStageClamp for scalar data
-template <>
-struct OutputStageEvalImpl<OutputStageClamp, FragmentInt32x1x1> {
-  typedef FragmentInt32x1x1 InputType;
-  typedef FragmentInt32x1x1 OutputType;
+template <int Size>
+struct OutputStageEvalBufferImpl<OutputStageClamp,
+                                 RegisterBuffer<std::int32_t, Size>> {
+  typedef RegisterBuffer<std::int32_t, Size> InputType;
+  typedef RegisterBuffer<std::int32_t, Size> OutputType;
+
   typedef OutputStageClamp OutputStage;
 
-  OutputStageEvalImpl(const OutputStage& s) : output_stage(s) {}
+  OutputStageEvalBufferImpl(const OutputStage& s) : output_stage(s) {}
 
-  OutputType Eval(InputType input, int, int) const {
-    const std::int32_t min = output_stage.min;
-    const std::int32_t max = output_stage.max;
-    return std::min(std::max(input.data, min), max);
+  OutputType Eval(InputType input) const {
+    using RegisterType = typename InputType::RegisterType;
+    const RegisterType min = Dup<RegisterType>(output_stage.min);
+    const RegisterType max = Dup<RegisterType>(output_stage.max);
+    OutputType output;
+    for (int i = 0; i < InputType::kRegisterCount; i++) {
+      output.reg[i] = Min(Max(input.reg[i], min), max);
+    }
+    return output;
   }
 
   const OutputStage& output_stage;
 };
 
-// Implementation of OutputStageTanh for either scalar or SIMD data
-template <typename tInputType>
-struct OutputStageTanhEvalImpl {
-  typedef tInputType InputType;
-  typedef InputType OutputType;
-  typedef typename InputType::DataType DataType;
+template <int Size>
+struct OutputStageEvalBufferImpl<OutputStageTanh,
+                                 RegisterBuffer<std::int32_t, Size>> {
+  typedef RegisterBuffer<std::int32_t, Size> InputType;
+  typedef RegisterBuffer<std::int32_t, Size> OutputType;
+  using RegisterType = typename InputType::RegisterType;
+  typedef RegisterType DataType;
   typedef OutputStageTanh OutputStage;
 
-  OutputStageTanhEvalImpl(const OutputStage& s) : output_stage(s) {
+  OutputStageEvalBufferImpl(const OutputStage& s) : output_stage(s) {
     const std::int32_t real_zero_as_int32 = output_stage.real_zero_as_int32;
     const std::int32_t real_amplitude_as_int32 =
         output_stage.real_amplitude_as_int32;
@@ -248,8 +236,8 @@
       inverse_amplitude_normalized_double *= 2;
       inverse_amplitude_neg_exponent++;
     }
-    inverse_amplitude_normalized =
-        ToFixedPoint<DataType, 0>(inverse_amplitude_normalized_double);
+    inverse_amplitude_normalized = FixedPoint<DataType, 0>::FromDouble(
+        inverse_amplitude_normalized_double);
 
     double amplitude_normalized_double = real_amplitude_as_int32;
     amplitude_exponent = 0;
@@ -258,39 +246,44 @@
       amplitude_exponent++;
     }
     amplitude_normalized =
-        ToFixedPoint<DataType, 0>(amplitude_normalized_double);
+        FixedPoint<DataType, 0>::FromDouble(amplitude_normalized_double);
   }
 
-  OutputType Eval(InputType input, int, int) const {
+  OutputType Eval(InputType input) const {
     const std::int32_t real_zero_as_int32 = output_stage.real_zero_as_int32;
 
     typedef FixedPoint<DataType, 3> F3;
     typedef FixedPoint<DataType, 0> F0;
 
-    // fixed-point affine transformation
-    DataType input_centered =
-        Sub(input.data, Dup<DataType>(real_zero_as_int32));
-    F3 fixedpoint_input =
-        F3::FromRaw(input_centered) * inverse_amplitude_normalized;
-    // left shift
-    fixedpoint_input.raw() =
-        ShiftLeft(fixedpoint_input.raw(), 28 - inverse_amplitude_neg_exponent);
-    // fixed-point tanh and multiplication
-    F0 fixedpoint_output = tanh(fixedpoint_input) * amplitude_normalized;
-    // right shift
-    DataType int32_output =
-        Add(Dup<DataType>(real_zero_as_int32),
-            ShiftRight(fixedpoint_output.raw(), 31 - amplitude_exponent));
+    OutputType output;
 
-    DataType mask_if_below_cutoff_min =
-        MaskIfLessThanOrEqual(input.data, Dup<DataType>(input_cutoff_min));
-    DataType mask_if_above_cutoff_max =
-        MaskIfGreaterThanOrEqual(input.data, Dup<DataType>(input_cutoff_max));
+    for (int i = 0; i < OutputType::kRegisterCount; i++) {
+      // fixed-point affine transformation
+      DataType input_centered =
+          Sub(input.reg[i], Dup<DataType>(real_zero_as_int32));
+      F3 fixedpoint_input =
+          F3::FromRaw(input_centered) * inverse_amplitude_normalized;
+      // left shift
+      fixedpoint_input.raw() = ShiftLeft(fixedpoint_input.raw(),
+                                         28 - inverse_amplitude_neg_exponent);
+      // fixed-point tanh and multiplication
+      F0 fixedpoint_output = tanh(fixedpoint_input) * amplitude_normalized;
+      // right shift
+      DataType int32_output =
+          Add(Dup<DataType>(real_zero_as_int32),
+              ShiftRight(fixedpoint_output.raw(), 31 - amplitude_exponent));
 
-    return SelectUsingMask(
-        mask_if_below_cutoff_min, Dup<DataType>(output_min),
-        SelectUsingMask(mask_if_above_cutoff_max, Dup<DataType>(output_max),
-                        int32_output));
+      DataType mask_if_below_cutoff_min =
+          MaskIfLessThanOrEqual(input.reg[i], Dup<DataType>(input_cutoff_min));
+      DataType mask_if_above_cutoff_max = MaskIfGreaterThanOrEqual(
+          input.reg[i], Dup<DataType>(input_cutoff_max));
+
+      output.reg[i] = SelectUsingMask(
+          mask_if_below_cutoff_min, Dup<DataType>(output_min),
+          SelectUsingMask(mask_if_above_cutoff_max, Dup<DataType>(output_max),
+                          int32_output));
+    }
+    return output;
   }
 
   const OutputStage& output_stage;
@@ -302,13 +295,6 @@
   int amplitude_exponent;
 };
 
-template <>
-struct OutputStageEvalImpl<OutputStageTanh, FragmentInt32x1x1>
-    : OutputStageTanhEvalImpl<FragmentInt32x1x1> {
-  OutputStageEvalImpl(const OutputStageTanh& output_stage)
-      : OutputStageTanhEvalImpl(output_stage) {}
-};
-
 // OutputPipelineOutputType is a helper to determine the output data type of a
 // pipeline, for a
 // given input data type. It is a recursive template; see the explanation on
@@ -377,13 +363,32 @@
   }
 };
 
+template <typename RegisterBlockType, typename DstType>
+struct StoreFinalOutputImpl {
+  static_assert(std::is_same<RegisterBlockType, void>::value,
+                "This generic impl should never be hit");
+};
+
+template <typename ScalarType, int Rows, int Cols, typename DstType>
+struct StoreFinalOutputImpl<RegisterBlock<ScalarType, Rows, Cols>, DstType> {
+  using RegisterBlockType = RegisterBlock<ScalarType, Rows, Cols>;
+  static void Run(const RegisterBlockType& src, DstType* dst, int row,
+                  int col) {
+    for (int r = 0; r < Rows; r++) {
+      for (int c = 0; c < Cols; c++) {
+        *dst->data(row + r, col + c) = src.buf.reg[r + c * Rows];
+      }
+    }
+  }
+};
+
 // StoreFinalOutput takes the final value at the end of the output pipeline and
 // stores it into the destination matrix. It can be specialized for different
 // data types; the generic implementation here is typically used only for plain
 // old scalar (not SIMD) types.
-template <typename OutputType, typename DstType>
-void StoreFinalOutput(OutputType value, DstType* dst, int row, int col) {
-  *dst->data(row, col) = value;
+template <typename RegisterBlockType, typename DstType>
+void StoreFinalOutput(RegisterBlockType src, DstType* dst, int row, int col) {
+  StoreFinalOutputImpl<RegisterBlockType, DstType>::Run(src, dst, row, col);
 }
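
The generic implementation above assumes a particular in-register layout: a Rows x Cols register block stores its scalars column-major, which is what the reg[r + c * Rows] indexing expresses. A tiny standalone sketch of that mapping (FlatIndex is an invented name):

```
// Column-major flattening: entry (row r, col c) of a Rows-tall block lives
// at flat index r + c * Rows.
template <int Rows>
constexpr int FlatIndex(int r, int c) {
  return r + c * Rows;
}

// Example: in a 4x2 block, entry (r = 3, c = 1) is buffer element 7.
static_assert(FlatIndex<4>(3, 1) == 7, "column-major indexing");
```
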
 
 template <typename OutputPipelineType, typename InputType>
@@ -396,20 +401,23 @@
   // result
   // of the unpack stage and stores it into the destination matrix.
   template <typename DstType>
-  void Execute(InputType input, DstType* dst, int row, int col) {
+  void Execute(InputType input, DstType* dst, int src_global_row,
+               int src_global_col, int dst_row, int dst_col) const {
     // Statically assert that the output pipeline matches the given destination
     // matrix's scalar type.
-    typedef typename OutputPipelineOutputType<OutputPipelineType, 0,
-                                              FragmentInt32x1x1>::Type::DataType
+    typedef typename OutputPipelineOutputType<
+        OutputPipelineType, 0, InputType>::Type::BufferType::ScalarType
+        ScalarOutputType;
     typedef typename DstType::Scalar ScalarDstType;
     static_assert(std::is_same<ScalarOutputType, ScalarDstType>::value,
                   "mismatched destination scalar type and output pipeline");
 
     // Evaluate the output pipeline.
-    auto output = output_pipeline_eval_impl_.Eval(input, row, col);
+    auto output =
+        output_pipeline_eval_impl_.Eval(input, src_global_row, src_global_col);
     // Store the result into the destination matrix.
-    StoreFinalOutput(output, dst, row, col);
+    StoreFinalOutput(output, dst, dst_row, dst_col);
   }
 
   const OutputPipelineEvalImpl<OutputPipelineType, 0, InputType>
@@ -418,4 +426,10 @@
 
 }  // namespace gemmlowp
 
+#ifdef GEMMLOWP_NEON
+#include "output_neon.h"
+#elif defined(GEMMLOWP_SSE4)
+#include "output_sse.h"
+#endif
+
 #endif  // GEMMLOWP_INTERNAL_OUTPUT_H_
diff --git a/internal/output_neon.h b/internal/output_neon.h
index ed5f57c..7e111e5 100644
--- a/internal/output_neon.h
+++ b/internal/output_neon.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -23,257 +23,410 @@
 
 namespace gemmlowp {
 
-// Definitions of Fragment types wrapping NEON vector types.
-typedef Fragment<int32x4_t, 4, 1, MapOrder::ColMajor> NEONFragmentInt32x4x1;
-typedef Fragment<int32x4x4_t, 16, 1, MapOrder::ColMajor> NEONFragmentInt32x16x1;
-typedef Fragment<uint8x8_t, 4, 1, MapOrder::ColMajor> NEONFragmentUint8x4x1;
-typedef Fragment<uint8x16_t, 16, 1, MapOrder::ColMajor> NEONFragmentUint8x16x1;
+template <>
+struct OutputStageEvalBufferImpl<OutputStageSaturatingCastToUint8,
+                                 RegBufferInt32<4>> {
+  typedef RegBufferInt32<4> InputType;
+  typedef RegBufferUint8<4> OutputType;
 
-// The code in unpack_neon.h will whenever possible process
-// 16 entries at once (4 SIMD vectors of 4 entries each at once),
-// to offer the compiler better optimization opportunities, reducing
-// register dependencies. From the perspective of interfacing with the output
-// pipeline, this takes the form of passing Fragment types wrapping int32x4x4_t
-// data. In most cases, such data is handled simply by handling separately its
-// 4 int32x4_t components. This partial specialization handles that for
-// arbitrary output stages implementing a int32x4_t path. Only some output
-// stages below will override this to use custom code to handle int32x4x4_t
-// data all at once (see OutputStageSaturatingCastToUint8 below).
-template <typename OutputStageType>
-struct OutputStageEvalImpl<OutputStageType, NEONFragmentInt32x16x1> {
-  typedef NEONFragmentInt32x16x1 InputType;
-  typedef NEONFragmentInt32x16x1 OutputType;
-  typedef OutputStageEvalImpl<OutputStageType, NEONFragmentInt32x4x1>
-      ImplInt32x4;
-  OutputStageEvalImpl(const OutputStageType& s) : impl_int32x4(s) {}
+  typedef OutputStageSaturatingCastToUint8 OutputStage;
 
-  OutputType Eval(InputType input, int row, int col) const {
+  OutputStageEvalBufferImpl(const OutputStage&) {}
+
+  OutputType Eval(InputType input) const {
     OutputType output;
+    int16x4_t res_16 = vqmovn_s32(input.reg[0]);
+    uint8x8_t res_8 = vqmovun_s16(vcombine_s16(res_16, res_16));
+    output.reg[0] = vget_lane_u32(vreinterpret_u32_u8(res_8), 0);
+    return output;
+  }
+};
 
+template <>
+struct OutputStageEvalBufferImpl<OutputStageSaturatingCastToUint8,
+                                 RegBufferInt32<8>> {
+  typedef RegBufferInt32<8> InputType;
+  typedef RegBufferUint8<8> OutputType;
+
+  typedef OutputStageSaturatingCastToUint8 OutputStage;
+
+  OutputStageEvalBufferImpl(const OutputStage&) {}
+
+  OutputType Eval(InputType input) const {
+    OutputType output;
+    int16x8_t res_16 =
+        vcombine_s16(vqmovn_s32(input.reg[0]), vqmovn_s32(input.reg[1]));
+    output.reg[0] = vqmovun_s16(res_16);
+    return output;
+  }
+};
+
+template <>
+struct OutputStageEvalBufferImpl<OutputStageSaturatingCastToUint8,
+                                 RegBufferInt32<16>> {
+  typedef RegBufferInt32<16> InputType;
+  typedef RegBufferUint8<16> OutputType;
+
+  typedef OutputStageSaturatingCastToUint8 OutputStage;
+
+  OutputStageEvalBufferImpl(const OutputStage&) {}
+
+  OutputType Eval(InputType input) const {
+    OutputType output;
+    int16x8_t res_16_0 =
+        vcombine_s16(vqmovn_s32(input.reg[0]), vqmovn_s32(input.reg[1]));
+    int16x8_t res_16_1 =
+        vcombine_s16(vqmovn_s32(input.reg[2]), vqmovn_s32(input.reg[3]));
+    output.reg[0] = vqmovun_s16(res_16_0);
+    output.reg[1] = vqmovun_s16(res_16_1);
+    return output;
+  }
+};
+
+template <>
+struct OutputStageEvalBufferImpl<OutputStageSaturatingCastToUint8,
+                                 RegBufferInt32<32>> {
+  typedef RegBufferInt32<32> InputType;
+  typedef RegBufferUint8<32> OutputType;
+
+  typedef OutputStageSaturatingCastToUint8 OutputStage;
+
+  OutputStageEvalBufferImpl(const OutputStage&) {}
+
+  OutputType Eval(InputType input) const {
+    OutputType output;
+    int16x8_t res_16[4];
     for (int i = 0; i < 4; i++) {
-      output.data.val[i] =
-          impl_int32x4.Eval(input.data.val[i], row + 4 * i, col);
+      res_16[i] = vcombine_s16(vqmovn_s32(input.reg[2 * i]),
+                               vqmovn_s32(input.reg[2 * i + 1]));
+    }
+    for (int i = 0; i < 4; i++) {
+      output.reg[i] = vqmovun_s16(res_16[i]);
     }
     return output;
   }
-
-  ImplInt32x4 impl_int32x4;
 };
 
-// Implementation of OutputStageQuantizeDownInt32ToUint8Scale for
-// NEONFragmentInt32x4x1
-template <>
-struct OutputStageEvalImpl<OutputStageQuantizeDownInt32ToUint8Scale,
-                           NEONFragmentInt32x4x1> {
-  typedef NEONFragmentInt32x4x1 InputType;
-  typedef NEONFragmentInt32x4x1 OutputType;
-  typedef OutputStageQuantizeDownInt32ToUint8Scale OutputStage;
-
-  OutputStageEvalImpl(const OutputStage& s) : output_stage(s) {}
-
-  OutputType Eval(InputType input, int, int) const {
-    const std::int32_t result_shift = output_stage.result_shift;
-    const std::int32_t result_mult_int = output_stage.result_mult_int;
-    const std::int32_t result_offset = output_stage.result_offset;
-    const std::int32_t preshift_offset =
-        (result_shift < 1) ? 0 : (1 << (result_shift - 1));
-    const int32x4_t a = vaddq_s32(input, vdupq_n_s32(result_offset));
-    const int32x4_t b =
-        vmlaq_n_s32(vdupq_n_s32(preshift_offset), a, result_mult_int);
-    return vshlq_s32(b, vdupq_n_s32(-result_shift));
-  }
-
-  const OutputStage& output_stage;
-};
-
-// Implementation of OutputStageQuantizeDownInt32ToUint8ScalePC for
-// NEONFragmentInt32x4x1
-template <>
-struct OutputStageEvalImpl<
-    OutputStageQuantizeDownInt32ToUint8ScalePC<VectorShape::Col>,
-    NEONFragmentInt32x4x1> {
-  typedef NEONFragmentInt32x4x1 InputType;
-  typedef NEONFragmentInt32x4x1 OutputType;
-  typedef OutputStageQuantizeDownInt32ToUint8ScalePC<VectorShape::Col>
-      OutputStage;
-
-  OutputStageEvalImpl(const OutputStage& s) : output_stage(s) {}
-
-  OutputType Eval(InputType input, int row, int col) const {
-    const std::int32_t result_shift = output_stage.result_shift;
-    const std::int32_t preshift_offset =
-        (result_shift < 1) ? 0 : (1 << (result_shift - 1));
-    const int32x4_t result_mult_int =
-        vld1q_s32(output_stage.result_mult_int.data(row));
-    const int32x4_t result_offset =
-        vld1q_s32(output_stage.result_offset.data(row));
-    const int32x4_t a = vaddq_s32(input, result_offset);
-    const int32x4_t b =
-        vmlaq_s32(vdupq_n_s32(preshift_offset), a, result_mult_int);
-    return vshlq_s32(b, vdupq_n_s32(-result_shift));
-  }
-
-  const OutputStage& output_stage;
-};
-
-// Implementation of OutputStageQuantizeDownInt32ToUint8ScalePC for
-// NEONFragmentInt32x4x1
-template <>
-struct OutputStageEvalImpl<
-    OutputStageQuantizeDownInt32ToUint8ScalePC<VectorShape::Row>,
-    NEONFragmentInt32x4x1> {
-  typedef NEONFragmentInt32x4x1 InputType;
-  typedef NEONFragmentInt32x4x1 OutputType;
-  typedef OutputStageQuantizeDownInt32ToUint8ScalePC<VectorShape::Row>
-      OutputStage;
-
-  OutputStageEvalImpl(const OutputStage& s) : output_stage(s) {}
-
-  OutputType Eval(InputType input, int row, int col) const {
-    const std::int32_t result_shift = output_stage.result_shift;
-    const std::int32_t preshift_offset =
-        (result_shift < 1) ? 0 : (1 << (result_shift - 1));
-    const int32x4_t result_mult_int =
-        vld1q_s32(output_stage.result_mult_int.data(col));
-    const int32x4_t result_offset =
-        vld1q_s32(output_stage.result_offset.data(row));
-    const int32x4_t a = vaddq_s32(input, result_offset);
-    const int32x4_t b =
-        vmlaq_s32(vdupq_n_s32(preshift_offset), a, result_mult_int);
-    return vshlq_s32(b, vdupq_n_s32(-result_shift));
-  }
-
-  const OutputStage& output_stage;
-};
-
-// Implementation of OutputStageSaturatingCastToUint8 for NEONFragmentInt32x4x1
-template <>
-struct OutputStageEvalImpl<OutputStageSaturatingCastToUint8,
-                           NEONFragmentInt32x4x1> {
-  typedef NEONFragmentInt32x4x1 InputType;
-  typedef NEONFragmentUint8x4x1 OutputType;
-  typedef OutputStageSaturatingCastToUint8 OutputStage;
-
-  OutputStageEvalImpl(const OutputStage&) {}
-
-  OutputType Eval(InputType input, int, int) const {
-    int16x8_t q16 = vcombine_s16(vqmovn_s32(input), vdup_n_s16(0));
-    return vqmovun_s16(q16);
-  }
-};
-
-// In the case of OutputStageSaturatingCastToUint8, the handling of
-// NEONFragmentInt32x16x1 data can be made much more efficient by handling
-// it all at once, instead of as 4 separate int32x4 values as in the above
-// generic partial specialization. This also avoids the poor (50%) register
-// utilization of FragmentUint8x4x1: by handling 16 scalar values at once,
-// we are able to fill a uint8x16_t.
-template <>
-struct OutputStageEvalImpl<OutputStageSaturatingCastToUint8,
-                           NEONFragmentInt32x16x1> {
-  typedef NEONFragmentInt32x16x1 InputType;
-  typedef NEONFragmentUint8x16x1 OutputType;
-  typedef OutputStageSaturatingCastToUint8 OutputStage;
-
-  OutputStageEvalImpl(const OutputStage&) {}
-
-  OutputType Eval(InputType input, int, int) const {
-    int16x8_t q16[2];
-    for (int i = 0; i < 2; i++) {
-      q16[i] = vcombine_s16(vqmovn_s32(input.data.val[2 * i]),
-                            vqmovn_s32(input.data.val[2 * i + 1]));
-    }
-    return vcombine_u8(vqmovun_s16(q16[0]), vqmovun_s16(q16[1]));
-  }
-};
-
-// Implementation of OutputStageBiasAddition for NEONFragmentInt32x4x1
-template <typename VectorType>
-struct OutputStageEvalImpl<OutputStageBiasAddition<VectorType>,
-                           NEONFragmentInt32x4x1> {
-  typedef NEONFragmentInt32x4x1 InputType;
-  typedef NEONFragmentInt32x4x1 OutputType;
-  typedef OutputStageBiasAddition<VectorType> OutputStage;
-
-  OutputStageEvalImpl(const OutputStage& s) : output_stage(s) {}
-
-  OutputType Eval(InputType input, int row, int col) const {
-    int32x4_t bias;
-    if (VectorType::kShape == VectorShape::Row) {
-      bias = vdupq_n_s32(output_stage.bias_vector(col));
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockInt32<8, 1>, DstType> {
+  static void Run(const RegBlockInt32<8, 1>& src, DstType* dst, int row,
+                  int col) {
+    if (DstType::kOrder == MapOrder::ColMajor) {
+      StoreInt32x4(dst->data(row, col), src.buf.reg[0]);
+      StoreInt32x4(dst->data(row + 4, col), src.buf.reg[1]);
     } else {
-      bias = vld1q_s32(output_stage.bias_vector.data(row));
+      *dst->data(row + 0, col) = GetLane<0>(src.buf.reg[0]);
+      *dst->data(row + 1, col) = GetLane<1>(src.buf.reg[0]);
+      *dst->data(row + 2, col) = GetLane<2>(src.buf.reg[0]);
+      *dst->data(row + 3, col) = GetLane<3>(src.buf.reg[0]);
+      *dst->data(row + 4, col) = GetLane<0>(src.buf.reg[1]);
+      *dst->data(row + 5, col) = GetLane<1>(src.buf.reg[1]);
+      *dst->data(row + 6, col) = GetLane<2>(src.buf.reg[1]);
+      *dst->data(row + 7, col) = GetLane<3>(src.buf.reg[1]);
     }
-    return vaddq_s32(input, bias);
   }
-
-  const OutputStage& output_stage;
 };
 
-// Implementation of OutputStageClamp for NEONFragmentInt32x4x1
-template <>
-struct OutputStageEvalImpl<OutputStageClamp, NEONFragmentInt32x4x1> {
-  typedef NEONFragmentInt32x4x1 InputType;
-  typedef NEONFragmentInt32x4x1 OutputType;
-  typedef OutputStageClamp OutputStage;
+inline RegBlockInt32<4, 4> Transpose(const RegBlockInt32<4, 4>& src) {
+  const int32x4x2_t t0 = vtrnq_s32(src.buf.reg[0], src.buf.reg[1]);
+  const int32x4x2_t t1 = vtrnq_s32(src.buf.reg[2], src.buf.reg[3]);
+  RegBlockInt32<4, 4> result;
+  result.buf.reg[0] =
+      vcombine_s32(vget_low_s32(t0.val[0]), vget_low_s32(t1.val[0]));
+  result.buf.reg[1] =
+      vcombine_s32(vget_low_s32(t0.val[1]), vget_low_s32(t1.val[1]));
+  result.buf.reg[2] =
+      vcombine_s32(vget_high_s32(t0.val[0]), vget_high_s32(t1.val[0]));
+  result.buf.reg[3] =
+      vcombine_s32(vget_high_s32(t0.val[1]), vget_high_s32(t1.val[1]));
+  return result;
+}
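
For readers less familiar with the vtrnq_s32 / vget_low_s32 / vget_high_s32 / vcombine_s32 idiom, the function above is simply a 4x4 transpose carried out on four int32x4 registers at once. A scalar sketch of the same operation (plain arrays and an invented name, not the RegBlockInt32 type):

```
#include <cstdint>

// Scalar equivalent of the NEON Transpose above: swap rows and columns of a
// 4x4 block of int32 values.
void Transpose4x4(const std::int32_t in[4][4], std::int32_t out[4][4]) {
  for (int r = 0; r < 4; r++) {
    for (int c = 0; c < 4; c++) {
      out[c][r] = in[r][c];
    }
  }
}
```
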
 
-  OutputStageEvalImpl(const OutputStage& s) : output_stage(s) {}
-
-  OutputType Eval(InputType input, int, int) const {
-    const int32x4_t min = vdupq_n_s32(output_stage.min);
-    const int32x4_t max = vdupq_n_s32(output_stage.max);
-    return vminq_s32(vmaxq_s32(input, min), max);
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockInt32<4, 4>, DstType> {
+  static void Run(const RegBlockInt32<4, 4>& src, DstType* dst, int row,
+                  int col) {
+    const auto& block =
+        DstType::kOrder == MapOrder::ColMajor ? src : Transpose(src);
+    std::int32_t* dst_ptr = dst->data(row, col);
+    int stride = dst->stride();
+    for (int i = 0; i < 4; i++) {
+      vst1q_s32(dst_ptr + i * stride, block.buf.reg[i]);
+    }
   }
-
-  const OutputStage& output_stage;
 };
 
-// Implementation of OutputStageTanh for NEONFragmentInt32x4x1
-template <>
-struct OutputStageEvalImpl<OutputStageTanh, NEONFragmentInt32x4x1>
-    : OutputStageTanhEvalImpl<NEONFragmentInt32x4x1> {
-  OutputStageEvalImpl(const OutputStageTanh& output_stage)
-      : OutputStageTanhEvalImpl(output_stage) {}
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockInt32<8, 4>, DstType> {
+  static void Run(const RegBlockInt32<8, 4>& src, DstType* dst, int row,
+                  int col) {
+    std::int32_t* dst_ptr = dst->data(row, col);
+    if (DstType::kOrder == MapOrder::ColMajor) {
+      int col_stride = dst->cols_stride();
+      for (int i = 0; i < 4; i++) {
+        vst1q_s32(dst_ptr + i * col_stride + 0, src.buf.reg[2 * i + 0]);
+        vst1q_s32(dst_ptr + i * col_stride + 4, src.buf.reg[2 * i + 1]);
+      }
+    } else {
+      int row_stride = dst->rows_stride();
+      RegBlockInt32<4, 4> top;
+      top.buf.reg[0] = src.buf.reg[0];
+      top.buf.reg[1] = src.buf.reg[2];
+      top.buf.reg[2] = src.buf.reg[4];
+      top.buf.reg[3] = src.buf.reg[6];
+      const auto transpose_top = Transpose(top);
+      for (int i = 0; i < 4; i++) {
+        vst1q_s32(dst_ptr + i * row_stride, transpose_top.buf.reg[i]);
+      }
+      RegBlockInt32<4, 4> bottom;
+      bottom.buf.reg[0] = src.buf.reg[1];
+      bottom.buf.reg[1] = src.buf.reg[3];
+      bottom.buf.reg[2] = src.buf.reg[5];
+      bottom.buf.reg[3] = src.buf.reg[7];
+      const auto transpose_bottom = Transpose(bottom);
+      for (int i = 0; i < 4; i++) {
+        vst1q_s32(dst_ptr + (i + 4) * row_stride, transpose_bottom.buf.reg[i]);
+      }
+    }
+  }
 };
 
-// Specialization of StoreFinalOutput for NEONFragmentUint8x4x1.
-// This is quite inefficient, but we have no choice: instructions storing 32bit
-// at once also assume 32bit alignment. In practice, this slowness is not a
-// problem because we use the x16 path for most values.
 template <typename DstType>
-inline void StoreFinalOutput(NEONFragmentUint8x4x1 value, DstType* dst, int row,
-                             int col) {
-  vst1_lane_u8(dst->data(row + 0, col), value, 0);
-  vst1_lane_u8(dst->data(row + 1, col), value, 1);
-  vst1_lane_u8(dst->data(row + 2, col), value, 2);
-  vst1_lane_u8(dst->data(row + 3, col), value, 3);
-}
-
-// Specialization of StoreFinalOutput for NEONFragmentUint8x16x1.
-template <typename DstType>
-inline void StoreFinalOutput(NEONFragmentUint8x16x1 value, DstType* dst,
-                             int row, int col) {
-  vst1q_u8(dst->data(row, col), value);
-}
-
-// Specialization of StoreFinalOutput for NEONFragmentInt32x4x1, storing into a
-// int32 destination.
-template <typename DstType>
-inline void StoreFinalOutput(NEONFragmentInt32x4x1 value, DstType* dst, int row,
-                             int col) {
-  vst1q_s32(dst->data(row, col), value);
-}
-
-// Specialization of StoreFinalOutput for NEONFragmentInt32x16x1, storing into
-// a int32 destination.
-template <typename DstType>
-inline void StoreFinalOutput(NEONFragmentInt32x16x1 value, DstType* dst,
-                             int row, int col) {
-  for (int i = 0; i < 4; i++) {
-    vst1q_s32(dst->data(row + 4 * i, col), value.data.val[i]);
+struct StoreFinalOutputImpl<RegBlockInt32<8, 8>, DstType> {
+  static void Run(const RegBlockInt32<8, 8>& src, DstType* dst, int row,
+                  int col) {
+    std::int32_t* dst_ptr = dst->data(row, col);
+    if (DstType::kOrder == MapOrder::ColMajor) {
+      int col_stride = dst->cols_stride();
+      for (int i = 0; i < 8; i++) {
+        vst1q_s32(dst_ptr + i * col_stride, src.buf.reg[2 * i]);
+        vst1q_s32(dst_ptr + i * col_stride + 4, src.buf.reg[2 * i + 1]);
+      }
+    } else {
+      int row_stride = dst->rows_stride();
+      RegBlockInt32<4, 4> top_left;
+      top_left.buf.reg[0] = src.buf.reg[0];
+      top_left.buf.reg[1] = src.buf.reg[2];
+      top_left.buf.reg[2] = src.buf.reg[4];
+      top_left.buf.reg[3] = src.buf.reg[6];
+      const auto transpose_top_left = Transpose(top_left);
+      for (int i = 0; i < 4; i++) {
+        vst1q_s32(dst_ptr + i * row_stride, transpose_top_left.buf.reg[i]);
+      }
+      RegBlockInt32<4, 4> bottom_left;
+      bottom_left.buf.reg[0] = src.buf.reg[1];
+      bottom_left.buf.reg[1] = src.buf.reg[3];
+      bottom_left.buf.reg[2] = src.buf.reg[5];
+      bottom_left.buf.reg[3] = src.buf.reg[7];
+      const auto transpose_bottom_left = Transpose(bottom_left);
+      for (int i = 0; i < 4; i++) {
+        vst1q_s32(dst_ptr + (i + 4) * row_stride,
+                  transpose_bottom_left.buf.reg[i]);
+      }
+      RegBlockInt32<4, 4> top_right;
+      top_right.buf.reg[0] = src.buf.reg[8];
+      top_right.buf.reg[1] = src.buf.reg[10];
+      top_right.buf.reg[2] = src.buf.reg[12];
+      top_right.buf.reg[3] = src.buf.reg[14];
+      const auto transpose_top_right = Transpose(top_right);
+      for (int i = 0; i < 4; i++) {
+        vst1q_s32(dst_ptr + i * row_stride + 4, transpose_top_right.buf.reg[i]);
+      }
+      RegBlockInt32<4, 4> bottom_right;
+      bottom_right.buf.reg[0] = src.buf.reg[9];
+      bottom_right.buf.reg[1] = src.buf.reg[11];
+      bottom_right.buf.reg[2] = src.buf.reg[13];
+      bottom_right.buf.reg[3] = src.buf.reg[15];
+      const auto transpose_bottom_right = Transpose(bottom_right);
+      for (int i = 0; i < 4; i++) {
+        vst1q_s32(dst_ptr + (i + 4) * row_stride + 4,
+                  transpose_bottom_right.buf.reg[i]);
+      }
+    }
   }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockInt32<4, 1>, DstType> {
+  static void Run(const RegBlockInt32<4, 1>& src, DstType* dst, int row,
+                  int col) {
+    std::int32_t* dst_ptr = dst->data(row, col);
+    if (DstType::kOrder == MapOrder::ColMajor) {
+      vst1q_s32(dst_ptr, src.buf.reg[0]);
+    } else {
+      int row_stride = dst->rows_stride();
+      vst1q_lane_s32(dst_ptr + 0 * row_stride, src.buf.reg[0], 0);
+      vst1q_lane_s32(dst_ptr + 1 * row_stride, src.buf.reg[0], 1);
+      vst1q_lane_s32(dst_ptr + 2 * row_stride, src.buf.reg[0], 2);
+      vst1q_lane_s32(dst_ptr + 3 * row_stride, src.buf.reg[0], 3);
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockInt32<1, 4>, DstType> {
+  static void Run(const RegBlockInt32<1, 4>& src, DstType* dst, int row,
+                  int col) {
+    std::int32_t* dst_ptr = dst->data(row, col);
+    if (DstType::kOrder == MapOrder::RowMajor) {
+      vst1q_s32(dst_ptr, src.buf.reg[0]);
+    } else {
+      int col_stride = dst->cols_stride();
+      vst1q_lane_s32(dst_ptr + 0 * col_stride, src.buf.reg[0], 0);
+      vst1q_lane_s32(dst_ptr + 1 * col_stride, src.buf.reg[0], 1);
+      vst1q_lane_s32(dst_ptr + 2 * col_stride, src.buf.reg[0], 2);
+      vst1q_lane_s32(dst_ptr + 3 * col_stride, src.buf.reg[0], 3);
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockUint8<4, 1>, DstType> {
+  static void Run(const RegBlockUint8<4, 1>& src, DstType* dst, int row,
+                  int col) {
+    const std::uint32_t src_reg = src.buf.reg[0];
+    for (int i = 0; i < 4; i++) {
+      *dst->data(row + i, col) = (src_reg >> (8 * i));
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockUint8<1, 4>, DstType> {
+  static void Run(const RegBlockUint8<1, 4>& src, DstType* dst, int row,
+                  int col) {
+    for (int i = 0; i < 4; i++) {
+      *dst->data(row, col + i) = (src.buf.reg[0] >> (8 * i));
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockUint8<8, 1>, DstType> {
+  static void Run(const RegBlockUint8<8, 1>& src, DstType* dst, int row,
+                  int col) {
+    std::uint8_t* dst_ptr = dst->data(row, col);
+    if (DstType::kOrder == MapOrder::ColMajor) {
+      vst1_u8(dst_ptr, src.buf.reg[0]);
+    } else {
+      const int row_stride = dst->rows_stride();
+      vst1_lane_u8(dst_ptr + 0 * row_stride, src.buf.reg[0], 0);
+      vst1_lane_u8(dst_ptr + 1 * row_stride, src.buf.reg[0], 1);
+      vst1_lane_u8(dst_ptr + 2 * row_stride, src.buf.reg[0], 2);
+      vst1_lane_u8(dst_ptr + 3 * row_stride, src.buf.reg[0], 3);
+      vst1_lane_u8(dst_ptr + 4 * row_stride, src.buf.reg[0], 4);
+      vst1_lane_u8(dst_ptr + 5 * row_stride, src.buf.reg[0], 5);
+      vst1_lane_u8(dst_ptr + 6 * row_stride, src.buf.reg[0], 6);
+      vst1_lane_u8(dst_ptr + 7 * row_stride, src.buf.reg[0], 7);
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockUint8<4, 4>, DstType> {
+  static void Run(const RegBlockUint8<4, 4>& src, DstType* dst, int row,
+                  int col) {
+    std::uint8_t* dst_ptr = dst->data(row, col);
+    const int row_stride = dst->rows_stride();
+    const int col_stride = dst->cols_stride();
+    for (int i = 0; i < 2; i++) {
+      vst1_lane_u8(dst_ptr + 0 * row_stride + (2 * i + 0) * col_stride,
+                   src.buf.reg[i], 0);
+      vst1_lane_u8(dst_ptr + 1 * row_stride + (2 * i + 0) * col_stride,
+                   src.buf.reg[i], 1);
+      vst1_lane_u8(dst_ptr + 2 * row_stride + (2 * i + 0) * col_stride,
+                   src.buf.reg[i], 2);
+      vst1_lane_u8(dst_ptr + 3 * row_stride + (2 * i + 0) * col_stride,
+                   src.buf.reg[i], 3);
+      vst1_lane_u8(dst_ptr + 0 * row_stride + (2 * i + 1) * col_stride,
+                   src.buf.reg[i], 4);
+      vst1_lane_u8(dst_ptr + 1 * row_stride + (2 * i + 1) * col_stride,
+                   src.buf.reg[i], 5);
+      vst1_lane_u8(dst_ptr + 2 * row_stride + (2 * i + 1) * col_stride,
+                   src.buf.reg[i], 6);
+      vst1_lane_u8(dst_ptr + 3 * row_stride + (2 * i + 1) * col_stride,
+                   src.buf.reg[i], 7);
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockUint8<8, 4>, DstType> {
+  static void Run(const RegBlockUint8<8, 4>& src, DstType* dst, int row,
+                  int col) {
+    std::uint8_t* dst_ptr = dst->data(row, col);
+    if (DstType::kOrder == MapOrder::ColMajor) {
+      int col_stride = dst->cols_stride();
+      for (int i = 0; i < 4; i++) {
+        vst1_u8(dst_ptr + i * col_stride, src.buf.reg[i]);
+      }
+    } else {
+      for (int i = 0; i < 4; i++) {
+        int row_stride = dst->rows_stride();
+        std::uint8_t* col_ptr = dst_ptr + i;
+        vst1_lane_u8(col_ptr + 0 * row_stride, src.buf.reg[i], 0);
+        vst1_lane_u8(col_ptr + 1 * row_stride, src.buf.reg[i], 1);
+        vst1_lane_u8(col_ptr + 2 * row_stride, src.buf.reg[i], 2);
+        vst1_lane_u8(col_ptr + 3 * row_stride, src.buf.reg[i], 3);
+        vst1_lane_u8(col_ptr + 4 * row_stride, src.buf.reg[i], 4);
+        vst1_lane_u8(col_ptr + 5 * row_stride, src.buf.reg[i], 5);
+        vst1_lane_u8(col_ptr + 6 * row_stride, src.buf.reg[i], 6);
+        vst1_lane_u8(col_ptr + 7 * row_stride, src.buf.reg[i], 7);
+      }
+    }
+  }
+};
+
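+// Transposes an 8x8 block of uint8 values held in registers, using three
+// rounds of pairwise transpositions at 8-bit (vtrn_u8), 16-bit (vtrn_u16) and
+// 32-bit (vtrn_u32) granularity.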
+inline RegBlockUint8<8, 8> Transpose(const RegBlockUint8<8, 8>& src) {
+  uint8x8x2_t a[4];
+  a[0] = vtrn_u8(src.buf.reg[0], src.buf.reg[1]);
+  a[1] = vtrn_u8(src.buf.reg[2], src.buf.reg[3]);
+  a[2] = vtrn_u8(src.buf.reg[4], src.buf.reg[5]);
+  a[3] = vtrn_u8(src.buf.reg[6], src.buf.reg[7]);
+  uint16x4x2_t b[4];
+  b[0] = vtrn_u16(vreinterpret_u16_u8(a[0].val[0]),
+                  vreinterpret_u16_u8(a[1].val[0]));
+  b[1] = vtrn_u16(vreinterpret_u16_u8(a[0].val[1]),
+                  vreinterpret_u16_u8(a[1].val[1]));
+  b[2] = vtrn_u16(vreinterpret_u16_u8(a[2].val[0]),
+                  vreinterpret_u16_u8(a[3].val[0]));
+  b[3] = vtrn_u16(vreinterpret_u16_u8(a[2].val[1]),
+                  vreinterpret_u16_u8(a[3].val[1]));
+  uint32x2x2_t c[4];
+  c[0] = vtrn_u32(vreinterpret_u32_u16(b[0].val[0]),
+                  vreinterpret_u32_u16(b[2].val[0]));
+  c[1] = vtrn_u32(vreinterpret_u32_u16(b[1].val[0]),
+                  vreinterpret_u32_u16(b[3].val[0]));
+  c[2] = vtrn_u32(vreinterpret_u32_u16(b[0].val[1]),
+                  vreinterpret_u32_u16(b[2].val[1]));
+  c[3] = vtrn_u32(vreinterpret_u32_u16(b[1].val[1]),
+                  vreinterpret_u32_u16(b[3].val[1]));
+  RegBlockUint8<8, 8> result;
+  result.buf.reg[0] = vreinterpret_u8_u32(c[0].val[0]);
+  result.buf.reg[1] = vreinterpret_u8_u32(c[1].val[0]);
+  result.buf.reg[2] = vreinterpret_u8_u32(c[2].val[0]);
+  result.buf.reg[3] = vreinterpret_u8_u32(c[3].val[0]);
+  result.buf.reg[4] = vreinterpret_u8_u32(c[0].val[1]);
+  result.buf.reg[5] = vreinterpret_u8_u32(c[1].val[1]);
+  result.buf.reg[6] = vreinterpret_u8_u32(c[2].val[1]);
+  result.buf.reg[7] = vreinterpret_u8_u32(c[3].val[1]);
+  return result;
 }
 
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockUint8<8, 8>, DstType> {
+  static void Run(const RegBlockUint8<8, 8>& src, DstType* dst, int row,
+                  int col) {
+    const auto& block =
+        DstType::kOrder == MapOrder::ColMajor ? src : Transpose(src);
+    std::uint8_t* dst_ptr = dst->data(row, col);
+    int stride = dst->stride();
+    for (int i = 0; i < 8; i++) {
+      vst1_u8(dst_ptr + i * stride, block.buf.reg[i]);
+    }
+  }
+};
+
 }  // namespace gemmlowp
 
 #endif  // GEMMLOWP_INTERNAL_OUTPUT_NEON_H_
diff --git a/internal/output_sse.h b/internal/output_sse.h
new file mode 100644
index 0000000..5c06253
--- /dev/null
+++ b/internal/output_sse.h
@@ -0,0 +1,354 @@
+// Copyright 2015 Google Inc. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// output_sse.h: optimized SSE4.2 specializations of the templates in output.h.
+
+#ifndef GEMMLOWP_INTERNAL_OUTPUT_SSE_H_
+#define GEMMLOWP_INTERNAL_OUTPUT_SSE_H_
+
+#include "output.h"
+
+#include <smmintrin.h>
+
+namespace gemmlowp {
+
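+// In the saturating casts below, _mm_packs_epi32 narrows int32 lanes to int16
+// with signed saturation, then _mm_packus_epi16 narrows to uint8 with unsigned
+// saturation; together these clamp the result to the [0, 255] range.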
+template <>
+struct OutputStageEvalBufferImpl<OutputStageSaturatingCastToUint8,
+                                 RegBufferInt32<4>> {
+  typedef RegBufferInt32<4> InputType;
+  typedef RegBufferUint8<4> OutputType;
+
+  typedef OutputStageSaturatingCastToUint8 OutputStage;
+
+  OutputStageEvalBufferImpl(const OutputStage&) {}
+
+  OutputType Eval(InputType input) const {
+    OutputType output;
+    __m128i res_16 = _mm_packs_epi32(input.reg[0], input.reg[0]);
+    __m128i res_8 = _mm_packus_epi16(res_16, res_16);
+    output.reg[0] = _mm_cvtsi128_si32(res_8);
+    return output;
+  }
+};
+
+template <>
+struct OutputStageEvalBufferImpl<OutputStageSaturatingCastToUint8,
+                                 RegBufferInt32<8>> {
+  typedef RegBufferInt32<8> InputType;
+  typedef RegBufferUint8<8> OutputType;
+
+  typedef OutputStageSaturatingCastToUint8 OutputStage;
+
+  OutputStageEvalBufferImpl(const OutputStage&) {}
+
+  OutputType Eval(InputType input) const {
+    OutputType output;
+    __m128i res_16 = _mm_packs_epi32(input.reg[0], input.reg[1]);
+    __m128i res_8 = _mm_packus_epi16(res_16, res_16);
+    output.reg[0] = _mm_extract_epi32(res_8, 0);
+    output.reg[1] = _mm_extract_epi32(res_8, 1);
+    return output;
+  }
+};
+
+template <>
+struct OutputStageEvalBufferImpl<OutputStageSaturatingCastToUint8,
+                                 RegBufferInt32<16>> {
+  typedef RegBufferInt32<16> InputType;
+  typedef RegBufferUint8<16> OutputType;
+
+  typedef OutputStageSaturatingCastToUint8 OutputStage;
+
+  OutputStageEvalBufferImpl(const OutputStage&) {}
+
+  OutputType Eval(InputType input) const {
+    OutputType output;
+    __m128i res_16_0 = _mm_packs_epi32(input.reg[0], input.reg[1]);
+    __m128i res_16_1 = _mm_packs_epi32(input.reg[2], input.reg[3]);
+    output.reg[0] = _mm_packus_epi16(res_16_0, res_16_1);
+    return output;
+  }
+};
+
+template <>
+struct OutputStageEvalBufferImpl<OutputStageSaturatingCastToUint8,
+                                 RegBufferInt32<32>> {
+  typedef RegBufferInt32<32> InputType;
+  typedef RegBufferUint8<32> OutputType;
+
+  typedef OutputStageSaturatingCastToUint8 OutputStage;
+
+  OutputStageEvalBufferImpl(const OutputStage&) {}
+
+  OutputType Eval(InputType input) const {
+    OutputType output;
+    __m128i res_16_0 = _mm_packs_epi32(input.reg[0], input.reg[1]);
+    __m128i res_16_1 = _mm_packs_epi32(input.reg[2], input.reg[3]);
+    output.reg[0] = _mm_packus_epi16(res_16_0, res_16_1);
+    __m128i res_16_2 = _mm_packs_epi32(input.reg[4], input.reg[5]);
+    __m128i res_16_3 = _mm_packs_epi32(input.reg[6], input.reg[7]);
+    output.reg[1] = _mm_packus_epi16(res_16_2, res_16_3);
+    return output;
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockInt32<4, 1>, DstType> {
+  static void Run(const RegBlockInt32<4, 1>& src, DstType* dst, int row,
+                  int col) {
+    if (DstType::kOrder == MapOrder::ColMajor) {
+      StoreInt32x4(dst->data(row, col), src.buf.reg[0]);
+    } else {
+      *dst->data(row + 0, col) = GetLane<0>(src.buf.reg[0]);
+      *dst->data(row + 1, col) = GetLane<1>(src.buf.reg[0]);
+      *dst->data(row + 2, col) = GetLane<2>(src.buf.reg[0]);
+      *dst->data(row + 3, col) = GetLane<3>(src.buf.reg[0]);
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockInt32<8, 1>, DstType> {
+  static void Run(const RegBlockInt32<8, 1>& src, DstType* dst, int row,
+                  int col) {
+    if (DstType::kOrder == MapOrder::ColMajor) {
+      StoreInt32x4(dst->data(row, col), src.buf.reg[0]);
+      StoreInt32x4(dst->data(row + 4, col), src.buf.reg[1]);
+    } else {
+      *dst->data(row + 0, col) = GetLane<0>(src.buf.reg[0]);
+      *dst->data(row + 1, col) = GetLane<1>(src.buf.reg[0]);
+      *dst->data(row + 2, col) = GetLane<2>(src.buf.reg[0]);
+      *dst->data(row + 3, col) = GetLane<3>(src.buf.reg[0]);
+      *dst->data(row + 4, col) = GetLane<0>(src.buf.reg[1]);
+      *dst->data(row + 5, col) = GetLane<1>(src.buf.reg[1]);
+      *dst->data(row + 6, col) = GetLane<2>(src.buf.reg[1]);
+      *dst->data(row + 7, col) = GetLane<3>(src.buf.reg[1]);
+    }
+  }
+};
+
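+// Transposes a 4x4 block of int32 values: interleave pairs of source registers
+// at 32-bit granularity, then recombine the halves at 64-bit granularity.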
+inline RegBlockInt32<4, 4> Transpose(const RegBlockInt32<4, 4>& src) {
+  __m128i t0 = _mm_unpacklo_epi32(src.buf.reg[0], src.buf.reg[1]);
+  __m128i t1 = _mm_unpacklo_epi32(src.buf.reg[2], src.buf.reg[3]);
+  __m128i t2 = _mm_unpackhi_epi32(src.buf.reg[0], src.buf.reg[1]);
+  __m128i t3 = _mm_unpackhi_epi32(src.buf.reg[2], src.buf.reg[3]);
+
+  RegBlockInt32<4, 4> result;
+  result.buf.reg[0] = _mm_unpacklo_epi64(t0, t1);
+  result.buf.reg[1] = _mm_unpackhi_epi64(t0, t1);
+  result.buf.reg[2] = _mm_unpacklo_epi64(t2, t3);
+  result.buf.reg[3] = _mm_unpackhi_epi64(t2, t3);
+  return result;
+}
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockInt32<4, 4>, DstType> {
+  static void Run(const RegBlockInt32<4, 4>& src, DstType* dst, int row,
+                  int col) {
+    if (DstType::kOrder == MapOrder::ColMajor) {
+      for (int i = 0; i < 4; i++) {
+        StoreInt32x4(dst->data(row, col + i), src.buf.reg[i]);
+      }
+    } else {
+      const auto transpose = Transpose(src);
+      for (int i = 0; i < 4; i++) {
+        StoreInt32x4(dst->data(row + i, col), transpose.buf.reg[i]);
+      }
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockInt32<8, 4>, DstType> {
+  static void Run(const RegBlockInt32<8, 4>& src, DstType* dst, int row,
+                  int col) {
+    if (DstType::kOrder == MapOrder::ColMajor) {
+      for (int i = 0; i < 4; i++) {
+        StoreInt32x4(dst->data(row, col + i), src.buf.reg[2 * i]);
+        StoreInt32x4(dst->data(row + 4, col + i), src.buf.reg[2 * i + 1]);
+      }
+    } else {
+      RegBlockInt32<4, 4> top;
+      top.buf.reg[0] = src.buf.reg[0];
+      top.buf.reg[1] = src.buf.reg[2];
+      top.buf.reg[2] = src.buf.reg[4];
+      top.buf.reg[3] = src.buf.reg[6];
+      const auto transpose_top = Transpose(top);
+      for (int i = 0; i < 4; i++) {
+        StoreInt32x4(dst->data(row + i, col), transpose_top.buf.reg[i]);
+      }
+      RegBlockInt32<4, 4> bottom;
+      bottom.buf.reg[0] = src.buf.reg[1];
+      bottom.buf.reg[1] = src.buf.reg[3];
+      bottom.buf.reg[2] = src.buf.reg[5];
+      bottom.buf.reg[3] = src.buf.reg[7];
+      const auto transpose_bottom = Transpose(bottom);
+      for (int i = 0; i < 4; i++) {
+        StoreInt32x4(dst->data(row + 4 + i, col), transpose_bottom.buf.reg[i]);
+      }
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockInt32<8, 8>, DstType> {
+  static void Run(const RegBlockInt32<8, 8>& src, DstType* dst, int row,
+                  int col) {
+    if (DstType::kOrder == MapOrder::ColMajor) {
+      for (int i = 0; i < 8; i++) {
+        StoreInt32x4(dst->data(row, col + i), src.buf.reg[2 * i]);
+        StoreInt32x4(dst->data(row + 4, col + i), src.buf.reg[2 * i + 1]);
+      }
+    } else {
+      RegBlockInt32<4, 4> top_left;
+      top_left.buf.reg[0] = src.buf.reg[0];
+      top_left.buf.reg[1] = src.buf.reg[2];
+      top_left.buf.reg[2] = src.buf.reg[4];
+      top_left.buf.reg[3] = src.buf.reg[6];
+      const auto transpose_top_left = Transpose(top_left);
+      for (int i = 0; i < 4; i++) {
+        StoreInt32x4(dst->data(row + i, col), transpose_top_left.buf.reg[i]);
+      }
+      RegBlockInt32<4, 4> bottom_left;
+      bottom_left.buf.reg[0] = src.buf.reg[1];
+      bottom_left.buf.reg[1] = src.buf.reg[3];
+      bottom_left.buf.reg[2] = src.buf.reg[5];
+      bottom_left.buf.reg[3] = src.buf.reg[7];
+      const auto transpose_bottom_left = Transpose(bottom_left);
+      for (int i = 0; i < 4; i++) {
+        StoreInt32x4(dst->data(row + 4 + i, col),
+                     transpose_bottom_left.buf.reg[i]);
+      }
+      RegBlockInt32<4, 4> top_right;
+      top_right.buf.reg[0] = src.buf.reg[8];
+      top_right.buf.reg[1] = src.buf.reg[10];
+      top_right.buf.reg[2] = src.buf.reg[12];
+      top_right.buf.reg[3] = src.buf.reg[14];
+      const auto transpose_top_right = Transpose(top_right);
+      for (int i = 0; i < 4; i++) {
+        StoreInt32x4(dst->data(row + i, col + 4),
+                     transpose_top_right.buf.reg[i]);
+      }
+      RegBlockInt32<4, 4> bottom_right;
+      bottom_right.buf.reg[0] = src.buf.reg[9];
+      bottom_right.buf.reg[1] = src.buf.reg[11];
+      bottom_right.buf.reg[2] = src.buf.reg[13];
+      bottom_right.buf.reg[3] = src.buf.reg[15];
+      const auto transpose_bottom_right = Transpose(bottom_right);
+      for (int i = 0; i < 4; i++) {
+        StoreInt32x4(dst->data(row + 4 + i, col + 4),
+                     transpose_bottom_right.buf.reg[i]);
+      }
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockInt32<1, 4>, DstType> {
+  static void Run(const RegBlockInt32<1, 4>& src, DstType* dst, int row,
+                  int col) {
+    if (DstType::kOrder == MapOrder::ColMajor) {
+      *dst->data(row, col + 0) = GetLane<0>(src.buf.reg[0]);
+      *dst->data(row, col + 1) = GetLane<1>(src.buf.reg[0]);
+      *dst->data(row, col + 2) = GetLane<2>(src.buf.reg[0]);
+      *dst->data(row, col + 3) = GetLane<3>(src.buf.reg[0]);
+    } else {
+      StoreInt32x4(dst->data(row, col), src.buf.reg[0]);
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockUint8<4, 1>, DstType> {
+  static void Run(const RegBlockUint8<4, 1>& src, DstType* dst, int row,
+                  int col) {
+    const std::uint32_t src_reg = src.buf.reg[0];
+    for (int i = 0; i < 4; i++) {
+      *dst->data(row + i, col) = (src_reg >> (8 * i));
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockUint8<8, 1>, DstType> {
+  static void Run(const RegBlockUint8<8, 1>& src, DstType* dst, int row,
+                  int col) {
+    for (int i = 0; i < 4; i++) {
+      *dst->data(row + i, col) = (src.buf.reg[0] >> (8 * i));
+    }
+    for (int i = 0; i < 4; i++) {
+      *dst->data(row + 4 + i, col) = (src.buf.reg[1] >> (8 * i));
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockUint8<1, 4>, DstType> {
+  static void Run(const RegBlockUint8<1, 4>& src, DstType* dst, int row,
+                  int col) {
+    for (int i = 0; i < 4; i++) {
+      *dst->data(row, col + i) = (src.buf.reg[0] >> (8 * i));
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockUint8<4, 4>, DstType> {
+  static void Run(const RegBlockUint8<4, 4>& src, DstType* dst, int row,
+                  int col) {
+    std::uint8_t buf[16];
+    StoreUint8x16(buf, src.buf.reg[0]);
+    for (int c = 0; c < 4; c++) {
+      for (int r = 0; r < 4; r++) {
+        *dst->data(row + r, col + c) = buf[r + 4 * c];
+      }
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockUint8<8, 4>, DstType> {
+  static void Run(const RegBlockUint8<8, 4>& src, DstType* dst, int row,
+                  int col) {
+    std::uint8_t buf[32];
+    StoreUint8x16(buf, src.buf.reg[0]);
+    StoreUint8x16(buf + 16, src.buf.reg[1]);
+    for (int c = 0; c < 4; c++) {
+      for (int r = 0; r < 8; r++) {
+        *dst->data(row + r, col + c) = buf[r + 8 * c];
+      }
+    }
+  }
+};
+
+template <typename DstType>
+struct StoreFinalOutputImpl<RegBlockUint8<8, 8>, DstType> {
+  static void Run(const RegBlockUint8<8, 8>& src, DstType* dst, int row,
+                  int col) {
+    std::uint8_t buf[64];
+    StoreUint8x16(buf, src.buf.reg[0]);
+    StoreUint8x16(buf + 16, src.buf.reg[1]);
+    StoreUint8x16(buf + 32, src.buf.reg[2]);
+    StoreUint8x16(buf + 48, src.buf.reg[3]);
+    for (int c = 0; c < 8; c++) {
+      for (int r = 0; r < 8; r++) {
+        *dst->data(row + r, col + c) = buf[r + 8 * c];
+      }
+    }
+  }
+};
+
+}  // namespace gemmlowp
+
+#endif  // GEMMLOWP_INTERNAL_OUTPUT_SSE_H_
diff --git a/internal/pack.h b/internal/pack.h
index 4531f79..3395396 100644
--- a/internal/pack.h
+++ b/internal/pack.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -29,7 +29,6 @@
 
 #include <cstring>
 
-#include "../public/bit_depth.h"
 #include "allocator.h"
 #include "block_params.h"
 #include "common.h"
@@ -51,8 +50,7 @@
 
   PackedSideBlock(Side side, Allocator* allocator,
                   const BlockParams& block_params)
-      : allocator_(allocator),
-        pos_(0) {
+      : allocator_(allocator), pos_(0) {
     GetSideBlockParams(side, &params_, block_params);
     data_handle_ =
         allocator_->Reserve<std::uint8_t>(params_.l2_width * params_.l2_depth);
@@ -189,94 +187,6 @@
   int width_, depth_, stride_;
 };
 
-template <RoundingMode tRoundingMode>
-class ScalarRoundingOffsetGenerator {
- public:
-  std::uint8_t get() {
-    assert(false);  // This generic path should never be called.
-    return 0;
-  }
-};
-
-// A RoundingOffsetGenerator for rounding-to-nearest, always returning
-// the midpoint value 127.
-template <>
-class ScalarRoundingOffsetGenerator<RoundingMode::Nearest> {
- public:
-  std::uint8_t get() { return 127; }
-};
-
-// A RoundingOffsetGenerator based on a 8-bit Xorshift.
-// This gives good results as Xorshift naturally generates
-// uniform random *nonzero* bytes i.e. 255 different values,
-// so it only remains for us to subtract one.
-template <>
-class ScalarRoundingOffsetGenerator<RoundingMode::ProbabilisticXorshift> {
- public:
-  ScalarRoundingOffsetGenerator() { x_ = 128; }
-
-  std::uint8_t get() {
-    std::uint8_t result = x_ - 1;
-    // Xorshift8(7,5,3)
-    x_ ^= x_ << 7;
-    x_ ^= x_ >> 5;
-    x_ ^= x_ << 3;
-    return result;
-  }
-
- private:
-  // State
-  std::uint8_t x_;
-};
-
-// A RoundingOffsetGenerator based on an 8-bit add/mod
-// low-discrepancy sequence.  See less-than-8-bit.txt for
-// an explanation (the constant 97 is important - it must
-// be both relatively prime to 255, in order for the sequence
-// to be full-period, and c/255 should be close to 0.38 to
-// obtain low discrepancy).  Uses a small bit hack to avoid
-// expensive % operations.
-template <>
-class ScalarRoundingOffsetGenerator<RoundingMode::ProbabilisticAddmod> {
-  static const uint8_t AddConst = 97;
-
- public:
-  ScalarRoundingOffsetGenerator() { x_ = 1; }  // Start must be non-zero
-
-  std::uint8_t get() {
-    // The +'d boolean term causes the increment to skip over 255,
-    // (recalling that 255+1 = 256 = 0 for an 8 bit uint),
-    // thus implementing %255
-    x_ += (AddConst + (x_ >= (255 - AddConst)));
-    return x_;
-  }
-
- private:
-  // State
-  std::uint8_t x_;
-};
-
-// Requantizes a source uint8 value in [0..255] range
-// to the range specified by BitDepth, [0..((2^bits)-1)].
-// Bias must be avoided. Currently this is achieved
-// by probabilistic rounding.
-template <typename QuantizationParams>
-std::uint8_t Requantize(
-    std::uint8_t raw_src_val,
-    ScalarRoundingOffsetGenerator<QuantizationParams::kRoundingMode>*
-        rounding_offset_generator) {
-  static const int kBits = QuantizationParams::BitDepth::kBits;
-  static const std::uint8_t kMaxVal = (1 << kBits) - 1;
-
-  if (kBits == 8) {
-    return raw_src_val;
-  }
-
-  std::uint16_t scaled = static_cast<std::uint16_t>(raw_src_val) * kMaxVal;
-  std::uint8_t rounding_offset = rounding_offset_generator->get();
-  return (scaled + rounding_offset) / 255;
-}
-
 // A PackingRegisterBlock is a small fixed-size block of a matrix being
 // packed. This class is the generic non-optimized implementation,
 // it is inherited by the generic implementation of PackingRegisterBlock,
@@ -293,21 +203,20 @@
 //   2. Packing a complete block into the destination, see Pack. This is the
 //      most critical part, so it's convenient that unaligned boundaries have
 //      already been handled in step 1.
-template <typename QuantizationParams, typename SrcMapType,
-          typename PackedSideBlock>
+template <typename SrcMapType, typename PackedSideBlock>
 class PackingRegisterBlockBase {
  public:
   typedef typename PackedSideBlock::KernelSideFormat KernelSideFormat;
   typedef typename KernelSideFormat::Cell CellFormat;
+  typedef typename KernelSideFormat::Scalar KernelScalar;
   static const int kCells = KernelSideFormat::kCells;
   static const int kCellWidth = CellFormat::kWidth;
   static const int kKernelWidth = CellFormat::kWidth * kCells;
   static const int kCellDepth = CellFormat::kDepth;
   static const int kCellSize = CellFormat::kSize;
   static const SideMapOrder kSrcOrder = SrcMapType::kOrder;
-
-  typedef ScalarRoundingOffsetGenerator<QuantizationParams::kRoundingMode>
-      RoundingOffsetGenerator;
+  static const int kZeroPointInputValue =
+      ZeroPointInputValue<KernelScalar>::kValue;
 
   PackingRegisterBlockBase() : complete_src_(nullptr, 0, 0, 0) {}
 
@@ -329,7 +238,7 @@
   // Copies an incomplete block of source data into a local temporary
   // complete block by zero-extending it.
   void MakeCompleteSrc(const SrcMapType& src) {
-    memset(buf_, 0, kKernelWidth * kRegisterSize);
+    memset(buf_, kZeroPointInputValue, kKernelWidth * kRegisterSize);
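+    // Padding with kZeroPointInputValue rather than 0 ensures that padded
+    // entries contribute nothing: Pack() below subtracts this same constant
+    // from every source byte before accumulating it.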
     if (kSrcOrder == SideMapOrder::WidthMajor) {
       for (int w = 0; w < src.width(); w++) {
         memcpy(buf_ + w * kRegisterSize, src.data(w, 0), src.depth());
@@ -345,8 +254,7 @@
   // Packs a complete block into the destination. This is the most
   // critical part and the part that we most typically want to
   // override in architecture-specific optimized specializations.
-  void Pack(PackedSideBlock* dst, int start_width,
-            RoundingOffsetGenerator* rounding_offset_generator) {
+  void Pack(PackedSideBlock* dst, int start_width) {
     std::uint8_t* dst_ptr = dst->current_data();
     for (int cell_start_depth = 0; cell_start_depth < kRegisterSize;
          cell_start_depth += kCellDepth) {
@@ -360,11 +268,12 @@
         for (int w = 0; w < kCellWidth; w++) {
           std::int32_t sum = 0;
           for (int d = 0; d < kCellDepth; d++) {
-            const std::uint8_t raw_src_val = src_cell_map(w, d);
-            const std::uint8_t requantized = Requantize<QuantizationParams>(
-                raw_src_val, rounding_offset_generator);
-            dst_ptr[OffsetIntoCell<CellFormat>(w, d)] = requantized;
-            sum += requantized;
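+            // Offset each source byte by the kernel's zero-point input value:
+            // the wrapped uint8 bit pattern is what gets packed, while the
+            // signed, unwrapped difference feeds the per-slice sums.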
+            const std::uint8_t src_val = src_cell_map(w, d);
+            const std::int16_t kernel_val_unwrapped =
+                src_val - kZeroPointInputValue;
+            const std::uint8_t kernel_val_uint8 = kernel_val_unwrapped;
+            dst_ptr[OffsetIntoCell<CellFormat>(w, d)] = kernel_val_uint8;
+            sum += kernel_val_unwrapped;
           }
           cell_sums_of_each_slice_ptr[w] += sum;
         }
@@ -375,15 +284,12 @@
   }
 };
 
-template <typename QuantizationParams, typename SrcMapType,
-          typename PackedSideBlock>
+template <typename SrcMapType, typename PackedSideBlock>
 class PackingRegisterBlock
-    : public PackingRegisterBlockBase<QuantizationParams, SrcMapType,
-                                      PackedSideBlock> {};
+    : public PackingRegisterBlockBase<SrcMapType, PackedSideBlock> {};
 
 // Large-scale implementation of packing.
-template <typename QuantizationParams, typename SrcMapType,
-          typename PackedSideBlock>
+template <typename SrcMapType, typename PackedSideBlock>
 class PackSideBlockImpl {
  public:
   typedef typename PackedSideBlock::KernelSideFormat KernelSideFormat;
@@ -393,10 +299,8 @@
   static const int kKernelWidth = CellFormat::kWidth * kCells;
   static const int kCellDepth = CellFormat::kDepth;
 
-  typedef PackingRegisterBlock<QuantizationParams, SrcMapType, PackedSideBlock>
+  typedef PackingRegisterBlock<SrcMapType, PackedSideBlock>
       PackingRegisterBlockType;
-  typedef typename PackingRegisterBlockType::RoundingOffsetGenerator
-      RoundingOffsetGenerator;
 
   PackSideBlockImpl(PackedSideBlock* packed_side_block,
                     const SrcMapType& src_map)
@@ -462,14 +366,14 @@
         for (int d = 0; d < register_aligned_depth; d += kRegisterSize) {
           b.UseCompleteSrcInPlace(src_map_.block(start_width, start_depth + d,
                                                  width, kRegisterSize));
-          b.Pack(packed_side_block_, start_width, &rounding_offset_generator_);
+          b.Pack(packed_side_block_, start_width);
         }
       }
       if (register_aligned_depth < depth) {
         b.MakeCompleteSrc(
             src_map_.block(start_width, start_depth + register_aligned_depth,
                            width, depth - register_aligned_depth));
-        b.Pack(packed_side_block_, start_width, &rounding_offset_generator_);
+        b.Pack(packed_side_block_, start_width);
       }
     } else {
       assert(width < kKernelWidth);
@@ -477,7 +381,7 @@
         const int ds = std::min(+kRegisterSize, depth - d);
         b.MakeCompleteSrc(
             src_map_.block(start_width, start_depth + d, width, ds));
-        b.Pack(packed_side_block_, start_width, &rounding_offset_generator_);
+        b.Pack(packed_side_block_, start_width);
       }
     }
   }
@@ -488,24 +392,10 @@
   // A map on the block of the original matrix block being packed,
   // i.e. the 'source'.
   const SrcMapType& src_map_;
-
-  // Used for requantization in the less-than-8-bit case.
-  // Otherwise unused.
-  RoundingOffsetGenerator rounding_offset_generator_;
-};
-
-// Quantization parameters for the side (LHS or RHS) being packed,
-// with the rounding strategy having been already resolved to a specific
-// rounding mode.
-template <typename tBitDepth, RoundingMode tRoundingMode>
-struct QuantizationParams {
-  typedef tBitDepth BitDepth;
-  static const RoundingMode kRoundingMode = tRoundingMode;
 };
 
 // Packs a block of the input LHS matrix, into a PackedSideBlock
-template <typename BitDepthParams, typename PackedSideBlock,
-          typename MatrixMapType>
+template <typename PackedSideBlock, typename MatrixMapType>
 void PackLhs(PackedSideBlock* dst, const MatrixMapType& src) {
   ScopedProfilingLabel label("pack LHS");
   static const SideMapOrder kSideMapOrder =
@@ -514,29 +404,13 @@
   typedef typename MatrixMapType::Scalar Scalar;
   typedef SideMap<Scalar, kSideMapOrder> SideMapType;
   SideMapType src_side_map(src.data(), src.rows(), src.cols(), src.stride());
-  typedef typename BitDepthParams::LhsBitDepth BitDepth;
-  typedef typename BitDepthParams::RoundingStrategy RoundingStrategy;
-  const int accumulation_depth = src_side_map.depth();
-  if (accumulation_depth < RoundingStrategy::kRoundingModeSizeThreshold) {
-    typedef QuantizationParams<BitDepth,
-                               RoundingStrategy::kRoundingModeForSmallSizes>
-        QParams;
-    typedef PackSideBlockImpl<QParams, SideMapType, PackedSideBlock> ImplType;
-    ImplType impl(dst, src_side_map);
-    impl.PackL2();
-  } else {
-    typedef QuantizationParams<BitDepth,
-                               RoundingStrategy::kRoundingModeForLargeSizes>
-        QParams;
-    typedef PackSideBlockImpl<QParams, SideMapType, PackedSideBlock> ImplType;
-    ImplType impl(dst, src_side_map);
-    impl.PackL2();
-  }
+  typedef PackSideBlockImpl<SideMapType, PackedSideBlock> ImplType;
+  ImplType impl(dst, src_side_map);
+  impl.PackL2();
 }
 
 // Packs a block of the input RHS matrix, into a PackedSideBlock
-template <typename BitDepthParams, typename PackedSideBlock,
-          typename MatrixMapType>
+template <typename PackedSideBlock, typename MatrixMapType>
 void PackRhs(PackedSideBlock* dst, const MatrixMapType& src) {
   ScopedProfilingLabel label("pack RHS");
   static const SideMapOrder kSideMapOrder =
@@ -545,24 +419,9 @@
   typedef typename MatrixMapType::Scalar Scalar;
   typedef SideMap<Scalar, kSideMapOrder> SideMapType;
   SideMapType src_side_map(src.data(), src.cols(), src.rows(), src.stride());
-  typedef typename BitDepthParams::RhsBitDepth BitDepth;
-  typedef typename BitDepthParams::RoundingStrategy RoundingStrategy;
-  const int accumulation_depth = src_side_map.depth();
-  if (accumulation_depth < RoundingStrategy::kRoundingModeSizeThreshold) {
-    typedef QuantizationParams<BitDepth,
-                               RoundingStrategy::kRoundingModeForSmallSizes>
-        QParams;
-    typedef PackSideBlockImpl<QParams, SideMapType, PackedSideBlock> ImplType;
-    ImplType impl(dst, src_side_map);
-    impl.PackL2();
-  } else {
-    typedef QuantizationParams<BitDepth,
-                               RoundingStrategy::kRoundingModeForLargeSizes>
-        QParams;
-    typedef PackSideBlockImpl<QParams, SideMapType, PackedSideBlock> ImplType;
-    ImplType impl(dst, src_side_map);
-    impl.PackL2();
-  }
+  typedef PackSideBlockImpl<SideMapType, PackedSideBlock> ImplType;
+  ImplType impl(dst, src_side_map);
+  impl.PackL2();
 }
 
 }  // namespace gemmlowp
@@ -570,7 +429,7 @@
 #ifdef GEMMLOWP_NEON
 #include "pack_neon.h"
 #elif defined(GEMMLOWP_SSE4)
-#include "pack_SSE.h"
+#include "pack_sse.h"
 #endif
 
 #endif  // GEMMLOWP_INTERNAL_PACK_H_
diff --git a/internal/pack_neon.h b/internal/pack_neon.h
index 4936b49..e212d07 100644
--- a/internal/pack_neon.h
+++ b/internal/pack_neon.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -23,151 +23,19 @@
 
 namespace gemmlowp {
 
-template <RoundingMode tRoundingMode>
-class NEONRoundingOffsetGenerator {
- public:
-  uint8x16_t get() {
-    assert(false);  // This generic path should never be called.
-    return vdupq_n_u8(0);
-  }
-};
-
-// A RoundingOffsetGenerator for rounding-to-nearest, always returning
-// the midpoint value 127.
-template <>
-class NEONRoundingOffsetGenerator<RoundingMode::Nearest> {
- public:
-  uint8x16_t get() { return vdupq_n_u8(127); }
-};
-
-// Variant of NEONRoundingOffsetGenerator that produces
-// random NEON 128-bit vectors using a 8-bit Xorshift.
-template <>
-class NEONRoundingOffsetGenerator<RoundingMode::ProbabilisticXorshift> {
- public:
-  NEONRoundingOffsetGenerator() {
-    uint8_t s = 128;
-    std::uint8_t a[16];
-    for (int i = 0; i < 16; i++) {
-      a[i] = s;
-      // Xorshift8(7,7,1). Very important to choose a different
-      // xorshift than we do in get(), otherwise lanes would contain
-      // the same values!
-      s ^= s << 7;
-      s ^= s >> 7;
-      s ^= s << 1;
-    }
-    x_ = vld1q_u8(a);
-  }
-
-  uint8x16_t get() {
-    // Xorshift produces values in [1..255], we want [0..254].
-    uint8x16_t result = vsubq_u8(x_, vdupq_n_u8(1));
-    // Xorshift8(7,5,3)
-    x_ = veorq_u8(x_, vshlq_n_u8(x_, 7));
-    x_ = veorq_u8(x_, vshrq_n_u8(x_, 5));
-    x_ = veorq_u8(x_, vshlq_n_u8(x_, 3));
-    return result;
-  }
-
- private:
-  // State
-  uint8x16_t x_;
-};
-
-// Variant of NEONRoundingOffsetGenerator that produces
-// rounding vectors using an 8-bit add/mod low-discrepancy sequence.
-template <>
-class NEONRoundingOffsetGenerator<RoundingMode::ProbabilisticAddmod> {
- public:
-  NEONRoundingOffsetGenerator() {
-    uint8_t s = 128;
-    std::uint8_t a[16];
-    // The initial offset is set by offsetting each lane to one
-    // more iteration of the sequence (s0...s15)  Then, upon iteration,
-    // each lane moves ahead by 16.
-    for (int i = 0; i < 16; i++) {
-      a[i] = s;
-      s += (97 + (s >= 158));
-    }
-    x_ = vld1q_u8(a);
-  }
-
-  uint8x16_t get() {
-    // Get moves the lane ahead by 16 iterations of the sequence
-    // x_ = (x + (16*97)) % 255.  (16*97)%255 = 22.  255-22=233,
-    // so x_ += (22 + (x >= 233)).
-    // There's an excessively opaque bit hack here:
-    // A "true" compare on NEON produces an all-1s result (0xff).
-    // So instead of adding in the comparison result, we subtract it
-    // to get the same effect as adding 1.
-    uint8x16_t extra_one = vcgeq_u8(x_, vdupq_n_u8(233));
-    x_ = vaddq_u8(x_, vdupq_n_u8(22));
-    x_ = vsubq_u8(x_, extra_one);
-    return x_;
-  }
-
- private:
-  // State
-  uint8x16_t x_;
-};
-
-// Requantizes source uint8 values in [0..255] range
-// to the range specified by BitDepth, [0..((2^bits)-1)].
-// Bias must be avoided. Currently this is achieved
-// by probabilistic rounding.
-template <typename QuantizationParams>
-uint8x16_t Requantize(
-    uint8x16_t raw_src_data,
-    NEONRoundingOffsetGenerator<QuantizationParams::kRoundingMode>*
-        rounding_offset_generator) {
-  static const int kBits = QuantizationParams::BitDepth::kBits;
-  static const std::uint8_t kMaxVal = (1 << kBits) - 1;
-
-  if (kBits == 8) {
-    return raw_src_data;
-  }
-
-  uint8x16_t rounding_offset = rounding_offset_generator->get();
-
-  // Compute:
-  //   x = maxval * src + rounding_offset
-  uint16x8_t x[2];
-  const uint8x8_t maxval_dup = vdup_n_u8(kMaxVal);
-  x[0] = vmlal_u8(vmovl_u8(vget_low_u8(rounding_offset)), maxval_dup,
-                  vget_low_u8(raw_src_data));
-  x[1] = vmlal_u8(vmovl_u8(vget_high_u8(rounding_offset)), maxval_dup,
-                  vget_high_u8(raw_src_data));
-
-  // Divide by 255 (truncating).
-  //
-  // Here we use the following formula, valid for all integers y in 0..65534
-  // (which is more than we need since we've already early-returned
-  // if kBits==8).
-  //
-  //     y/255 = (y + 1 + (y >> 8)) >> 8.
-  uint8x8_t result[2];
-  for (int i = 0; i < 2; i++) {
-    result[i] = vshrn_n_u16(
-        vaddq_u16(vaddq_u16(x[i], vdupq_n_u16(1)), vshrq_n_u16(x[i], 8)), 8);
-  }
-
-  return vcombine_u8(result[0], result[1]);
-}
-
 typedef SideMap<const std::uint8_t, SideMapOrder::WidthMajor>
     WidthMajorUint8SideMap;
 
 template <int Cells>
 using DepthMajorSideFormatNCells4x2 = KernelSideFormat<CellFormat<4, 2>, Cells>;
 
-template <typename QuantizationParams, int Cells>
+template <int Cells>
 class PackingRegisterBlock<
-    QuantizationParams, WidthMajorUint8SideMap,
-    PackedSideBlock<DepthMajorSideFormatNCells4x2<Cells> > >
+    WidthMajorUint8SideMap,
+    PackedSideBlock<DepthMajorSideFormatNCells4x2<Cells>>>
     : public PackingRegisterBlockBase<
-          QuantizationParams, WidthMajorUint8SideMap,
-          PackedSideBlock<DepthMajorSideFormatNCells4x2<Cells> > > {
+          WidthMajorUint8SideMap,
+          PackedSideBlock<DepthMajorSideFormatNCells4x2<Cells>>> {
  public:
   typedef DepthMajorSideFormatNCells4x2<Cells> KernelSideFormat;
   typedef typename KernelSideFormat::Cell CellFormat;
@@ -177,19 +45,14 @@
   static const int kCellDepth = CellFormat::kDepth;
   static const int kCellSize = CellFormat::kSize;
 
-  typedef NEONRoundingOffsetGenerator<QuantizationParams::kRoundingMode>
-      RoundingOffsetGenerator;
-
-  void Pack(PackedSideBlock<KernelSideFormat>* dst, int start_width,
-            RoundingOffsetGenerator* rounding_offset_generator) {
+  void Pack(PackedSideBlock<KernelSideFormat>* dst, int start_width) {
     std::uint8_t* dst_ptr = dst->current_data();
     const std::uint8_t* const src_ptr = this->complete_src_.data();
     const int stride = this->complete_src_.stride();
-    // Load and requantize source WidthMajor data
+    // Load source WidthMajor data
     uint8x16_t src_lines[4 * kCells];
     for (int i = 0; i < 4 * kCells; i++) {
-      src_lines[i] = Requantize<QuantizationParams>(
-          vld1q_u8(src_ptr + i * stride), rounding_offset_generator);
+      src_lines[i] = vld1q_u8(src_ptr + i * stride);
     }
     // Reorder the data within registers to make DepthMajor 4x2 cells
     uint8x16x2_t src_lines_intertwined_2x[2 * kCells];
@@ -267,13 +130,13 @@
 using WidthMajorSideFormatNCells4x2 =
     KernelSideFormat<CellFormat<4, 2, CellOrder::WidthMajor>, Cells>;
 
-template <typename QuantizationParams, int Cells>
+template <int Cells>
 class PackingRegisterBlock<
-    QuantizationParams, WidthMajorUint8SideMap,
-    PackedSideBlock<WidthMajorSideFormatNCells4x2<Cells> > >
+    WidthMajorUint8SideMap,
+    PackedSideBlock<WidthMajorSideFormatNCells4x2<Cells>>>
     : public PackingRegisterBlockBase<
-          QuantizationParams, WidthMajorUint8SideMap,
-          PackedSideBlock<WidthMajorSideFormatNCells4x2<Cells> > > {
+          WidthMajorUint8SideMap,
+          PackedSideBlock<WidthMajorSideFormatNCells4x2<Cells>>> {
  public:
   typedef WidthMajorSideFormatNCells4x2<Cells> KernelSideFormat;
   typedef typename KernelSideFormat::Cell CellFormat;
@@ -283,15 +146,11 @@
   static const int kCellDepth = CellFormat::kDepth;
   static const int kCellSize = CellFormat::kSize;
 
-  typedef NEONRoundingOffsetGenerator<QuantizationParams::kRoundingMode>
-      RoundingOffsetGenerator;
-
-  void Pack(PackedSideBlock<KernelSideFormat>* dst, int start_width,
-            RoundingOffsetGenerator* rounding_offset_generator) {
+  void Pack(PackedSideBlock<KernelSideFormat>* dst, int start_width) {
     std::uint8_t* dst_ptr = dst->current_data();
     const std::uint8_t* src_ptr = this->complete_src_.data();
     const int stride = this->complete_src_.stride();
-    // Load and requantize source WidthMajor data
+    // Load source WidthMajor data
     uint16x8_t src_lines[kCells * 4];
     for (int i = 0; i < kCells; i++) {
 // This packing path is used with our current
@@ -299,9 +158,8 @@
 // results in substantially faster code (thanks to better
 // register allocation) on Nexus 5.
 
-#define GEMMLOWP_UNROLLED_LOOP_ITER(k)                                        \
-  src_lines[4 * i + k] = vreinterpretq_u16_u8(Requantize<QuantizationParams>( \
-      vld1q_u8(src_ptr), rounding_offset_generator));                         \
+#define GEMMLOWP_UNROLLED_LOOP_ITER(k)                            \
+  src_lines[4 * i + k] = vreinterpretq_u16_u8(vld1q_u8(src_ptr)); \
   src_ptr += stride;
 
       GEMMLOWP_UNROLLED_LOOP_ITER(0)
@@ -385,6 +243,78 @@
   }
 };
 
+#ifdef GEMMLOWP_NEON_32
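+// vpaddq_s16 is only available as an intrinsic on 64-bit ARM, so on 32-bit ARM
+// emulate the 128-bit pairwise addition with two 64-bit vpadd_s16 and a
+// vcombine_s16.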
+inline int16x8_t vpaddq_s16(int16x8_t a, int16x8_t b) {
+  const int16x4_t c = vpadd_s16(vget_low_s16(a), vget_high_s16(a));
+  const int16x4_t d = vpadd_s16(vget_low_s16(b), vget_high_s16(b));
+  return vcombine_s16(c, d);
+}
+#endif
+
+template <int Width>
+using Int8FastKernelFormat =
+    KernelSideFormatInt8<CellFormat<Width, 16, CellOrder::WidthMajor>, 1>;
+
+template <int Width>
+class PackingRegisterBlock<WidthMajorUint8SideMap,
+                           PackedSideBlock<Int8FastKernelFormat<Width>>>
+    : public PackingRegisterBlockBase<
+          WidthMajorUint8SideMap,
+          PackedSideBlock<Int8FastKernelFormat<Width>>> {
+ public:
+  static_assert(Width == 2 || Width == 4, "");
+  typedef Int8FastKernelFormat<Width> KernelSideFormat;
+  typedef typename KernelSideFormat::Cell CellFormat;
+  static const int kCells = KernelSideFormat::kCells;
+  static const int kCellWidth = CellFormat::kWidth;
+  static const int kKernelWidth = CellFormat::kWidth * kCells;
+  static const int kCellDepth = CellFormat::kDepth;
+  static const int kCellSize = CellFormat::kSize;
+
+  void Pack(PackedSideBlock<KernelSideFormat>* dst, int start_width) {
+    std::int32_t* sums_ptr = dst->sums_of_each_slice() + start_width;
+    std::uint8_t* dst_ptr = dst->current_data();
+    const std::uint8_t* const src_ptr = this->complete_src_.data();
+    const int stride = this->complete_src_.stride();
+    // Load source WidthMajor data
+    uint8x16_t src_lines[Width];
+    for (int i = 0; i < Width; i++) {
+      src_lines[i] = vld1q_u8(src_ptr + i * stride);
+    }
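+    // Flipping the sign bit maps each uint8 value v to the bit pattern of the
+    // int8 value v - 128; the stored bytes, and the per-slice sums computed
+    // below, are then treated as signed 8-bit values.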
+    const uint8x16_t sign_bit_dup = vdupq_n_u8(0x80);
+    for (int i = 0; i < Width; i++) {
+      src_lines[i] = veorq_u8(src_lines[i], sign_bit_dup);
+    }
+    for (int i = 0; i < Width; i++) {
+      vst1q_u8(dst_ptr + 16 * i, src_lines[i]);
+    }
+    int16x8_t sums2[Width];
+    for (int i = 0; i < Width; i++) {
+      const int8x8_t lo = vreinterpret_s8_u8(vget_low_u8(src_lines[i]));
+      const int8x8_t hi = vreinterpret_s8_u8(vget_high_u8(src_lines[i]));
+      sums2[i] = vaddl_s8(lo, hi);
+    }
+    int16x8_t sums4[Width / 2];
+    for (int i = 0; i < Width / 2; i++) {
+      sums4[i] = vpaddq_s16(sums2[2 * i], sums2[2 * i + 1]);
+    }
+    if (Width == 4) {
+      int32x4_t sum = vld1q_s32(sums_ptr);
+      int16x8_t sums8 = vpaddq_s16(sums4[0], sums4[1]);
+      sum = vpadalq_s16(sum, sums8);
+      vst1q_s32(sums_ptr, sum);
+    } else {
+      assert(Width == 2);
+      int32x2_t sum = vld1_s32(sums_ptr);
+      int16x4_t sums8 =
+          vpadd_s16(vget_low_s16(sums4[0]), vget_high_s16(sums4[0]));
+      sum = vpadal_s16(sum, sums8);
+      vst1_s32(sums_ptr, sum);
+    }
+    dst->seek_forward_n_cells(1);
+  }
+};
+
 }  // namespace gemmlowp
 
 #endif  // GEMMLOWP_INTERNAL_PACK_NEON_H_
diff --git a/internal/pack_sse.h b/internal/pack_sse.h
new file mode 100644
index 0000000..52163c4
--- /dev/null
+++ b/internal/pack_sse.h
@@ -0,0 +1,128 @@
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// pack_sse.h: optimized SSE specializations of the templates in pack.h.
+
+#ifndef GEMMLOWP_INTERNAL_PACK_SSE_H_
+#define GEMMLOWP_INTERNAL_PACK_SSE_H_
+
+#include <smmintrin.h>
+#include "pack.h"
+
+namespace gemmlowp {
+
+// TODO: Add DepthMajorUint8SideMap
+
+typedef SideMap<const std::uint8_t, SideMapOrder::WidthMajor>
+    WidthMajorUint8SideMap;
+
+template <int Cells>
+using WidthMajorSideFormatNCells4x2 =
+    KernelSideFormat<CellFormat<4, 2, CellOrder::WidthMajor>, Cells>;
+
+template <int Cells>
+class PackingRegisterBlock<
+    WidthMajorUint8SideMap,
+    PackedSideBlock<WidthMajorSideFormatNCells4x2<Cells> > >
+    : public PackingRegisterBlockBase<
+          WidthMajorUint8SideMap,
+          PackedSideBlock<WidthMajorSideFormatNCells4x2<Cells> > > {
+ public:
+  typedef WidthMajorSideFormatNCells4x2<Cells> KernelSideFormat;
+  typedef typename KernelSideFormat::Cell CellFormat;
+  static const int kCells = KernelSideFormat::kCells;
+  static const int kCellWidth = CellFormat::kWidth;
+  static const int kKernelWidth = CellFormat::kWidth * kCells;
+  static const int kCellDepth = CellFormat::kDepth;
+  static const int kCellSize = CellFormat::kSize;
+
+  void Pack(PackedSideBlock<KernelSideFormat>* dst, int start_width) {
+    std::uint8_t* dst_ptr = dst->current_data();
+    const int width_stride = this->complete_src_.width_stride();
+    int depth_step = 8;
+
+    __m128i one = _mm_set1_epi16(1);
+    for (int cell_start_depth = 0; cell_start_depth < kRegisterSize;
+         cell_start_depth += depth_step) {
+      for (int cell_start_width = 0; cell_start_width < kKernelWidth;
+           cell_start_width += kCellWidth) {
+        std::int32_t* cell_sums_of_each_slice_ptr =
+            dst->sums_of_each_slice() + start_width + cell_start_width;
+        const std::uint8_t* src_data =
+            this->complete_src_.data(cell_start_width, cell_start_depth);
+
+        __m128i xmm1 =
+            _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&src_data[0]));
+        __m128i xmm2 = _mm_loadl_epi64(
+            reinterpret_cast<const __m128i*>(&src_data[1 * width_stride]));
+        __m128i xmm3 = _mm_loadl_epi64(
+            reinterpret_cast<const __m128i*>(&src_data[2 * width_stride]));
+        __m128i xmm4 = _mm_loadl_epi64(
+            reinterpret_cast<const __m128i*>(&src_data[3 * width_stride]));
+
+        __m128i xmm5 = _mm_unpacklo_epi16(xmm1, xmm2);
+        __m128i xmm8 = _mm_shuffle_epi32(xmm5, 0x31);
+
+        __m128i xmm6 = _mm_unpacklo_epi16(xmm3, xmm4);
+        __m128i xmm7 = _mm_shuffle_epi32(xmm6, 0x80);
+
+        __m128i xmm9 = _mm_blend_epi16(xmm5, xmm7, 0xcc);
+        __m128i xmm10 = _mm_blend_epi16(xmm8, xmm6, 0xcc);
+
+        _mm_storel_epi64(reinterpret_cast<__m128i*>(&dst_ptr[0]), xmm9);
+        _mm_storel_epi64(
+            reinterpret_cast<__m128i*>(&dst_ptr[kCellSize * kCells]), xmm10);
+
+        __m128i xmm11 = _mm_shuffle_epi32(xmm9, 0xee);
+        __m128i xmm12 = _mm_shuffle_epi32(xmm10, 0xee);
+
+        _mm_storel_epi64(
+            reinterpret_cast<__m128i*>(&dst_ptr[2 * kCellSize * kCells]),
+            xmm11);
+        _mm_storel_epi64(
+            reinterpret_cast<__m128i*>(&dst_ptr[3 * kCellSize * kCells]),
+            xmm12);
+
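+        // Update the per-slice sums: widen each packed 8-byte cell to 16-bit
+        // lanes and multiply-add against a vector of ones, which sums adjacent
+        // 16-bit lanes into 32-bit partial sums.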
+        xmm1 = _mm_cvtepu8_epi16(xmm9);
+        xmm2 = _mm_madd_epi16(xmm1, one);
+        __m128i sums_of_each_slice_xmm = _mm_loadu_si128(
+            reinterpret_cast<const __m128i*>(&cell_sums_of_each_slice_ptr[0]));
+        sums_of_each_slice_xmm = _mm_add_epi32(sums_of_each_slice_xmm, xmm2);
+
+        xmm1 = _mm_cvtepu8_epi16(xmm10);
+        xmm2 = _mm_madd_epi16(xmm1, one);
+        sums_of_each_slice_xmm = _mm_add_epi32(sums_of_each_slice_xmm, xmm2);
+
+        xmm1 = _mm_cvtepu8_epi16(xmm11);
+        xmm2 = _mm_madd_epi16(xmm1, one);
+        sums_of_each_slice_xmm = _mm_add_epi32(sums_of_each_slice_xmm, xmm2);
+
+        xmm1 = _mm_cvtepu8_epi16(xmm12);
+        xmm2 = _mm_madd_epi16(xmm1, one);
+        sums_of_each_slice_xmm = _mm_add_epi32(sums_of_each_slice_xmm, xmm2);
+
+        _mm_storeu_si128(
+            reinterpret_cast<__m128i*>(&cell_sums_of_each_slice_ptr[0]),
+            sums_of_each_slice_xmm);
+        dst_ptr += kCellSize;
+      }
+      dst_ptr += 3 * kCellSize * kCells;
+    }
+    dst->seek_forward_n_cells(kCells * kRegisterSize / kCellDepth);
+  }
+};
+
+}  // namespace gemmlowp
+
+#endif  // GEMMLOWP_INTERNAL_PACK_SSE_H_
diff --git a/internal/simd_wrappers.h b/internal/simd_wrappers.h
new file mode 100644
index 0000000..e39eaf8
--- /dev/null
+++ b/internal/simd_wrappers.h
@@ -0,0 +1,508 @@
+// Copyright 2017 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// simd_wrappers.h: some inline functions wrapping SIMD intrinsics,
+// extending the set of such functions from fixedpoint.h.
+
+#ifndef GEMMLOWP_INTERNAL_SIMD_WRAPPERS_H_
+#define GEMMLOWP_INTERNAL_SIMD_WRAPPERS_H_
+
+#include <algorithm>
+#include <type_traits>
+#include "../fixedpoint/fixedpoint.h"
+
+namespace gemmlowp {
+
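+// Maps a (scalar type, scalar count) pair to the register type used to hold
+// such values. This generic definition uses plain scalars; without a
+// SIMD-specific specialization, the register-block code below reduces to
+// scalar loops.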
+template <typename ScalarType, int ScalarCount>
+struct RegisterType {
+  using Type = ScalarType;
+};
+
+inline std::int32_t Min(std::int32_t a, std::int32_t b) {
+  return std::min(a, b);
+}
+
+inline std::int32_t Max(std::int32_t a, std::int32_t b) {
+  return std::max(a, b);
+}
+
+inline void MulAdd(std::int32_t lhs, std::int32_t rhs, std::int32_t* acc) {
+  *acc += lhs * rhs;
+}
+
+template <typename tScalarType, int tScalarCount>
+struct RegisterBuffer {
+  using ScalarType = tScalarType;
+  static constexpr int kScalarCount = tScalarCount;
+  using RegisterType = typename RegisterType<ScalarType, kScalarCount>::Type;
+  static_assert((kScalarCount & (kScalarCount - 1)) == 0,
+                "kScalarCount must be a power of two");
+  static_assert(sizeof(RegisterType) % sizeof(ScalarType) == 0, "");
+  static constexpr int kRegisterLanes =
+      sizeof(RegisterType) / sizeof(ScalarType);
+  static constexpr int kRegisterCount =
+      (kScalarCount * sizeof(ScalarType) + sizeof(RegisterType) - 1) /
+      sizeof(RegisterType);
+
+  RegisterType reg[kRegisterCount];
+};
+
+template <typename tScalarType, int tRows, int tCols>
+struct RegisterBlock {
+  using ScalarType = tScalarType;
+  static constexpr int kRows = tRows;
+  static constexpr int kCols = tCols;
+  static constexpr int kScalarCount = kRows * kCols;
+  using BufferType = RegisterBuffer<ScalarType, kScalarCount>;
+  using RegisterType = typename BufferType::RegisterType;
+  static constexpr int kRegisterCount = BufferType::kRegisterCount;
+  static constexpr int kRegisterLanes = BufferType::kRegisterLanes;
+
+  BufferType buf;
+};
+
+template <typename RegisterBlockType>
+struct RegisterBlockAddImpl {
+  static RegisterBlockType Run(const RegisterBlockType& lhs,
+                               const RegisterBlockType& rhs) {
+    RegisterBlockType result;
+    for (int i = 0; i < RegisterBlockType::kRegisterCount; i++) {
+      result.buf.reg[i] = Add(lhs.buf.reg[i], rhs.buf.reg[i]);
+    }
+    return result;
+  }
+};
+
+template <typename RegisterBlockType>
+RegisterBlockType RegisterBlockAdd(const RegisterBlockType& lhs,
+                                   const RegisterBlockType& rhs) {
+  return RegisterBlockAddImpl<RegisterBlockType>::Run(lhs, rhs);
+}
+
+template <typename LhsType, typename RhsType>
+struct ShouldFlipLhsRhs {
+  static constexpr bool kValue =
+      (LhsType::kScalarCount < RhsType::kScalarCount) ||
+      (LhsType::kScalarCount == RhsType::kScalarCount &&
+       (LhsType::kRows < RhsType::kRows));
+};
+
+template <typename LhsType, typename RhsType,
+          bool Flip = ShouldFlipLhsRhs<LhsType, RhsType>::kValue>
+struct FlipLhsRhs {
+  using FlippedLhsType = LhsType;
+  using FlippedRhsType = RhsType;
+  static const FlippedLhsType& FlippedLhs(const LhsType& lhs,
+                                          const RhsType& rhs) {
+    return lhs;
+  }
+  static const FlippedRhsType& FlippedRhs(const LhsType& lhs,
+                                          const RhsType& rhs) {
+    return rhs;
+  }
+};
+
+template <typename LhsType, typename RhsType>
+struct FlipLhsRhs<LhsType, RhsType, true> {
+  using FlippedLhsType = RhsType;
+  using FlippedRhsType = LhsType;
+  static const FlippedLhsType& FlippedLhs(const LhsType& lhs,
+                                          const RhsType& rhs) {
+    return rhs;
+  }
+  static const FlippedRhsType& FlippedRhs(const LhsType& lhs,
+                                          const RhsType& rhs) {
+    return lhs;
+  }
+};
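+// FlipLhsRhs canonicalizes the operand order of the broadcasting binary ops
+// below: the operand with more scalars (or, on a tie in scalar count, more
+// rows) always ends up on the left. Only one specialization per shape pair is
+// then needed; e.g. BroadcastAdd of a 1x1 block with a 4x1 block routes to
+// the same BroadcastAddImpl<RegBlockInt32<4, 1>, RegBlockInt32<1, 1>> as the
+// flipped call.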
+
+template <typename Lhs, typename Rhs>
+struct BroadcastBinaryOpShape {
+  static constexpr int kRows =
+      Lhs::kRows > Rhs::kRows ? Lhs::kRows : Rhs::kRows;
+  static constexpr int kCols =
+      Lhs::kCols > Rhs::kCols ? Lhs::kCols : Rhs::kCols;
+};
+
+template <typename Lhs, typename Rhs>
+struct BroadcastBinaryOpRegisterBlock {
+  using Shape = BroadcastBinaryOpShape<Lhs, Rhs>;
+  using ScalarType = typename Lhs::ScalarType;
+  using Type = RegisterBlock<ScalarType, Shape::kRows, Shape::kCols>;
+};
+
+template <typename Lhs, typename Rhs>
+struct BroadcastAddImpl {
+  using ResultBlockType =
+      typename BroadcastBinaryOpRegisterBlock<Lhs, Rhs>::Type;
+  static ResultBlockType Run(const Lhs& lhs, const Rhs& rhs) {
+    ResultBlockType result;
+    static constexpr int Rows = ResultBlockType::kRows;
+    static constexpr int Cols = ResultBlockType::kCols;
+    static constexpr int LhsRows = Lhs::kRows;
+    static constexpr int LhsCols = Lhs::kCols;
+    static constexpr int RhsRows = Rhs::kRows;
+    static constexpr int RhsCols = Rhs::kCols;
+
+    static_assert(LhsRows == Rows || LhsRows == 1, "");
+    static_assert(RhsRows == Rows || RhsRows == 1, "");
+    static_assert(LhsCols == Cols || LhsCols == 1, "");
+    static_assert(RhsCols == Cols || RhsCols == 1, "");
+    static_assert(ResultBlockType::kRegisterLanes == 1,
+                  "This path is only for scalar values");
+    static_assert(Lhs::kRegisterLanes == 1,
+                  "This path is only for scalar values");
+    static_assert(Rhs::kRegisterLanes == 1,
+                  "This path is only for scalar values");
+
+    for (int c = 0; c < Cols; c++) {
+      const int lhs_c = LhsCols == Cols ? c : 0;
+      const int rhs_c = RhsCols == Cols ? c : 0;
+      for (int r = 0; r < Rows; r++) {
+        const int lhs_r = LhsRows == Rows ? r : 0;
+        const int rhs_r = RhsRows == Rows ? r : 0;
+        result.buf.reg[r + c * Rows] =
+            Add(lhs.buf.reg[lhs_r + lhs_c * LhsRows],
+                rhs.buf.reg[rhs_r + rhs_c * RhsRows]);
+      }
+    }
+    return result;
+  }
+};
+
+template <typename Lhs, typename Rhs>
+typename BroadcastBinaryOpRegisterBlock<Lhs, Rhs>::Type BroadcastAdd(
+    const Lhs& lhs, const Rhs& rhs) {
+  using Flip = FlipLhsRhs<Lhs, Rhs>;
+  return BroadcastAddImpl<
+      typename Flip::FlippedLhsType,
+      typename Flip::FlippedRhsType>::Run(Flip::FlippedLhs(lhs, rhs),
+                                          Flip::FlippedRhs(lhs, rhs));
+}
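+// Illustrative use, e.g. adding a per-row bias to a 4x4 accumulator block
+// (on NEON/SSE this resolves to a specialization in
+// simd_wrappers_common_neon_sse.h rather than the scalar loop above):
+//   RegisterBlock<std::int32_t, 4, 4> acc = ...;   // column-major 4x4 block
+//   RegisterBlock<std::int32_t, 4, 1> bias = ...;  // one value per row
+//   auto sum = BroadcastAdd(acc, bias);  // adds bias to every column of acc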
+
+template <typename Lhs, typename Rhs>
+struct BroadcastMulImpl {
+  using ResultBlockType =
+      typename BroadcastBinaryOpRegisterBlock<Lhs, Rhs>::Type;
+  static ResultBlockType Run(const Lhs& lhs, const Rhs& rhs) {
+    ResultBlockType result;
+    static constexpr int Rows = ResultBlockType::kRows;
+    static constexpr int Cols = ResultBlockType::kCols;
+    static constexpr int LhsRows = Lhs::kRows;
+    static constexpr int LhsCols = Lhs::kCols;
+    static constexpr int RhsRows = Rhs::kRows;
+    static constexpr int RhsCols = Rhs::kCols;
+    static_assert(ResultBlockType::kRegisterLanes == 1,
+                  "This path is only for scalar values");
+    static_assert(Lhs::kRegisterLanes == 1,
+                  "This path is only for scalar values");
+    static_assert(Rhs::kRegisterLanes == 1,
+                  "This path is only for scalar values");
+
+    static_assert(LhsRows == Rows || LhsRows == 1, "");
+    static_assert(RhsRows == Rows || RhsRows == 1, "");
+    static_assert(LhsCols == Cols || LhsCols == 1, "");
+    static_assert(RhsCols == Cols || RhsCols == 1, "");
+    for (int c = 0; c < Cols; c++) {
+      const int lhs_c = LhsCols == Cols ? c : 0;
+      const int rhs_c = RhsCols == Cols ? c : 0;
+      for (int r = 0; r < Rows; r++) {
+        const int lhs_r = LhsRows == Rows ? r : 0;
+        const int rhs_r = RhsRows == Rows ? r : 0;
+        result.buf.reg[r + c * Rows] =
+            Mul(lhs.buf.reg[lhs_r + lhs_c * LhsRows],
+                rhs.buf.reg[rhs_r + rhs_c * RhsRows]);
+      }
+    }
+    return result;
+  }
+};
+
+template <typename Lhs, typename Rhs>
+typename BroadcastBinaryOpRegisterBlock<Lhs, Rhs>::Type BroadcastMul(
+    const Lhs& lhs, const Rhs& rhs) {
+  using Flip = FlipLhsRhs<Lhs, Rhs>;
+  return BroadcastMulImpl<
+      typename Flip::FlippedLhsType,
+      typename Flip::FlippedRhsType>::Run(Flip::FlippedLhs(lhs, rhs),
+                                          Flip::FlippedRhs(lhs, rhs));
+}
+
+template <typename Lhs, typename Rhs, typename Acc>
+struct BroadcastMulAddImpl {
+  static void Run(const Lhs& lhs, const Rhs& rhs, Acc* acc) {
+    static constexpr int Rows = Acc::kRows;
+    static constexpr int Cols = Acc::kCols;
+    static constexpr int LhsRows = Lhs::kRows;
+    static constexpr int LhsCols = Lhs::kCols;
+    static constexpr int RhsRows = Rhs::kRows;
+    static constexpr int RhsCols = Rhs::kCols;
+    static_assert(Acc::kRegisterLanes == 1,
+                  "This path is only for scalar values");
+    static_assert(Lhs::kRegisterLanes == 1,
+                  "This path is only for scalar values");
+    static_assert(Rhs::kRegisterLanes == 1,
+                  "This path is only for scalar values");
+
+    static_assert(LhsRows == Rows || LhsRows == 1, "");
+    static_assert(RhsRows == Rows || RhsRows == 1, "");
+    static_assert(LhsCols == Cols || LhsCols == 1, "");
+    static_assert(RhsCols == Cols || RhsCols == 1, "");
+    for (int c = 0; c < Cols; c++) {
+      const int lhs_c = LhsCols == Cols ? c : 0;
+      const int rhs_c = RhsCols == Cols ? c : 0;
+      for (int r = 0; r < Rows; r++) {
+        const int lhs_r = LhsRows == Rows ? r : 0;
+        const int rhs_r = RhsRows == Rows ? r : 0;
+        MulAdd(lhs.buf.reg[lhs_r + lhs_c * LhsRows],
+               rhs.buf.reg[rhs_r + rhs_c * RhsRows],
+               &acc->buf.reg[r + c * Rows]);
+      }
+    }
+  }
+};
+
+template <typename Lhs, typename Rhs, typename Acc>
+void BroadcastMulAdd(const Lhs& lhs, const Rhs& rhs, Acc* acc) {
+  using Flip = FlipLhsRhs<Lhs, Rhs>;
+  BroadcastMulAddImpl<typename Flip::FlippedLhsType,
+                      typename Flip::FlippedRhsType,
+                      Acc>::Run(Flip::FlippedLhs(lhs, rhs),
+                                Flip::FlippedRhs(lhs, rhs), acc);
+}
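+// BroadcastMulAdd is how the unpack stage (unpack.h) accumulates the
+// quantization-offset cross terms; e.g.
+//   BroadcastMulAdd(lhs_sums_of_each_slice_block, rhs_offset_block, &acc);
+// adds rhs_offset[c] * lhs_sums[r] to each accumulator acc[r][c], following
+// the same row/column broadcasting rules as BroadcastAdd/BroadcastMul above.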
+
+template <typename RegisterBlockType, typename SrcObjectType>
+struct LoadImpl {
+  static_assert(std::is_same<SrcObjectType, void>::value,
+                "This generic impl should never be hit");
+};
+
+template <typename ScalarType, int Rows, int Cols, typename SrcScalarType>
+struct LoadImpl<RegisterBlock<ScalarType, Rows, Cols>,
+                MatrixMap<SrcScalarType, MapOrder::ColMajor>> {
+  using RegisterBlockType = RegisterBlock<ScalarType, Rows, Cols>;
+  using SrcObjectType = MatrixMap<SrcScalarType, MapOrder::ColMajor>;
+  static RegisterBlockType Run(const SrcObjectType& src, int row, int col) {
+    RegisterBlockType result;
+    int i = 0;
+    for (int c = 0; c < Cols; c++) {
+      const ScalarType* src_ptr = src.data(row, col + c);
+      for (int r = 0; r < Rows; r++) {
+        result.buf.reg[i++] = *src_ptr++;
+      }
+    }
+    return result;
+  }
+};
+
+template <typename ScalarType, int Rows, int Cols, typename SrcScalarType,
+          VectorShape Shape>
+struct LoadImpl<RegisterBlock<ScalarType, Rows, Cols>,
+                VectorMap<SrcScalarType, Shape>> {
+  using RegisterBlockType = RegisterBlock<ScalarType, Rows, Cols>;
+  using SrcObjectType = VectorMap<SrcScalarType, Shape>;
+  static RegisterBlockType Run(const SrcObjectType& src, int pos) {
+    static_assert(Shape == VectorShape::Col || Rows == 1, "");
+    static_assert(Shape == VectorShape::Row || Cols == 1, "");
+    RegisterBlockType result;
+    for (int i = 0; i < Rows * Cols; i++) {
+      result.buf.reg[i] = src(pos + i);
+    }
+    return result;
+  }
+};
+
+template <typename ScalarType, int Rows, int Cols, typename SrcScalarType,
+          VectorShape Shape>
+struct LoadImpl<RegisterBlock<ScalarType, Rows, Cols>,
+                VectorDup<SrcScalarType, Shape>> {
+  using RegisterBlockType = RegisterBlock<ScalarType, Rows, Cols>;
+  using SrcObjectType = VectorDup<SrcScalarType, Shape>;
+  static RegisterBlockType Run(const SrcObjectType& src, int) {
+    static_assert(Shape == VectorShape::Col || Rows == 1, "");
+    static_assert(Shape == VectorShape::Row || Cols == 1, "");
+    RegisterBlockType result;
+    for (int i = 0; i < Rows * Cols; i++) {
+      result.buf.reg[i] = src(0);
+    }
+    return result;
+  }
+};
+
+template <typename RegisterBlockType, typename SrcObjectType>
+RegisterBlockType Load(const SrcObjectType& src, int row, int col) {
+  return LoadImpl<RegisterBlockType, SrcObjectType>::Run(src, row, col);
+}
+
+template <typename RegisterBlockType, typename SrcObjectType>
+RegisterBlockType Load(const SrcObjectType& src, int pos) {
+  return LoadImpl<RegisterBlockType, SrcObjectType>::Run(src, pos);
+}
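+// Load dispatches on the source type: matrix sources are addressed by
+// (row, col), vector-shaped sources (VectorMap, VectorDup) by a single
+// position. The generic LoadImpl above copies scalar by scalar; the NEON/SSE
+// headers specialize it with vector loads (LoadInt32x4) for register-sized
+// shapes.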
+
+template <typename RegisterBlockType>
+struct LoadContiguousImpl {
+  using ScalarType = typename RegisterBlockType::ScalarType;
+  static_assert(RegisterBlockType::kRegisterLanes == 1,
+                "This path is only for scalar values");
+  static RegisterBlockType Run(const ScalarType* src) {
+    RegisterBlockType result;
+    for (int i = 0; i < RegisterBlockType::kScalarCount; i++) {
+      result.buf.reg[i] = src[i];
+    }
+    return result;
+  }
+};
+
+template <typename RegisterBlockType>
+RegisterBlockType LoadContiguous(
+    const typename RegisterBlockType::ScalarType* src) {
+  return LoadContiguousImpl<RegisterBlockType>::Run(src);
+}
+
+template <int BroadcastRows, int BroadcastCols, typename SrcObjectType>
+struct LoadForBroadcastingShape {};
+
+template <int BroadcastRows, int BroadcastCols, typename ScalarType,
+          VectorShape Shape>
+struct LoadForBroadcastingShape<BroadcastRows, BroadcastCols,
+                                VectorMap<ScalarType, Shape>> {
+  static constexpr int kRows = Shape == VectorShape::Col ? BroadcastRows : 1;
+  static constexpr int kCols = Shape == VectorShape::Row ? BroadcastCols : 1;
+};
+
+template <int BroadcastRows, int BroadcastCols, typename ScalarType,
+          VectorShape Shape>
+struct LoadForBroadcastingShape<BroadcastRows, BroadcastCols,
+                                VectorDup<ScalarType, Shape>> {
+  static constexpr int kRows = 1;
+  static constexpr int kCols = 1;
+};
+
+template <typename RegisterBlockType, typename SrcObjectType>
+struct LoadForBroadcastingRegisterBlock {
+  using Shape =
+      LoadForBroadcastingShape<RegisterBlockType::kRows,
+                               RegisterBlockType::kCols, SrcObjectType>;
+  using ScalarType = typename RegisterBlockType::ScalarType;
+  using Type = RegisterBlock<ScalarType, Shape::kRows, Shape::kCols>;
+};
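+// For broadcasting, only the non-broadcast extent is actually loaded: a
+// column VectorMap feeding an RxC block is loaded as an Rx1 block, a row
+// VectorMap as a 1xC block, and a VectorDup as a single 1x1 value. The
+// broadcasting binary ops above then expand it to the full RxC shape.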
+
+template <typename RegisterBlockType, typename SrcObjectType>
+struct LoadForBroadcastingImpl {
+  static_assert(std::is_same<SrcObjectType, void>::value,
+                "This generic impl should never be hit");
+};
+
+template <typename ScalarType, int Rows, int Cols, typename SrcScalarType,
+          VectorShape Shape>
+struct LoadForBroadcastingImpl<RegisterBlock<ScalarType, Rows, Cols>,
+                               VectorMap<SrcScalarType, Shape>> {
+  using RegisterBlockType = RegisterBlock<ScalarType, Rows, Cols>;
+  using SrcObjectType = VectorMap<SrcScalarType, Shape>;
+  using ResultBlockType =
+      typename LoadForBroadcastingRegisterBlock<RegisterBlockType,
+                                                SrcObjectType>::Type;
+  static_assert(ResultBlockType::kRegisterLanes == 1,
+                "This path is only for scalar values");
+  static ResultBlockType Run(const SrcObjectType& src, int pos) {
+    ResultBlockType result;
+    for (int c = 0; c < ResultBlockType::kCols; c++) {
+      for (int r = 0; r < ResultBlockType::kRows; r++) {
+        const int i = Shape == VectorShape::Col ? r : c;
+        result.buf.reg[r + c * ResultBlockType::kRows] = src(pos + i);
+      }
+    }
+    return result;
+  }
+};
+
+template <typename ScalarType, int Rows, int Cols, typename SrcScalarType,
+          VectorShape Shape>
+struct LoadForBroadcastingImpl<RegisterBlock<ScalarType, Rows, Cols>,
+                               VectorDup<SrcScalarType, Shape>> {
+  using RegisterBlockType = RegisterBlock<ScalarType, Rows, Cols>;
+  using SrcObjectType = VectorDup<SrcScalarType, Shape>;
+  using ResultBlockType =
+      typename LoadForBroadcastingRegisterBlock<RegisterBlockType,
+                                                SrcObjectType>::Type;
+  static_assert(ResultBlockType::kRegisterLanes == 1,
+                "This path is only for scalar values");
+  static ResultBlockType Run(const SrcObjectType& src, int) {
+    ResultBlockType result;
+    for (int c = 0; c < ResultBlockType::kCols; c++) {
+      for (int r = 0; r < ResultBlockType::kRows; r++) {
+        result.buf.reg[r + c * ResultBlockType::kRows] = src(0);
+      }
+    }
+    return result;
+  }
+};
+
+template <typename RegisterBlockType, typename SrcObjectType>
+typename LoadForBroadcastingRegisterBlock<RegisterBlockType,
+                                          SrcObjectType>::Type
+LoadForBroadcasting(const SrcObjectType& src, int row, int col) {
+  return LoadForBroadcastingImpl<RegisterBlockType, SrcObjectType>::Run(
+      src, row, col);
+}
+
+template <typename RegisterBlockType, typename SrcObjectType>
+typename LoadForBroadcastingRegisterBlock<RegisterBlockType,
+                                          SrcObjectType>::Type
+LoadForBroadcasting(const SrcObjectType& src, int pos) {
+  return LoadForBroadcastingImpl<RegisterBlockType, SrcObjectType>::Run(src,
+                                                                        pos);
+}
+
+template <int ConstantValue, typename RegisterBlockType>
+struct AddConstantImpl {
+  static void Run(RegisterBlockType* block) {
+    using RegisterType = typename RegisterBlockType::RegisterType;
+    const RegisterType dup = Dup<RegisterType>(ConstantValue);
+    for (int i = 0; i < RegisterBlockType::kRegisterCount; i++) {
+      block->buf.reg[i] = Add(block->buf.reg[i], dup);
+    }
+  }
+};
+
+template <typename RegisterBlockType>
+struct AddConstantImpl<0, RegisterBlockType> {
+  static void Run(RegisterBlockType*) {
+    // This is a no-op.
+  }
+};
+
+template <int ConstantValue, typename RegisterBlockType>
+void AddConstant(RegisterBlockType* block) {
+  AddConstantImpl<ConstantValue, RegisterBlockType>::Run(block);
+}
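+// AddConstant<N> adds a compile-time constant to every register of a block;
+// the N == 0 specialization is a guaranteed no-op. unpack.h uses it to fold
+// the kernel's ZeroPointInputValue into the lhs/rhs offset blocks, so kernels
+// that do not shift their inputs pay nothing for it.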
+
+template <int N>
+using RegBufferInt32 = RegisterBuffer<std::int32_t, N>;
+template <int N>
+using RegBufferUint8 = RegisterBuffer<std::uint8_t, N>;
+template <int R, int C>
+using RegBlockInt32 = RegisterBlock<std::int32_t, R, C>;
+template <int R, int C>
+using RegBlockUint8 = RegisterBlock<std::uint8_t, R, C>;
+
+}  // end namespace gemmlowp
+
+#if defined GEMMLOWP_NEON
+#include "simd_wrappers_neon.h"
+#elif defined GEMMLOWP_SSE4
+#include "simd_wrappers_sse.h"
+#endif
+
+#endif  // GEMMLOWP_INTERNAL_SIMD_WRAPPERS_H_
diff --git a/internal/simd_wrappers_common_neon_sse.h b/internal/simd_wrappers_common_neon_sse.h
new file mode 100644
index 0000000..3830eb1
--- /dev/null
+++ b/internal/simd_wrappers_common_neon_sse.h
@@ -0,0 +1,646 @@
+// Copyright 2015 Google Inc. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// simd_wrappers_common_neon_sse.h: common SIMD (NEON and SSE) wrapper code
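+//
+// This header only contains specializations of the generic templates from
+// simd_wrappers.h (LoadImpl, LoadForBroadcastingImpl, BroadcastAddImpl,
+// BroadcastMulImpl, BroadcastMulAddImpl) for the small fixed block shapes
+// used elsewhere in gemmlowp, notably by the unpack stage. It is written
+// purely in terms of primitives that simd_wrappers_neon.h and
+// simd_wrappers_sse.h both provide (LoadInt32x4, Dup, DupLane, Add, Mul,
+// MulAdd, MulByRhsLane, ...), which is what makes it shareable between the
+// two backends.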
+
+#ifndef GEMMLOWP_INTERNAL_SIMD_WRAPPERS_COMMON_NEON_SSE_H_
+#define GEMMLOWP_INTERNAL_SIMD_WRAPPERS_COMMON_NEON_SSE_H_
+
+#include "simd_wrappers.h"
+
+namespace gemmlowp {
+
+template <typename SrcScalarType, int N>
+struct LoadImpl<RegBlockInt32<4, N>,
+                MatrixMap<SrcScalarType, MapOrder::ColMajor>> {
+  static RegBlockInt32<4, N> Run(
+      const MatrixMap<SrcScalarType, MapOrder::ColMajor>& src, int row,
+      int col) {
+    RegBlockInt32<4, N> result;
+    for (int i = 0; i < N; i++) {
+      result.buf.reg[i] = LoadInt32x4(src.data(row, col + i));
+    }
+    return result;
+  }
+};
+
+template <typename SrcScalarType, int N>
+struct LoadImpl<RegBlockInt32<8, N>,
+                MatrixMap<SrcScalarType, MapOrder::ColMajor>> {
+  static RegBlockInt32<8, N> Run(
+      const MatrixMap<SrcScalarType, MapOrder::ColMajor>& src, int row,
+      int col) {
+    RegBlockInt32<8, N> result;
+    for (int i = 0; i < N; i++) {
+      result.buf.reg[2 * i + 0] = LoadInt32x4(src.data(row + 0, col + i));
+      result.buf.reg[2 * i + 1] = LoadInt32x4(src.data(row + 4, col + i));
+    }
+    return result;
+  }
+};
+
+template <typename SrcScalarType>
+struct LoadImpl<RegBlockInt32<1, 4>,
+                MatrixMap<SrcScalarType, MapOrder::ColMajor>> {
+  static RegBlockInt32<1, 4> Run(
+      const MatrixMap<SrcScalarType, MapOrder::ColMajor>& src, int row,
+      int col) {
+    RegBlockInt32<1, 4> result;
+    std::int32_t buf[4];
+    for (int i = 0; i < 4; i++) {
+      buf[i] = src(row, col + i);
+    }
+    result.buf.reg[0] = LoadInt32x4(buf);
+    return result;
+  }
+};
+
+template <typename SrcScalarType>
+struct LoadImpl<RegBlockInt32<1, 8>,
+                MatrixMap<SrcScalarType, MapOrder::ColMajor>> {
+  static RegBlockInt32<1, 8> Run(
+      const MatrixMap<SrcScalarType, MapOrder::ColMajor>& src, int row,
+      int col) {
+    RegBlockInt32<1, 8> result;
+    std::int32_t buf[8];
+    for (int i = 0; i < 8; i++) {
+      buf[i] = src(row, col + i);
+    }
+    result.buf.reg[0] = LoadInt32x4(buf);
+    result.buf.reg[1] = LoadInt32x4(buf + 4);
+    return result;
+  }
+};
+
+template <typename SrcScalarType>
+struct LoadImpl<RegBlockInt32<4, 1>,
+                VectorMap<SrcScalarType, VectorShape::Col>> {
+  static RegBlockInt32<4, 1> Run(
+      const VectorMap<SrcScalarType, VectorShape::Col>& src, int pos) {
+    RegBlockInt32<4, 1> result;
+    result.buf.reg[0] = LoadInt32x4(src.data(pos));
+    return result;
+  }
+};
+
+template <typename SrcScalarType>
+struct LoadImpl<RegBlockInt32<4, 1>,
+                VectorDup<SrcScalarType, VectorShape::Col>> {
+  static RegBlockInt32<4, 1> Run(
+      const VectorDup<SrcScalarType, VectorShape::Col>& src, int) {
+    RegBlockInt32<4, 1> result;
+    result.buf.reg[0] = LoadInt32x4(src(0));
+    return result;
+  }
+};
+
+template <typename SrcScalarType, int N>
+struct LoadForBroadcastingImpl<RegBlockInt32<4, N>,
+                               VectorMap<SrcScalarType, VectorShape::Col>> {
+  using SrcObjectType = VectorMap<SrcScalarType, VectorShape::Col>;
+  using RegisterBlockType = RegBlockInt32<4, N>;
+  using ResultBlockType =
+      typename LoadForBroadcastingRegisterBlock<RegisterBlockType,
+                                                SrcObjectType>::Type;
+
+  static ResultBlockType Run(const SrcObjectType& src, int pos) {
+    ResultBlockType result;
+    static_assert(ResultBlockType::kRegisterCount == 1, "");
+    result.buf.reg[0] = LoadInt32x4(src.data(pos));
+    return result;
+  }
+};
+
+template <typename SrcScalarType, int N>
+struct LoadForBroadcastingImpl<RegBlockInt32<8, N>,
+                               VectorMap<SrcScalarType, VectorShape::Col>> {
+  using SrcObjectType = VectorMap<SrcScalarType, VectorShape::Col>;
+  using RegisterBlockType = RegBlockInt32<8, N>;
+  using ResultBlockType =
+      typename LoadForBroadcastingRegisterBlock<RegisterBlockType,
+                                                SrcObjectType>::Type;
+
+  static ResultBlockType Run(const SrcObjectType& src, int pos) {
+    ResultBlockType result;
+    static_assert(ResultBlockType::kRegisterCount == 2, "");
+    result.buf.reg[0] = LoadInt32x4(src.data(pos));
+    result.buf.reg[1] = LoadInt32x4(src.data(pos + 4));
+    return result;
+  }
+};
+
+template <typename SrcScalarType>
+struct LoadForBroadcastingImpl<RegBlockInt32<4, 1>,
+                               VectorMap<SrcScalarType, VectorShape::Row>> {
+  using SrcObjectType = VectorMap<SrcScalarType, VectorShape::Row>;
+  using RegisterBlockType = RegBlockInt32<4, 1>;
+  using ResultBlockType =
+      typename LoadForBroadcastingRegisterBlock<RegisterBlockType,
+                                                SrcObjectType>::Type;
+
+  static ResultBlockType Run(const SrcObjectType& src, int pos) {
+    ResultBlockType result;
+    result.buf.reg[0] = src(pos);
+    return result;
+  }
+};
+
+template <typename SrcScalarType, int N>
+struct LoadForBroadcastingImpl<RegBlockInt32<N, 4>,
+                               VectorMap<SrcScalarType, VectorShape::Row>> {
+  using SrcObjectType = VectorMap<SrcScalarType, VectorShape::Row>;
+  using RegisterBlockType = RegBlockInt32<N, 4>;
+  using ResultBlockType =
+      typename LoadForBroadcastingRegisterBlock<RegisterBlockType,
+                                                SrcObjectType>::Type;
+
+  static ResultBlockType Run(const SrcObjectType& src, int pos) {
+    ResultBlockType result;
+    static_assert(ResultBlockType::kRegisterCount == 1, "");
+    result.buf.reg[0] = LoadInt32x4(src.data(pos));
+    return result;
+  }
+};
+
+template <typename SrcScalarType, int N>
+struct LoadForBroadcastingImpl<RegBlockInt32<N, 8>,
+                               VectorMap<SrcScalarType, VectorShape::Row>> {
+  using SrcObjectType = VectorMap<SrcScalarType, VectorShape::Row>;
+  using RegisterBlockType = RegBlockInt32<N, 8>;
+  using ResultBlockType =
+      typename LoadForBroadcastingRegisterBlock<RegisterBlockType,
+                                                SrcObjectType>::Type;
+
+  static ResultBlockType Run(const SrcObjectType& src, int pos) {
+    ResultBlockType result;
+    static_assert(ResultBlockType::kRegisterCount == 2, "");
+    result.buf.reg[0] = LoadInt32x4(src.data(pos));
+    result.buf.reg[1] = LoadInt32x4(src.data(pos + 4));
+    return result;
+  }
+};
+
+// 4x1 := 4x1 + 1x1
+template <>
+struct BroadcastAddImpl<RegBlockInt32<4, 1>, RegBlockInt32<1, 1>> {
+  static RegBlockInt32<4, 1> Run(const RegBlockInt32<4, 1>& lhs,
+                                 const RegBlockInt32<1, 1>& rhs) {
+    RegBlockInt32<4, 1> result;
+    result.buf.reg[0] = Add(lhs.buf.reg[0], Dup<Int32x4>(rhs.buf.reg[0]));
+    return result;
+  }
+};
+
+// 1x4 := 1x4 + 1x1
+template <>
+struct BroadcastAddImpl<RegBlockInt32<1, 4>, RegBlockInt32<1, 1>> {
+  static RegBlockInt32<1, 4> Run(const RegBlockInt32<1, 4>& lhs,
+                                 const RegBlockInt32<1, 1>& rhs) {
+    RegBlockInt32<1, 4> result;
+    result.buf.reg[0] = Add(lhs.buf.reg[0], Dup<Int32x4>(rhs.buf.reg[0]));
+    return result;
+  }
+};
+
+// 4x1 := 4x1 + 4x1
+template <>
+struct BroadcastAddImpl<RegBlockInt32<4, 1>, RegBlockInt32<4, 1>> {
+  static RegBlockInt32<4, 1> Run(const RegBlockInt32<4, 1>& lhs,
+                                 const RegBlockInt32<4, 1>& rhs) {
+    RegBlockInt32<4, 1> result;
+    result.buf.reg[0] = Add(lhs.buf.reg[0], rhs.buf.reg[0]);
+    return result;
+  }
+};
+
+// 1x4 := 1x4 + 1x4
+template <>
+struct BroadcastAddImpl<RegBlockInt32<1, 4>, RegBlockInt32<1, 4>> {
+  static RegBlockInt32<1, 4> Run(const RegBlockInt32<1, 4>& lhs,
+                                 const RegBlockInt32<1, 4>& rhs) {
+    RegBlockInt32<1, 4> result;
+    result.buf.reg[0] = Add(lhs.buf.reg[0], rhs.buf.reg[0]);
+    return result;
+  }
+};
+
+// 4x4 := 4x4 + 1x4
+template <>
+struct BroadcastAddImpl<RegBlockInt32<4, 4>, RegBlockInt32<1, 4>> {
+  static RegBlockInt32<4, 4> Run(const RegBlockInt32<4, 4>& lhs,
+                                 const RegBlockInt32<1, 4>& rhs) {
+    RegBlockInt32<4, 4> result;
+    result.buf.reg[0] = Add(lhs.buf.reg[0], DupLane<0>(rhs.buf.reg[0]));
+    result.buf.reg[1] = Add(lhs.buf.reg[1], DupLane<1>(rhs.buf.reg[0]));
+    result.buf.reg[2] = Add(lhs.buf.reg[2], DupLane<2>(rhs.buf.reg[0]));
+    result.buf.reg[3] = Add(lhs.buf.reg[3], DupLane<3>(rhs.buf.reg[0]));
+    return result;
+  }
+};
+
+// 4x4 := 4x4 + 4x1
+template <>
+struct BroadcastAddImpl<RegBlockInt32<4, 4>, RegBlockInt32<4, 1>> {
+  static RegBlockInt32<4, 4> Run(const RegBlockInt32<4, 4>& lhs,
+                                 const RegBlockInt32<4, 1>& rhs) {
+    RegBlockInt32<4, 4> result;
+    result.buf.reg[0] = Add(lhs.buf.reg[0], rhs.buf.reg[0]);
+    result.buf.reg[1] = Add(lhs.buf.reg[1], rhs.buf.reg[0]);
+    result.buf.reg[2] = Add(lhs.buf.reg[2], rhs.buf.reg[0]);
+    result.buf.reg[3] = Add(lhs.buf.reg[3], rhs.buf.reg[0]);
+    return result;
+  }
+};
+
+// 8x1 := 8x1 + 1x1
+template <>
+struct BroadcastAddImpl<RegBlockInt32<8, 1>, RegBlockInt32<1, 1>> {
+  static RegBlockInt32<8, 1> Run(const RegBlockInt32<8, 1>& lhs,
+                                 const RegBlockInt32<1, 1>& rhs) {
+    RegBlockInt32<8, 1> result;
+    const Int32x4 p = Dup<Int32x4>(rhs.buf.reg[0]);
+    for (int i = 0; i < 2; i++) {
+      result.buf.reg[i] = Add(lhs.buf.reg[i], p);
+    }
+    return result;
+  }
+};
+
+// 8x1 := 8x1 + 8x1
+template <>
+struct BroadcastAddImpl<RegBlockInt32<8, 1>, RegBlockInt32<8, 1>> {
+  static RegBlockInt32<8, 1> Run(const RegBlockInt32<8, 1>& lhs,
+                                 const RegBlockInt32<8, 1>& rhs) {
+    RegBlockInt32<8, 1> result;
+    for (int i = 0; i < 2; i++) {
+      result.buf.reg[i] = Add(lhs.buf.reg[i], rhs.buf.reg[i]);
+    }
+    return result;
+  }
+};
+
+// 8x4 := 8x4 + 1x4
+template <>
+struct BroadcastAddImpl<RegBlockInt32<8, 4>, RegBlockInt32<1, 4>> {
+  static RegBlockInt32<8, 4> Run(const RegBlockInt32<8, 4>& lhs,
+                                 const RegBlockInt32<1, 4>& rhs) {
+    RegBlockInt32<8, 4> result;
+    result.buf.reg[0] = Add(lhs.buf.reg[0], DupLane<0>(rhs.buf.reg[0]));
+    result.buf.reg[1] = Add(lhs.buf.reg[1], DupLane<0>(rhs.buf.reg[0]));
+    result.buf.reg[2] = Add(lhs.buf.reg[2], DupLane<1>(rhs.buf.reg[0]));
+    result.buf.reg[3] = Add(lhs.buf.reg[3], DupLane<1>(rhs.buf.reg[0]));
+    result.buf.reg[4] = Add(lhs.buf.reg[4], DupLane<2>(rhs.buf.reg[0]));
+    result.buf.reg[5] = Add(lhs.buf.reg[5], DupLane<2>(rhs.buf.reg[0]));
+    result.buf.reg[6] = Add(lhs.buf.reg[6], DupLane<3>(rhs.buf.reg[0]));
+    result.buf.reg[7] = Add(lhs.buf.reg[7], DupLane<3>(rhs.buf.reg[0]));
+    return result;
+  }
+};
+
+// 8x4 := 8x4 + 8x1
+template <>
+struct BroadcastAddImpl<RegBlockInt32<8, 4>, RegBlockInt32<8, 1>> {
+  static RegBlockInt32<8, 4> Run(const RegBlockInt32<8, 4>& lhs,
+                                 const RegBlockInt32<8, 1>& rhs) {
+    RegBlockInt32<8, 4> result;
+    result.buf.reg[0] = Add(lhs.buf.reg[0], rhs.buf.reg[0]);
+    result.buf.reg[1] = Add(lhs.buf.reg[1], rhs.buf.reg[1]);
+    result.buf.reg[2] = Add(lhs.buf.reg[2], rhs.buf.reg[0]);
+    result.buf.reg[3] = Add(lhs.buf.reg[3], rhs.buf.reg[1]);
+    result.buf.reg[4] = Add(lhs.buf.reg[4], rhs.buf.reg[0]);
+    result.buf.reg[5] = Add(lhs.buf.reg[5], rhs.buf.reg[1]);
+    result.buf.reg[6] = Add(lhs.buf.reg[6], rhs.buf.reg[0]);
+    result.buf.reg[7] = Add(lhs.buf.reg[7], rhs.buf.reg[1]);
+    return result;
+  }
+};
+
+// 1x8 := 1x8 + 1x8
+template <>
+struct BroadcastAddImpl<RegBlockInt32<1, 8>, RegBlockInt32<1, 8>> {
+  static RegBlockInt32<1, 8> Run(const RegBlockInt32<1, 8>& lhs,
+                                 const RegBlockInt32<1, 8>& rhs) {
+    RegBlockInt32<1, 8> result;
+    result.buf.reg[0] = Add(lhs.buf.reg[0], rhs.buf.reg[0]);
+    result.buf.reg[1] = Add(lhs.buf.reg[1], rhs.buf.reg[1]);
+    return result;
+  }
+};
+
+// 1x8 := 1x8 + 1x1
+template <>
+struct BroadcastAddImpl<RegBlockInt32<1, 8>, RegBlockInt32<1, 1>> {
+  static RegBlockInt32<1, 8> Run(const RegBlockInt32<1, 8>& lhs,
+                                 const RegBlockInt32<1, 1>& rhs) {
+    RegBlockInt32<1, 8> result;
+    result.buf.reg[0] = Add(lhs.buf.reg[0], Dup<Int32x4>(rhs.buf.reg[0]));
+    result.buf.reg[1] = Add(lhs.buf.reg[1], Dup<Int32x4>(rhs.buf.reg[0]));
+    return result;
+  }
+};
+
+// 4x1 := 4x1 * 1x1
+template <>
+struct BroadcastMulImpl<RegBlockInt32<4, 1>, RegBlockInt32<1, 1>> {
+  static RegBlockInt32<4, 1> Run(const RegBlockInt32<4, 1>& lhs,
+                                 const RegBlockInt32<1, 1>& rhs) {
+    RegBlockInt32<4, 1> result;
+    result.buf.reg[0] = Mul(lhs.buf.reg[0], Dup<Int32x4>(rhs.buf.reg[0]));
+    return result;
+  }
+};
+
+// 4x1 := 4x1 * 4x1
+template <>
+struct BroadcastMulImpl<RegBlockInt32<4, 1>, RegBlockInt32<4, 1>> {
+  static RegBlockInt32<4, 1> Run(const RegBlockInt32<4, 1>& lhs,
+                                 const RegBlockInt32<4, 1>& rhs) {
+    RegBlockInt32<4, 1> result;
+    result.buf.reg[0] = Mul(lhs.buf.reg[0], rhs.buf.reg[0]);
+    return result;
+  }
+};
+
+// 1x4 := 1x4 * 1x4
+template <>
+struct BroadcastMulImpl<RegBlockInt32<1, 4>, RegBlockInt32<1, 4>> {
+  static RegBlockInt32<1, 4> Run(const RegBlockInt32<1, 4>& lhs,
+                                 const RegBlockInt32<1, 4>& rhs) {
+    RegBlockInt32<1, 4> result;
+    result.buf.reg[0] = Mul(lhs.buf.reg[0], rhs.buf.reg[0]);
+    return result;
+  }
+};
+
+// 1x4 := 1x4 * 1x1
+template <>
+struct BroadcastMulImpl<RegBlockInt32<1, 4>, RegBlockInt32<1, 1>> {
+  static RegBlockInt32<1, 4> Run(const RegBlockInt32<1, 4>& lhs,
+                                 const RegBlockInt32<1, 1>& rhs) {
+    RegBlockInt32<1, 4> result;
+    result.buf.reg[0] = Mul(lhs.buf.reg[0], rhs.buf.reg[0]);
+    return result;
+  }
+};
+
+// 4x4 := 4x4 * 1x4
+template <>
+struct BroadcastMulImpl<RegBlockInt32<4, 4>, RegBlockInt32<1, 4>> {
+  static RegBlockInt32<4, 4> Run(const RegBlockInt32<4, 4>& lhs,
+                                 const RegBlockInt32<1, 4>& rhs) {
+    RegBlockInt32<4, 4> result;
+    const Int32x4 p = rhs.buf.reg[0];
+    result.buf.reg[0] = MulByRhsLane<0>(lhs.buf.reg[0], p);
+    result.buf.reg[1] = MulByRhsLane<1>(lhs.buf.reg[1], p);
+    result.buf.reg[2] = MulByRhsLane<2>(lhs.buf.reg[2], p);
+    result.buf.reg[3] = MulByRhsLane<3>(lhs.buf.reg[3], p);
+    return result;
+  }
+};
+
+// 4x4 := 4x4 * 4x1
+template <>
+struct BroadcastMulImpl<RegBlockInt32<4, 4>, RegBlockInt32<4, 1>> {
+  static RegBlockInt32<4, 4> Run(const RegBlockInt32<4, 4>& lhs,
+                                 const RegBlockInt32<4, 1>& rhs) {
+    RegBlockInt32<4, 4> result;
+    const Int32x4 p = rhs.buf.reg[0];
+    result.buf.reg[0] = Mul(lhs.buf.reg[0], p);
+    result.buf.reg[1] = Mul(lhs.buf.reg[1], p);
+    result.buf.reg[2] = Mul(lhs.buf.reg[2], p);
+    result.buf.reg[3] = Mul(lhs.buf.reg[3], p);
+    return result;
+  }
+};
+
+// 8x1 := 8x1 * 1x1
+template <>
+struct BroadcastMulImpl<RegBlockInt32<8, 1>, RegBlockInt32<1, 1>> {
+  static RegBlockInt32<8, 1> Run(const RegBlockInt32<8, 1>& lhs,
+                                 const RegBlockInt32<1, 1>& rhs) {
+    RegBlockInt32<8, 1> result;
+    const std::int32_t p = rhs.buf.reg[0];
+    for (int i = 0; i < 2; i++) {
+      result.buf.reg[i] = Mul(lhs.buf.reg[i], p);
+    }
+    return result;
+  }
+};
+
+// 8x1 := 8x1 * 8x1
+template <>
+struct BroadcastMulImpl<RegBlockInt32<8, 1>, RegBlockInt32<8, 1>> {
+  static RegBlockInt32<8, 1> Run(const RegBlockInt32<8, 1>& lhs,
+                                 const RegBlockInt32<8, 1>& rhs) {
+    RegBlockInt32<8, 1> result;
+    for (int i = 0; i < 2; i++) {
+      result.buf.reg[i] = Mul(lhs.buf.reg[i], rhs.buf.reg[i]);
+    }
+    return result;
+  }
+};
+
+// 8x4 := 8x4 * 1x4
+template <>
+struct BroadcastMulImpl<RegBlockInt32<8, 4>, RegBlockInt32<1, 4>> {
+  static RegBlockInt32<8, 4> Run(const RegBlockInt32<8, 4>& lhs,
+                                 const RegBlockInt32<1, 4>& rhs) {
+    RegBlockInt32<8, 4> result;
+    const Int32x4 p = rhs.buf.reg[0];
+    for (int i = 0; i < 2; i++) {
+      result.buf.reg[i + 0] = MulByRhsLane<0>(lhs.buf.reg[i + 0], p);
+      result.buf.reg[i + 2] = MulByRhsLane<1>(lhs.buf.reg[i + 2], p);
+      result.buf.reg[i + 4] = MulByRhsLane<2>(lhs.buf.reg[i + 4], p);
+      result.buf.reg[i + 6] = MulByRhsLane<3>(lhs.buf.reg[i + 6], p);
+    }
+    return result;
+  }
+};
+
+// 8x4 := 8x4 * 8x1
+template <>
+struct BroadcastMulImpl<RegBlockInt32<8, 4>, RegBlockInt32<8, 1>> {
+  static RegBlockInt32<8, 4> Run(const RegBlockInt32<8, 4>& lhs,
+                                 const RegBlockInt32<8, 1>& rhs) {
+    RegBlockInt32<8, 4> result;
+    const Int32x4 p[2]{rhs.buf.reg[0], rhs.buf.reg[1]};
+    for (int i = 0; i < 4; i++) {
+      for (int j = 0; j < 2; j++) {
+        const int k = j + 2 * i;
+        result.buf.reg[k] = Mul(lhs.buf.reg[k], p[j]);
+      }
+    }
+    return result;
+  }
+};
+
+// Rx1 += Rx1 * 1x1
+template <int Rows>
+struct BroadcastMulAddImpl<RegBlockInt32<Rows, 1>, RegBlockInt32<1, 1>,
+                           RegBlockInt32<Rows, 1>> {
+  static void Run(const RegBlockInt32<Rows, 1>& lhs,
+                  const RegBlockInt32<1, 1>& rhs, RegBlockInt32<Rows, 1>* acc) {
+    const std::int32_t p = rhs.buf.reg[0];
+    for (int i = 0; i < RegBlockInt32<Rows, 1>::kRegisterCount; i++) {
+      MulAdd(lhs.buf.reg[i], p, &acc->buf.reg[i]);
+    }
+  }
+};
+
+// RxC += Rx1 * 1x1
+template <int Rows, int Cols>
+struct BroadcastMulAddImpl<RegBlockInt32<Rows, 1>, RegBlockInt32<1, 1>,
+                           RegBlockInt32<Rows, Cols>> {
+  static void Run(const RegBlockInt32<Rows, 1>& lhs,
+                  const RegBlockInt32<1, 1>& rhs,
+                  RegBlockInt32<Rows, Cols>* acc) {
+    const std::int32_t p = rhs.buf.reg[0];
+    static constexpr int kRegsPerCol = RegBlockInt32<Rows, 1>::kRegisterCount;
+    for (int i = 0; i < kRegsPerCol; i++) {
+      const Int32x4 q = Mul(lhs.buf.reg[i], p);
+      for (int j = 0; j < Cols; j++) {
+        acc->buf.reg[i + j * kRegsPerCol] =
+            Add(acc->buf.reg[i + j * kRegsPerCol], q);
+      }
+    }
+  }
+};
+
+// 1xC += 1xC * 1x1
+template <int Cols>
+struct BroadcastMulAddImpl<RegBlockInt32<1, Cols>, RegBlockInt32<1, 1>,
+                           RegBlockInt32<1, Cols>> {
+  static void Run(const RegBlockInt32<1, Cols>& lhs,
+                  const RegBlockInt32<1, 1>& rhs, RegBlockInt32<1, Cols>* acc) {
+    const std::int32_t p = rhs.buf.reg[0];
+    for (int i = 0; i < RegBlockInt32<1, Cols>::kRegisterCount; i++) {
+      MulAdd(lhs.buf.reg[i], p, &acc->buf.reg[i]);
+    }
+  }
+};
+
+// RxC += 1x1 * 1x1
+template <int Rows, int Cols>
+struct BroadcastMulAddImpl<RegBlockInt32<1, 1>, RegBlockInt32<1, 1>,
+                           RegBlockInt32<Rows, Cols>> {
+  static void Run(const RegBlockInt32<1, 1>& lhs,
+                  const RegBlockInt32<1, 1>& rhs,
+                  RegBlockInt32<Rows, Cols>* acc) {
+    const Int32x4 p = Dup<Int32x4>(Mul(lhs.buf.reg[0], rhs.buf.reg[0]));
+    for (int i = 0; i < RegBlockInt32<Rows, Cols>::kRegisterCount; i++) {
+      acc->buf.reg[i] = Add(acc->buf.reg[i], p);
+    }
+  }
+};
+
+// 1x1 += 1x1 * 1x1
+template <>
+struct BroadcastMulAddImpl<RegBlockInt32<1, 1>, RegBlockInt32<1, 1>,
+                           RegBlockInt32<1, 1>> {
+  static void Run(const RegBlockInt32<1, 1>& lhs,
+                  const RegBlockInt32<1, 1>& rhs, RegBlockInt32<1, 1>* acc) {
+    MulAdd(lhs.buf.reg[0], rhs.buf.reg[0], &acc->buf.reg[0]);
+  }
+};
+
+// Rx4 += Rx1 * 1x4
+template <int Rows>
+struct BroadcastMulAddImpl<RegBlockInt32<Rows, 1>, RegBlockInt32<1, 4>,
+                           RegBlockInt32<Rows, 4>> {
+  static void Run(const RegBlockInt32<Rows, 1>& lhs,
+                  const RegBlockInt32<1, 4>& rhs, RegBlockInt32<Rows, 4>* acc) {
+    const Int32x4 p = rhs.buf.reg[0];
+    static constexpr int kRegsPerCol = RegBlockInt32<Rows, 1>::kRegisterCount;
+    for (int i = 0; i < kRegsPerCol; i++) {
+      MulAddByRhsLane<0>(lhs.buf.reg[i], p, &acc->buf.reg[i + 0 * kRegsPerCol]);
+      MulAddByRhsLane<1>(lhs.buf.reg[i], p, &acc->buf.reg[i + 1 * kRegsPerCol]);
+      MulAddByRhsLane<2>(lhs.buf.reg[i], p, &acc->buf.reg[i + 2 * kRegsPerCol]);
+      MulAddByRhsLane<3>(lhs.buf.reg[i], p, &acc->buf.reg[i + 3 * kRegsPerCol]);
+    }
+  }
+};
+
+// Rx4 += 1x4 * 1x1
+template <int Rows>
+struct BroadcastMulAddImpl<RegBlockInt32<1, 4>, RegBlockInt32<1, 1>,
+                           RegBlockInt32<Rows, 4>> {
+  static void Run(const RegBlockInt32<1, 4>& lhs,
+                  const RegBlockInt32<1, 1>& rhs, RegBlockInt32<Rows, 4>* acc) {
+    const Int32x4 p = Mul(lhs.buf.reg[0], rhs.buf.reg[0]);
+    Int32x4 q[4];
+    q[0] = DupLane<0>(p);
+    q[1] = DupLane<1>(p);
+    q[2] = DupLane<2>(p);
+    q[3] = DupLane<3>(p);
+    static constexpr int kRegsPerCol = RegBlockInt32<Rows, 1>::kRegisterCount;
+    for (int i = 0; i < kRegsPerCol; i++) {
+      for (int j = 0; j < 4; j++) {
+        acc->buf.reg[i + j * kRegsPerCol] =
+            Add(q[j], acc->buf.reg[i + j * kRegsPerCol]);
+      }
+    }
+  }
+};
+
+// 1xC += 1x1 * 1x1
+template <int Cols>
+struct BroadcastMulAddImpl<RegBlockInt32<1, 1>, RegBlockInt32<1, 1>,
+                           RegBlockInt32<1, Cols>> {
+  static void Run(const RegBlockInt32<1, 1>& lhs,
+                  const RegBlockInt32<1, 1>& rhs, RegBlockInt32<1, Cols>* acc) {
+    const Int32x4 p = Dup<Int32x4>(Mul(lhs.buf.reg[0], rhs.buf.reg[0]));
+    for (int i = 0; i < RegBlockInt32<1, Cols>::kRegisterCount; i++) {
+      acc->buf.reg[i] = Add(acc->buf.reg[i], p);
+    }
+  }
+};
+
+// 1x4 += 1x4 * 1x1
+template <>
+struct BroadcastMulAddImpl<RegBlockInt32<1, 4>, RegBlockInt32<1, 1>,
+                           RegBlockInt32<1, 4>> {
+  static void Run(const RegBlockInt32<1, 4>& lhs,
+                  const RegBlockInt32<1, 1>& rhs, RegBlockInt32<1, 4>* acc) {
+    const std::int32_t p = rhs.buf.reg[0];
+    MulAdd(lhs.buf.reg[0], p, &acc->buf.reg[0]);
+  }
+};
+
+// 4xC += 4x1 * 1x1
+template <int Cols>
+struct BroadcastMulAddImpl<RegBlockInt32<4, 1>, RegBlockInt32<1, 1>,
+                           RegBlockInt32<4, Cols>> {
+  static void Run(const RegBlockInt32<4, 1>& lhs,
+                  const RegBlockInt32<1, 1>& rhs, RegBlockInt32<4, Cols>* acc) {
+    const Int32x4 p = Mul(lhs.buf.reg[0], rhs.buf.reg[0]);
+    for (int i = 0; i < Cols; i++) {
+      acc->buf.reg[i] = Add(p, acc->buf.reg[i]);
+    }
+  }
+};
+
+// 4x1 += 4x1 * 1x1
+template <>
+struct BroadcastMulAddImpl<RegBlockInt32<4, 1>, RegBlockInt32<1, 1>,
+                           RegBlockInt32<4, 1>> {
+  static void Run(const RegBlockInt32<4, 1>& lhs,
+                  const RegBlockInt32<1, 1>& rhs, RegBlockInt32<4, 1>* acc) {
+    const std::int32_t p = rhs.buf.reg[0];
+    MulAdd(lhs.buf.reg[0], p, &acc->buf.reg[0]);
+  }
+};
+
+}  // namespace gemmlowp
+
+#endif  // GEMMLOWP_INTERNAL_SIMD_WRAPPERS_COMMON_NEON_SSE_H_
diff --git a/internal/simd_wrappers_neon.h b/internal/simd_wrappers_neon.h
new file mode 100644
index 0000000..c992b15
--- /dev/null
+++ b/internal/simd_wrappers_neon.h
@@ -0,0 +1,150 @@
+// Copyright 2017 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// simd_wrappers_neon.h: NEON specialization of simd_wrappers.h
+
+#ifndef GEMMLOWP_INTERNAL_SIMD_WRAPPERS_NEON_H_
+#define GEMMLOWP_INTERNAL_SIMD_WRAPPERS_NEON_H_
+
+#include <arm_neon.h>
+
+namespace gemmlowp {
+
+using Int32x4 = int32x4_t;
+using Uint8x8 = uint8x8_t;
+
+template <int ScalarCount>
+struct RegisterType<std::int32_t, ScalarCount> {
+  using Type =
+      typename std::conditional<ScalarCount >= 4, Int32x4, std::int32_t>::type;
+};
+
+template <int ScalarCount>
+struct RegisterType<std::uint8_t, ScalarCount> {
+  using Type = typename std::conditional<
+      ScalarCount >= 8, Uint8x8,
+      typename std::conditional<ScalarCount >= 4, std::uint32_t,
+                                std::uint8_t>::type>::type;
+};
+
+inline Int32x4 LoadInt32x4(const std::int32_t* src) { return vld1q_s32(src); }
+
+inline void StoreInt32x4(std::int32_t* dst, Int32x4 value) {
+  vst1q_s32(dst, value);
+}
+
+template <int Lane>
+std::int32_t GetLane(Int32x4 value) {
+  return vgetq_lane_s32(value, Lane);
+}
+
+template <int Lane>
+Int32x4 DupLane(Int32x4 value) {
+  switch (Lane) {
+    case 0:
+      return vdupq_lane_s32(vget_low_s32(value), 0);
+    case 1:
+      return vdupq_lane_s32(vget_low_s32(value), 1);
+    case 2:
+      return vdupq_lane_s32(vget_high_s32(value), 0);
+    case 3:
+      return vdupq_lane_s32(vget_high_s32(value), 1);
+    default:
+      static_assert(Lane >= 0 && Lane <= 3, "");
+      return vdupq_n_s32(0);
+  }
+}
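+// The switch on the Lane template parameter looks branchy but is not: Lane is
+// a compile-time constant, so each instantiation constant-folds down to a
+// single vdupq_lane_s32. MulByRhsLane and MulAddByRhsLane below use the same
+// pattern, because the underlying NEON intrinsics require a constant lane
+// index.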
+
+inline Int32x4 Mul(Int32x4 a, std::int32_t b) { return vmulq_n_s32(a, b); }
+
+inline Int32x4 Min(Int32x4 a, Int32x4 b) { return vminq_s32(a, b); }
+
+inline Int32x4 Max(Int32x4 a, Int32x4 b) { return vmaxq_s32(a, b); }
+
+inline Int32x4 SaturatingRoundingDoublingHighMul(Int32x4 a, std::int32_t b) {
+  return vqrdmulhq_n_s32(a, b);
+}
+
+template <int Lane>
+Int32x4 MulByRhsLane(Int32x4 a, Int32x4 b) {
+  switch (Lane) {
+    case 0:
+      return vmulq_lane_s32(a, vget_low_s32(b), 0);
+    case 1:
+      return vmulq_lane_s32(a, vget_low_s32(b), 1);
+    case 2:
+      return vmulq_lane_s32(a, vget_high_s32(b), 0);
+    case 3:
+      return vmulq_lane_s32(a, vget_high_s32(b), 1);
+    default:
+      static_assert(Lane >= 0 && Lane <= 3, "");
+      return vdupq_n_s32(0);
+  }
+}
+
+inline void MulAdd(Int32x4 lhs, Int32x4 rhs, Int32x4* acc) {
+  *acc = vmlaq_s32(*acc, lhs, rhs);
+}
+
+inline void MulAdd(Int32x4 lhs, std::int32_t rhs, Int32x4* acc) {
+  *acc = vmlaq_n_s32(*acc, lhs, rhs);
+}
+
+template <int Lane>
+inline void MulAddByRhsLane(Int32x4 lhs, Int32x4 rhs, Int32x4* acc) {
+  switch (Lane) {
+    case 0:
+      *acc = vmlaq_lane_s32(*acc, lhs, vget_low_s32(rhs), 0);
+      break;
+    case 1:
+      *acc = vmlaq_lane_s32(*acc, lhs, vget_low_s32(rhs), 1);
+      break;
+    case 2:
+      *acc = vmlaq_lane_s32(*acc, lhs, vget_high_s32(rhs), 0);
+      break;
+    case 3:
+      *acc = vmlaq_lane_s32(*acc, lhs, vget_high_s32(rhs), 1);
+      break;
+    default:
+      static_assert(Lane >= 0 && Lane <= 3, "");
+  }
+}
+
+template <>
+struct LoadContiguousImpl<RegBlockUint8<8, 8>> {
+  static RegBlockUint8<8, 8> Run(const std::uint8_t* src) {
+    RegBlockUint8<8, 8> result;
+    for (int i = 0; i < 8; i++) {
+      result.buf.reg[i] = vld1_u8(src + 8 * i);
+    }
+    return result;
+  }
+};
+
+template <>
+struct LoadContiguousImpl<RegBlockInt32<8, 8>> {
+  static RegBlockInt32<8, 8> Run(const std::int32_t* src) {
+    RegBlockInt32<8, 8> result;
+    for (int i = 0; i < 16; i++) {
+      result.buf.reg[i] = vld1q_s32(src + 4 * i);
+    }
+    return result;
+  }
+};
+
+}  // end namespace gemmlowp
+
+#include "simd_wrappers_common_neon_sse.h"
+
+#endif  // GEMMLOWP_INTERNAL_SIMD_WRAPPERS_NEON_H_
diff --git a/internal/simd_wrappers_sse.h b/internal/simd_wrappers_sse.h
new file mode 100644
index 0000000..6480b66
--- /dev/null
+++ b/internal/simd_wrappers_sse.h
@@ -0,0 +1,123 @@
+// Copyright 2017 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// simd_wrappers_sse.h: SSE specialization of simd_wrappers.h
+
+#ifndef GEMMLOWP_INTERNAL_SIMD_WRAPPERS_SSE_H_
+#define GEMMLOWP_INTERNAL_SIMD_WRAPPERS_SSE_H_
+
+#include <smmintrin.h>
+
+namespace gemmlowp {
+
+using Int32x4 = __m128i;
+using Uint8x16 = __m128i;
+
+template <int ScalarCount>
+struct RegisterType<std::int32_t, ScalarCount> {
+  using Type =
+      typename std::conditional<ScalarCount >= 4, Int32x4, std::int32_t>::type;
+};
+
+template <int ScalarCount>
+struct RegisterType<std::uint8_t, ScalarCount> {
+  using Type = typename std::conditional<
+      ScalarCount >= 16, Uint8x16,
+      typename std::conditional<ScalarCount >= 4, std::uint32_t,
+                                std::uint8_t>::type>::type;
+};
+
+inline Int32x4 LoadInt32x4(const std::int32_t* src) {
+  return _mm_loadu_si128(reinterpret_cast<const Int32x4*>(src));
+}
+
+inline void StoreInt32x4(std::int32_t* dst, Int32x4 value) {
+  _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), value);
+}
+
+inline Uint8x16 LoadUint8x16(const std::uint8_t* src) {
+  return _mm_loadu_si128(reinterpret_cast<const Uint8x16*>(src));
+}
+
+inline void StoreUint8x16(std::uint8_t* dst, Uint8x16 value) {
+  _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), value);
+}
+
+template <int Lane>
+std::int32_t GetLane(Int32x4 value) {
+  return _mm_extract_epi32(value, Lane);
+}
+
+template <int Lane>
+Int32x4 DupLane(Int32x4 value) {
+  return _mm_shuffle_epi32(value, _MM_SHUFFLE(Lane, Lane, Lane, Lane));
+}
+
+inline Int32x4 Mul(Int32x4 a, std::int32_t b) {
+  return Mul(a, Dup<Int32x4>(b));
+}
+
+inline Int32x4 Min(Int32x4 a, Int32x4 b) { return _mm_min_epi32(a, b); }
+
+inline Int32x4 Max(Int32x4 a, Int32x4 b) { return _mm_max_epi32(a, b); }
+
+inline Int32x4 SaturatingRoundingDoublingHighMul(Int32x4 a, std::int32_t b) {
+  return SaturatingRoundingDoublingHighMul(a, Dup<Int32x4>(b));
+}
+
+template <int Lane>
+Int32x4 MulByRhsLane(Int32x4 a, Int32x4 b) {
+  return Mul(a, DupLane<Lane>(b));
+}
+
+inline void MulAdd(Int32x4 lhs, Int32x4 rhs, Int32x4* acc) {
+  *acc = Add(*acc, Mul(lhs, rhs));
+}
+
+inline void MulAdd(Int32x4 lhs, std::int32_t rhs, Int32x4* acc) {
+  *acc = Add(*acc, Mul(lhs, rhs));
+}
+
+template <int Lane>
+inline void MulAddByRhsLane(Int32x4 lhs, Int32x4 rhs, Int32x4* acc) {
+  *acc = Add(*acc, MulByRhsLane<Lane>(lhs, rhs));
+}
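+// Unlike their NEON counterparts, these are thin compositions: SSE4.1 has no
+// integer multiply-accumulate or multiply-by-lane instructions, so MulAdd and
+// MulAddByRhsLane are expressed with the Add/Mul/DupLane primitives (Add and
+// Mul for Int32x4 are expected to come from the fixedpoint SSE wrappers
+// pulled in through simd_wrappers.h).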
+
+template <>
+struct LoadContiguousImpl<RegBlockUint8<8, 8>> {
+  static RegBlockUint8<8, 8> Run(const std::uint8_t* src) {
+    RegBlockUint8<8, 8> result;
+    for (int i = 0; i < 4; i++) {
+      result.buf.reg[i] = LoadUint8x16(src + 16 * i);
+    }
+    return result;
+  }
+};
+
+template <>
+struct LoadContiguousImpl<RegBlockInt32<8, 8>> {
+  static RegBlockInt32<8, 8> Run(const std::int32_t* src) {
+    RegBlockInt32<8, 8> result;
+    for (int i = 0; i < 16; i++) {
+      result.buf.reg[i] = LoadInt32x4(src + 4 * i);
+    }
+    return result;
+  }
+};
+
+}  // end namespace gemmlowp
+
+#include "simd_wrappers_common_neon_sse.h"
+
+#endif  // GEMMLOWP_INTERNAL_SIMD_WRAPPERS_SSE_H_
diff --git a/internal/single_thread_gemm.h b/internal/single_thread_gemm.h
index f40ba55..3d430c5 100644
--- a/internal/single_thread_gemm.h
+++ b/internal/single_thread_gemm.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -28,20 +28,36 @@
 #include "pack.h"
 #include "unpack.h"
 
+#ifdef GEMMLOWP_PROFILING_SIZES
+#ifndef GEMMLOWP_PROFILING
+#error GEMMLOWP_PROFILING_SIZES without GEMMLOWP_PROFILING
+#endif
+#include <string>
+#include <unordered_map>
+#endif
+
 namespace gemmlowp {
 
 class SingleThreadGemmContext {
  public:
   Allocator* allocator() { return &allocator_; }
 
+  void set_l1_bytes_to_use(int n) { l1_bytes_to_use_ = n; }
+  void set_l2_bytes_to_use(int n) { l2_bytes_to_use_ = n; }
+  void set_l2_rhs_factor(float n) { l2_rhs_factor_ = n; }
+
+  int l1_bytes_to_use() const { return l1_bytes_to_use_; }
+  int l2_bytes_to_use() const { return l2_bytes_to_use_; }
+  float l2_rhs_factor() const { return l2_rhs_factor_; }
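+  // Example (illustrative values only; the right numbers depend on the
+  // target CPU's actual cache sizes):
+  //   SingleThreadGemmContext context;
+  //   context.set_l1_bytes_to_use(32 * 1024);
+  //   context.set_l2_bytes_to_use(512 * 1024);
+  // Otherwise, the defaults below (kDefaultL1CacheSize, ...) are used.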
+
  protected:
   Allocator allocator_;
-};
 
-typedef VectorMap<const int32_t, VectorShape::Col> OffsetColMap;
-typedef VectorMap<const int32_t, VectorShape::Row> OffsetRowMap;
-typedef VectorDup<const int32_t, VectorShape::Col> OffsetColDup;
-typedef VectorDup<const int32_t, VectorShape::Row> OffsetRowDup;
+  // The cache configuration to use.
+  int l1_bytes_to_use_ = kDefaultL1CacheSize;
+  int l2_bytes_to_use_ = kDefaultL2CacheSize;
+  float l2_rhs_factor_ = kDefaultL2RhsFactor;
+};
 
 template <typename KernelFormat, typename InputScalar, typename OutputScalar,
           typename BitDepthParams, MapOrder LhsOrder, MapOrder RhsOrder,
@@ -62,49 +78,75 @@
   int cols = result->cols();
   int depth = lhs.cols();
 
+  // Zero sizes should have been caught earlier and handled by an early return.
   assert(rows > 0);
   assert(cols > 0);
   assert(depth > 0);
 
+  // The case rows < cols should have been caught earlier and handled by transposing.
+  assert(rows >= cols);
+
   Allocator* allocator = context->allocator();
 
   BlockParams block_params;
-  block_params.Init<KernelFormat>(rows, cols, depth, 1);
+  block_params.Init<KernelFormat>(rows, cols, depth, 1,
+                                  context->l1_bytes_to_use(),
+                                  context->l2_bytes_to_use(),
+                                  context->l2_rhs_factor());
 
-  PackedSideBlock<typename KernelFormat::Lhs> packed_lhs(
-      Side::Lhs, allocator, block_params);
-  PackedSideBlock<typename KernelFormat::Rhs> packed_rhs(
-      Side::Rhs, allocator, block_params);
+#ifdef GEMMLOWP_PROFILING_SIZES
+  // Using a static map of label strings. Not reentrant at all!
+  static std::unordered_map<std::uint64_t, std::string> labels_map;
+  std::uint64_t sizes_hash = static_cast<std::uint64_t>(rows) ^
+                             (static_cast<std::uint64_t>(depth) << 16) ^
+                             (static_cast<std::uint64_t>(cols) << 32);
+  if (!labels_map.count(sizes_hash)) {
+    char label[256];
+    snprintf(label, sizeof(label),
+             "(rows = %d, depth = %d, cols = %d, l2_rows = %d, l2_depth = %d, "
+             "l2_cols = %d, l1_rows = %d, l1_depth = %d, l1_cols = %d)",
+             rows, depth, cols, block_params.l2_rows, block_params.l2_depth,
+             block_params.l2_cols, block_params.l1_rows, block_params.l1_depth,
+             block_params.l1_cols);
+    labels_map[sizes_hash] = label;
+  }
+  ScopedProfilingLabel size_label(labels_map[sizes_hash].c_str());
+#endif
+
+  PackedSideBlock<typename KernelFormat::Lhs> packed_lhs(Side::Lhs, allocator,
+                                                         block_params);
+  PackedSideBlock<typename KernelFormat::Rhs> packed_rhs(Side::Rhs, allocator,
+                                                         block_params);
 
   PackedResult packed_result(allocator, block_params);
 
   allocator->Commit();
 
-  const bool pack_rhs_once = block_params.l2_cols == cols;
+  const bool pack_rhs_once = block_params.l2_cols >= cols;
 
   if (pack_rhs_once) {
-    PackRhs<BitDepthParams>(&packed_rhs, rhs);
+    PackRhs(&packed_rhs, rhs);
   }
 
   for (int r = 0; r < rows; r += block_params.l2_rows) {
     int rs = std::min(block_params.l2_rows, rows - r);
 
-    PackLhs<BitDepthParams>(&packed_lhs, lhs.block(r, 0, rs, depth));
+    PackLhs(&packed_lhs, lhs.block(r, 0, rs, depth));
 
     for (int c = 0; c < cols; c += block_params.l2_cols) {
       int cs = std::min(block_params.l2_cols, cols - c);
 
       if (!pack_rhs_once) {
-        PackRhs<BitDepthParams>(&packed_rhs, rhs.block(0, c, depth, cs));
+        PackRhs(&packed_rhs, rhs.block(0, c, depth, cs));
       }
 
-      Compute(kernel, block_params, &packed_result, packed_lhs, packed_rhs);
+      Compute(kernel, block_params, &packed_result, packed_lhs, packed_rhs,
+              depth);
 
-      auto result_block = result->block(r, c, rs, cs);
-      UnpackResult<BitDepthParams>(&result_block, packed_result, depth,
-                                   packed_lhs.sums_of_each_slice(),
-                                   packed_rhs.sums_of_each_slice(),
-                                   lhs_offset, rhs_offset, output_pipeline);
+      UnpackResult<KernelFormat>(
+          result, MatrixBlockBounds(r, c, rs, cs), packed_result, depth,
+          packed_lhs.sums_of_each_slice(), packed_rhs.sums_of_each_slice(),
+          lhs_offset.block(r, rs), rhs_offset.block(c, cs), output_pipeline);
     }
   }
 
diff --git a/internal/unpack.h b/internal/unpack.h
index e25372a..33aee13 100644
--- a/internal/unpack.h
+++ b/internal/unpack.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -55,110 +55,224 @@
   const BlockParams& block_params_;
 };
 
-template <std::uint32_t numerator, std::uint32_t denominator>
-std::int32_t RoundingMultiplyByConstantFraction(std::int32_t x) {
-  if (numerator == denominator) {
-    return x;
+struct MatrixBlockBounds {
+  int start_row;
+  int start_col;
+  int rows;
+  int cols;
+
+  MatrixBlockBounds(int start_row_, int start_col_, int rows_, int cols_)
+      : start_row(start_row_),
+        start_col(start_col_),
+        rows(rows_),
+        cols(cols_) {}
+};
+
+template <int Rows, int Cols, typename SrcMapType>
+void PrefetchResultBlock(const SrcMapType& src,
+                         const VectorMap<const std::int32_t, VectorShape::Col>&
+                             lhs_sums_of_each_slice,
+                         int src_row, int src_col) {
+  const std::int32_t* src_data = src.data(src_row, src_col);
+  const int src_stride = src.stride();
+  const std::int32_t* lhs_sums_data = lhs_sums_of_each_slice.data(src_row);
+  for (int r = 0; r < Rows; r += 4) {
+    Prefetch(lhs_sums_data + r);
   }
-
-  // We'll use only signed arithmetic here. This is
-  // simpler (since this function operates on signed int32's) and
-  // more friendly to ARM NEON, where this allows us to use the
-  // VQRDMULH instruction.
-  static const std::int32_t int_quotient =
-      (numerator + denominator / 2) / denominator;
-  static const std::int32_t remaining_numerator =
-      numerator - int_quotient * denominator;
-  static const std::int32_t scaled_remaining_numerator =
-      static_cast<std::int32_t>(
-          (static_cast<std::int64_t>(remaining_numerator) * (1ll << 31)) /
-          denominator);
-
-  const std::int64_t scaled_remaining_product =
-      static_cast<std::int64_t>(x) *
-      static_cast<std::int64_t>(scaled_remaining_numerator);
-
-  const std::int32_t scaled_remaining_product_nudge =
-      (scaled_remaining_product > 0 ? 1 : -1) * (1 << 30);
-
-  const std::int32_t remaining_product = static_cast<std::int32_t>(
-      (scaled_remaining_product + scaled_remaining_product_nudge) / (1u << 31));
-
-  return x * int_quotient + remaining_product;
+  for (int c = 0; c < Cols; c++) {
+    for (int r = 0; r < Rows; r += 4) {
+      Prefetch(src_data + r + c * src_stride);
+    }
+  }
 }
 
-template <typename BitDepthParams, typename ResultBlockType,
+template <typename KernelFormat, typename RegisterBlockType,
+          typename SrcMapType, typename LhsOffset, typename RhsOffset,
+          typename OutputPipelineExecutorType, typename DstType>
+void UnpackResultBlock(const SrcMapType& src,
+                       const OutputPipelineExecutorType& executor, DstType* dst,
+                       const VectorMap<const std::int32_t, VectorShape::Col>&
+                           lhs_sums_of_each_slice,
+                       const VectorMap<const std::int32_t, VectorShape::Row>&
+                           rhs_sums_of_each_slice,
+                       const LhsOffset& lhs_offset, const RhsOffset& rhs_offset,
+                       int depth, int src_row, int src_col, int src_global_row,
+                       int src_global_col, int dst_row, int dst_col) {
+  using KernelLhsScalar = typename KernelFormat::Lhs::Scalar;
+  using KernelRhsScalar = typename KernelFormat::Rhs::Scalar;
+  static constexpr int KernelLhsZeroPointInput =
+      ZeroPointInputValue<KernelLhsScalar>::kValue;
+  static constexpr int KernelRhsZeroPointInput =
+      ZeroPointInputValue<KernelRhsScalar>::kValue;
+  auto acc = Load<RegisterBlockType>(src, src_row, src_col);
+  const auto& lhs_sums_of_each_slice_block =
+      LoadForBroadcasting<RegisterBlockType>(lhs_sums_of_each_slice, src_row);
+  const auto& rhs_sums_of_each_slice_block =
+      LoadForBroadcasting<RegisterBlockType>(rhs_sums_of_each_slice, src_col);
+  auto lhs_offset_block =
+      LoadForBroadcasting<RegisterBlockType>(lhs_offset, src_row);
+  auto rhs_offset_block =
+      LoadForBroadcasting<RegisterBlockType>(rhs_offset, src_col);
+  AddConstant<KernelLhsZeroPointInput>(&lhs_offset_block);
+  AddConstant<KernelRhsZeroPointInput>(&rhs_offset_block);
+  BroadcastMulAdd(lhs_sums_of_each_slice_block, rhs_offset_block, &acc);
+  for (int i = 0; i < decltype(rhs_offset_block)::kRegisterCount; i++) {
+    rhs_offset_block.buf.reg[i] = Mul(rhs_offset_block.buf.reg[i], depth);
+  }
+  BroadcastMulAdd(BroadcastAdd(rhs_sums_of_each_slice_block, rhs_offset_block),
+                  lhs_offset_block, &acc);
+  executor.Execute(acc, dst, src_global_row, src_global_col, dst_row, dst_col);
+}
+
+template <typename KernelFormat, typename ResultBlockType,
           typename PackedResultType, typename LhsOffset, typename RhsOffset,
           typename OutputPipelineType>
-struct UnpackResultImplGeneric {
-  static void Unpack(ResultBlockType* dst, const PackedResultType& src,
-                     int depth, const std::int32_t* lhs_sums_of_each_slice,
-                     const std::int32_t* rhs_sums_of_each_slice,
-                     const LhsOffset& lhs_offset, const RhsOffset& rhs_offset,
-                     const OutputPipelineType& output_pipeline) {
-    auto src_map = src.Map();
-    // No top-level blocking in the depth dimension at the moment.
-    // Too much loss of precision.
-    const int kLhsBits = BitDepthParams::LhsBitDepth::kBits;
-    const int kRhsBits = BitDepthParams::RhsBitDepth::kBits;
-    const std::int32_t kLhsMax = (1 << kLhsBits) - 1;
-    const std::int32_t kRhsMax = (1 << kRhsBits) - 1;
-    OutputPipelineExecutor<OutputPipelineType, FragmentInt32x1x1>
-        output_pipeline_executor(output_pipeline);
-    for (int c = 0; c < dst->cols(); c++) {
-      for (int r = 0; r < dst->rows(); r++) {
-        // To understand this code, read
-        //   doc/low-precision.txt
-        //   doc/less-than-8-bit.txt
-        // We have 4 terms to sum: xx, x1, 1x, 11.
-        // In case of requantization, we first need to scale them back
-        // to the original scale, using RoundingMultiplyByConstantFraction.
-        std::int32_t raw_xx = src_map(r, c);
-        std::int32_t raw_x1 = lhs_sums_of_each_slice[r] * rhs_offset(c);
-        std::int32_t raw_1x = rhs_sums_of_each_slice[c] * lhs_offset(r);
-        std::int32_t term_xx =
-            RoundingMultiplyByConstantFraction<255 * 255, kLhsMax * kRhsMax>(
-                raw_xx);
-        std::int32_t term_x1 =
-            RoundingMultiplyByConstantFraction<255, kLhsMax>(raw_x1);
-        std::int32_t term_1x =
-            RoundingMultiplyByConstantFraction<255, kRhsMax>(raw_1x);
-        std::int32_t term_11 = lhs_offset(r) * rhs_offset(c) * depth;
-        // Sum the 4 terms.
-        FragmentInt32x1x1 sum = term_xx + term_x1 + term_1x + term_11;
+void UnpackResult(ResultBlockType* dst, const MatrixBlockBounds& dst_block,
+                  const PackedResultType& src, int depth,
+                  const std::int32_t* lhs_sums_of_each_slice_ptr,
+                  const std::int32_t* rhs_sums_of_each_slice_ptr,
+                  const LhsOffset& lhs_offset, const RhsOffset& rhs_offset,
+                  const OutputPipelineType& output_pipeline) {
+  ScopedProfilingLabel label(ResultBlockType::kOrder == MapOrder::ColMajor
+                                 ? "unpack to column-major"
+                                 : "unpack to row-major");
+  assert(dst_block.start_row >= 0);
+  assert(dst_block.start_row + dst_block.rows <= dst->rows());
+  assert(dst_block.start_col >= 0);
+  assert(dst_block.start_col + dst_block.cols <= dst->cols());
+  const auto src_map = src.Map();
+  const VectorMap<const std::int32_t, VectorShape::Col> lhs_sums_of_each_slice(
+      lhs_sums_of_each_slice_ptr, dst_block.rows);
+  const VectorMap<const std::int32_t, VectorShape::Row> rhs_sums_of_each_slice(
+      rhs_sums_of_each_slice_ptr, dst_block.cols);
+  using Int32x1x1 = RegisterBlock<std::int32_t, 1, 1>;
+  using Int32x4x1 = RegisterBlock<std::int32_t, 4, 1>;
+  using Int32x8x1 = RegisterBlock<std::int32_t, 8, 1>;
+  using Int32x1x4 = RegisterBlock<std::int32_t, 1, 4>;
+  using Int32x4x4 = RegisterBlock<std::int32_t, 4, 4>;
+  using Int32x8x4 = RegisterBlock<std::int32_t, 8, 4>;
 
-        output_pipeline_executor.Execute(sum, dst, r, c);
+  using DstScalarType = typename ResultBlockType::Scalar;
+  using DstScalarx8x8 = RegisterBlock<DstScalarType, 8, 8>;
+
+  OutputPipelineExecutor<OutputPipelineType, Int32x1x1>
+      output_pipeline_executor_1x1(output_pipeline);
+  OutputPipelineExecutor<OutputPipelineType, Int32x4x1>
+      output_pipeline_executor_4x1(output_pipeline);
+  OutputPipelineExecutor<OutputPipelineType, Int32x8x1>
+      output_pipeline_executor_8x1(output_pipeline);
+  OutputPipelineExecutor<OutputPipelineType, Int32x1x4>
+      output_pipeline_executor_1x4(output_pipeline);
+  OutputPipelineExecutor<OutputPipelineType, Int32x4x4>
+      output_pipeline_executor_4x4(output_pipeline);
+  OutputPipelineExecutor<OutputPipelineType, Int32x8x4>
+      output_pipeline_executor_8x4(output_pipeline);
+
+  int c8 = 0;
+  if (ResultBlockType::kOrder == MapOrder::RowMajor) {
+    for (; c8 <= dst_block.cols - 8; c8 += 8) {
+      PrefetchResultBlock<8, 8>(src_map, lhs_sums_of_each_slice, 0, c8);
+      int r = 0;
+      for (; r <= dst_block.rows - 8; r += 8) {
+        const int global_row = r + dst_block.start_row;
+        PrefetchResultBlock<8, 8>(src_map, lhs_sums_of_each_slice, r + 8, c8);
+        DstScalarType dst_colmajor_buf[64];
+        MatrixMap<DstScalarType, MapOrder::ColMajor> dst_colmajor_map(
+            dst_colmajor_buf, 8, 8);
+        for (int cx = 0; cx < 8; cx += 4) {
+          const int c = c8 + cx;
+          const int global_col = c + dst_block.start_col;
+          UnpackResultBlock<KernelFormat, Int32x8x4>(
+              src_map, output_pipeline_executor_8x4, &dst_colmajor_map,
+              lhs_sums_of_each_slice, rhs_sums_of_each_slice, lhs_offset,
+              rhs_offset, depth, r, c, global_row, global_col, 0, cx);
+        }
+        StoreFinalOutput(LoadContiguous<DstScalarx8x8>(dst_colmajor_buf), dst,
+                         r + dst_block.start_row, c8 + dst_block.start_col);
+      }
+      for (; r <= dst_block.rows - 4; r += 4) {
+        const int global_row = r + dst_block.start_row;
+        for (int cx = 0; cx < 8; cx += 4) {
+          const int c = c8 + cx;
+          const int global_col = c + dst_block.start_col;
+          UnpackResultBlock<KernelFormat, Int32x4x4>(
+              src_map, output_pipeline_executor_4x4, dst,
+              lhs_sums_of_each_slice, rhs_sums_of_each_slice, lhs_offset,
+              rhs_offset, depth, r, c, global_row, global_col, global_row,
+              global_col);
+        }
+      }
+      for (; r < dst_block.rows; r++) {
+        const int global_row = r + dst_block.start_row;
+        for (int cx = 0; cx < 8; cx += 4) {
+          const int c = c8 + cx;
+          const int global_col = c + dst_block.start_col;
+          UnpackResultBlock<KernelFormat, Int32x1x4>(
+              src_map, output_pipeline_executor_1x4, dst,
+              lhs_sums_of_each_slice, rhs_sums_of_each_slice, lhs_offset,
+              rhs_offset, depth, r, c, global_row, global_col, global_row,
+              global_col);
+        }
       }
     }
   }
-};
-
-template <typename BitDepthParams, typename ResultBlockType,
-          typename PackedResultType, typename LhsOffset, typename RhsOffset,
-          typename OutputPipelineType>
-struct UnpackResultImpl
-    : UnpackResultImplGeneric<BitDepthParams, ResultBlockType, PackedResultType,
-                              LhsOffset, RhsOffset, OutputPipelineType> {};
-
-template <typename BitDepthParams, typename ResultBlockType,
-          typename PackedResultType, typename LhsOffset, typename RhsOffset,
-          typename OutputPipelineType>
-void UnpackResult(ResultBlockType* dst, const PackedResultType& src, int depth,
-                  const std::int32_t* lhs_sums_of_each_slice,
-                  const std::int32_t* rhs_sums_of_each_slice,
-                  const LhsOffset& lhs_offset, const RhsOffset& rhs_offset,
-                  const OutputPipelineType& output_pipeline) {
-  ScopedProfilingLabel label("unpack");
-  UnpackResultImpl<BitDepthParams, ResultBlockType, PackedResultType,
-                   LhsOffset, RhsOffset, OutputPipelineType>::Unpack(
-      dst, src, depth, lhs_sums_of_each_slice, rhs_sums_of_each_slice,
-      lhs_offset, rhs_offset, output_pipeline);
+  int c = c8;
+  for (; c <= dst_block.cols - 4; c += 4) {
+    const int global_col = c + dst_block.start_col;
+    PrefetchResultBlock<8, 4>(src_map, lhs_sums_of_each_slice, 0, c);
+    int r = 0;
+    for (; r <= dst_block.rows - 8; r += 8) {
+      const int global_row = r + dst_block.start_row;
+      PrefetchResultBlock<8, 4>(src_map, lhs_sums_of_each_slice, r + 8, c);
+      UnpackResultBlock<KernelFormat, Int32x8x4>(
+          src_map, output_pipeline_executor_8x4, dst, lhs_sums_of_each_slice,
+          rhs_sums_of_each_slice, lhs_offset, rhs_offset, depth, r, c,
+          global_row, global_col, global_row, global_col);
+    }
+    for (; r <= dst_block.rows - 4; r += 4) {
+      const int global_row = r + dst_block.start_row;
+      UnpackResultBlock<KernelFormat, Int32x4x4>(
+          src_map, output_pipeline_executor_4x4, dst, lhs_sums_of_each_slice,
+          rhs_sums_of_each_slice, lhs_offset, rhs_offset, depth, r, c,
+          global_row, global_col, global_row, global_col);
+    }
+    for (; r < dst_block.rows; r++) {
+      const int global_row = r + dst_block.start_row;
+      UnpackResultBlock<KernelFormat, Int32x1x4>(
+          src_map, output_pipeline_executor_1x4, dst, lhs_sums_of_each_slice,
+          rhs_sums_of_each_slice, lhs_offset, rhs_offset, depth, r, c,
+          global_row, global_col, global_row, global_col);
+    }
+  }
+  for (; c < dst_block.cols; c++) {
+    const int global_col = c + dst_block.start_col;
+    PrefetchResultBlock<8, 1>(src_map, lhs_sums_of_each_slice, 0, c);
+    int r = 0;
+    for (; r <= dst_block.rows - 8; r += 8) {
+      const int global_row = r + dst_block.start_row;
+      PrefetchResultBlock<8, 1>(src_map, lhs_sums_of_each_slice, r + 8, c);
+      UnpackResultBlock<KernelFormat, Int32x8x1>(
+          src_map, output_pipeline_executor_8x1, dst, lhs_sums_of_each_slice,
+          rhs_sums_of_each_slice, lhs_offset, rhs_offset, depth, r, c,
+          global_row, global_col, global_row, global_col);
+    }
+    for (; r <= dst_block.rows - 4; r += 4) {
+      const int global_row = r + dst_block.start_row;
+      UnpackResultBlock<KernelFormat, Int32x4x1>(
+          src_map, output_pipeline_executor_4x1, dst, lhs_sums_of_each_slice,
+          rhs_sums_of_each_slice, lhs_offset, rhs_offset, depth, r, c,
+          global_row, global_col, global_row, global_col);
+    }
+    for (; r < dst_block.rows; r++) {
+      const int global_row = r + dst_block.start_row;
+      UnpackResultBlock<KernelFormat, Int32x1x1>(
+          src_map, output_pipeline_executor_1x1, dst, lhs_sums_of_each_slice,
+          rhs_sums_of_each_slice, lhs_offset, rhs_offset, depth, r, c,
+          global_row, global_col, global_row, global_col);
+    }
+  }
 }
 
-}  // namespace gemmlowp
-
-#ifdef GEMMLOWP_NEON
-#include "unpack_neon.h"
-#endif
+}  // end namespace gemmlowp
 
 #endif  // GEMMLOWP_INTERNAL_UNPACK_H_
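
Note on the new unpack path: UnpackResult/UnpackResultBlock above apply, per register block, the same four-term offset correction that the removed generic loop computed one entry at a time. A minimal standalone sketch of that arithmetic (names such as `CorrectedEntry` are illustrative, not from the library):

```
#include <cstdint>
#include <iostream>

// A scalar model of the four-term offset correction applied per entry:
// the raw int32 accumulator holds only sum_d lhs(r,d) * rhs(d,c), and the
// lhs/rhs offsets are folded back in afterwards.
std::int32_t CorrectedEntry(std::int32_t raw_xx, std::int32_t lhs_sum_of_row,
                            std::int32_t rhs_sum_of_col,
                            std::int32_t lhs_offset, std::int32_t rhs_offset,
                            int depth) {
  return raw_xx + lhs_sum_of_row * rhs_offset + rhs_sum_of_col * lhs_offset +
         lhs_offset * rhs_offset * depth;
}

int main() {
  const int depth = 3;
  const std::uint8_t lhs_row[depth] = {10, 20, 30};
  const std::uint8_t rhs_col[depth] = {5, 6, 7};
  const std::int32_t lhs_offset = -128;
  const std::int32_t rhs_offset = -100;

  std::int32_t raw_xx = 0, lhs_sum = 0, rhs_sum = 0, reference = 0;
  for (int d = 0; d < depth; ++d) {
    raw_xx += lhs_row[d] * rhs_col[d];
    lhs_sum += lhs_row[d];
    rhs_sum += rhs_col[d];
    // What we actually want: the product of the offset-shifted operands.
    reference += (lhs_row[d] + lhs_offset) * (rhs_col[d] + rhs_offset);
  }
  // Both print the same value (30476 for these inputs).
  std::cout << CorrectedEntry(raw_xx, lhs_sum, rhs_sum, lhs_offset, rhs_offset,
                              depth)
            << " == " << reference << std::endl;
  return 0;
}
```

The identity is just the expansion of sum_d (lhs(r,d) + lhs_offset(r)) * (rhs(d,c) + rhs_offset(c)); the new code evaluates it per SIMD register block, with the ZeroPointInputValue constants additionally folded into the offset vectors via the AddConstant calls shown above.
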
diff --git a/internal/unpack_neon.h b/internal/unpack_neon.h
index 394f10a..5c9e76a 100644
--- a/internal/unpack_neon.h
+++ b/internal/unpack_neon.h
@@ -73,12 +73,17 @@
                         PackedResultType, LhsOffset, RhsOffset,
                         OutputPipelineType> {
   typedef MatrixMap<OutputScalar, MapOrder::ColMajor> ResultBlockType;
-  static void Unpack(ResultBlockType* dst, const PackedResultType& src,
-                     int depth, const std::int32_t* lhs_sums_of_each_slice,
+  static void Unpack(ResultBlockType* dst, const MatrixBlockBounds& dst_block,
+                     const PackedResultType& src, int depth,
+                     const std::int32_t* lhs_sums_of_each_slice,
                      const std::int32_t* rhs_sums_of_each_slice,
                      const LhsOffset& lhs_offset, const RhsOffset& rhs_offset,
                      const OutputPipelineType& output_pipeline) {
     ScopedProfilingLabel label("optimized path (NEON)");
+    assert(dst_block.start_row >= 0);
+    assert(dst_block.start_row + dst_block.rows <= dst->rows());
+    assert(dst_block.start_col >= 0);
+    assert(dst_block.start_col + dst_block.cols <= dst->cols());
     const int kLhsBits = BitDepthParams::LhsBitDepth::kBits;
     const int kRhsBits = BitDepthParams::RhsBitDepth::kBits;
     const std::int32_t kLhsMax = (1 << kLhsBits) - 1;
@@ -91,16 +96,18 @@
     OutputPipelineExecutor<OutputPipelineType, NEONFragmentInt32x16x1>
         output_pipeline_executor_int32x16x1(output_pipeline);
 
-    for (int c = 0; c < dst->cols(); c++) {
+    for (int c = 0; c < dst_block.cols; c++) {
+      int c_dst = c + dst_block.start_col;
       const std::int32_t* src_ptr = src_map.data(0, c);
       const std::int32_t* sums_of_each_slice_ptr = lhs_sums_of_each_slice;
-      auto lhs_offset_iter = const_iterator(lhs_offset);
-      const std::int32_t rhs_offset_c = rhs_offset(c);
+      auto lhs_offset_iter = const_iterator(lhs_offset, dst_block.start_row);
+      const std::int32_t rhs_offset_c = rhs_offset(c_dst);
       const std::int32_t rhs_sums_of_each_slice_c = rhs_sums_of_each_slice[c];
 
       // Handle 16 values at once for higher performance
-      int dst_rows_aligned16 = RoundDown<16>(dst->rows());
+      int dst_rows_aligned16 = RoundDown<16>(dst_block.rows);
       for (int r = 0; r < dst_rows_aligned16; r += 16) {
+        int r_dst = r + dst_block.start_row;
         // Compute the sum of the 4 terms,
         //   q = term_xx + term_x1 + term_1x_plus_term_11
         // Refer to the generic code in unpack.h.
@@ -144,12 +151,13 @@
                                vaddq_s32(term_1x[i], term_11[i]));
         }
         NEONFragmentInt32x16x1 f(q);
-        output_pipeline_executor_int32x16x1.Execute(f, dst, r, c);
+        output_pipeline_executor_int32x16x1.Execute(f, dst, r_dst, c_dst);
       }
       // We have finished handling groups of 16 entries at once; now
       // try to handle 4 entries at once.
-      int dst_rows_aligned4 = RoundDown<4>(dst->rows());
+      int dst_rows_aligned4 = RoundDown<4>(dst_block.rows);
       for (int r = dst_rows_aligned16; r < dst_rows_aligned4; r += 4) {
+        int r_dst = r + dst_block.start_row;
         // Compute the sum of the 4 terms,
         //   q = term_xx + term_x1 + term_1x_plus_term_11
         // Refer to the generic code in unpack.h.
@@ -173,15 +181,17 @@
         int32x4_t q = vaddq_s32(vaddq_s32(term_xx, term_x1),
                                 vaddq_s32(term_1x, term_11));
         NEONFragmentInt32x4x1 f(q);
-        output_pipeline_executor_int32x4x1.Execute(f, dst, r, c);
+        output_pipeline_executor_int32x4x1.Execute(f, dst, r_dst, c_dst);
       }
       // We have finished handling 4 entries at once; now handle
       // remaining entries one by one. This scalar code is similar
       // to the code in unpack.h, see comments there.
-      for (int r = dst_rows_aligned4; r < dst->rows(); r++) {
+      for (int r = dst_rows_aligned4; r < dst_block.rows; r++) {
+        int r_dst = r + dst_block.start_row;
         const std::int32_t raw_xx = src_map(r, c);
         const std::int32_t raw_x1 = lhs_sums_of_each_slice[r] * rhs_offset_c;
-        const std::int32_t raw_1x = rhs_sums_of_each_slice_c * lhs_offset(r);
+        const std::int32_t raw_1x =
+            rhs_sums_of_each_slice_c * lhs_offset(r_dst);
         const std::int32_t term_xx =
             RoundingMultiplyByConstantFraction<255 * 255, kLhsMax * kRhsMax>(
                 raw_xx);
@@ -191,7 +201,7 @@
             RoundingMultiplyByConstantFraction<255, kRhsMax>(raw_1x);
         const std::int32_t term_11 = lhs_offset(r) * rhs_offset(c) * depth;
         FragmentInt32x1x1 sum = term_xx + term_x1 + term_1x + term_11;
-        output_pipeline_executor_int32x1x1.Execute(sum, dst, r, c);
+        output_pipeline_executor_int32x1x1.Execute(sum, dst, r_dst, c_dst);
       }
     }
   }
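
The NEON unpack path above still calls RoundingMultiplyByConstantFraction, whose generic definition is removed from unpack.h in this change. As a reading aid only, a rough scalar equivalent (hypothetical helper, not the library's implementation) of what that rescaling computes:

```
#include <cstdint>
#include <iostream>

// Approximately what RoundingMultiplyByConstantFraction<num, den>(x) yields:
// x rescaled by num/den with round-to-nearest. The removed generic version
// split this into an integer quotient plus a Q31 fixed-point remainder so it
// could map onto VQRDMULH; the end result is the same rounding rescale up to
// fixed-point error.
std::int32_t RoundedRescale(std::int32_t x, std::int64_t numerator,
                            std::int64_t denominator) {
  const std::int64_t product = static_cast<std::int64_t>(x) * numerator;
  const std::int64_t nudge = (product >= 0 ? denominator : -denominator) / 2;
  return static_cast<std::int32_t>((product + nudge) / denominator);
}

int main() {
  // E.g. rescaling an accumulator of 7-bit x 7-bit products back to the full
  // 8-bit x 8-bit scale: numerator = 255 * 255, denominator = 127 * 127.
  std::cout << RoundedRescale(1000, 255 * 255, 127 * 127) << std::endl;  // 4032
  return 0;
}
```
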
diff --git a/meta/base.h b/meta/base.h
new file mode 100644
index 0000000..4eeb88d
--- /dev/null
+++ b/meta/base.h
@@ -0,0 +1,145 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_BASE_H_
+#define GEMMLOWP_META_BASE_H_
+
+#include <cassert>
+#include <cstdint>
+
+#include "../internal/common.h"
+
+namespace gemmlowp {
+namespace meta {
+
+template <int align>
+inline int AlignTo(int value) {
+  return ((value + align - 1) / align) * align;
+}
+
+inline int AlignTo(int align, int value) {
+  return ((value + align - 1) / align) * align;
+}
+
+template <typename Kernel_, typename OutputStream_>
+struct FusedKernelParams {
+ public:
+  typedef Kernel_ Kernel;
+  typedef OutputStream_ OutputStream;
+
+  Kernel kernel;
+  OutputStream output_stream;
+};
+
+template <typename InType_, typename OutType_, typename LeftStream_,
+          typename RightStream_, typename Kernel_, typename OutputStream_>
+struct GemmParams {
+ public:
+  typedef InType_ InType;
+  typedef OutType_ OutType;
+  typedef LeftStream_ LeftStream;
+  typedef RightStream_ RightStream;
+  typedef Kernel_ Kernel;
+  typedef OutputStream_ OutputStream;
+
+  typedef FusedKernelParams<Kernel, OutputStream> FusedKernel;
+
+  // Common parameters.
+
+  int m;
+  int n;
+  int k;
+
+  const InType* lhs;
+  const InType* rhs;
+  OutType* result;
+  std::uint8_t* scratch;
+
+  // Specialized parameters.
+
+  LeftStream left_stream;
+  RightStream right_stream;
+  FusedKernel fused_kernel;
+};
+
+template <typename InType, int lanes_count, int pack_size, int leftovers,
+          typename StreamParams>
+class Stream {
+ public:
+  static void Pack(const InType* in, const StreamParams& params, InType* out);
+
+  static int UnpackedAdvance(const StreamParams& params);
+
+  static int PackedAdvance(const StreamParams& params);
+
+  static int UnpackedStride(const StreamParams& params);
+
+  static int PackedStride(const StreamParams& params);
+};
+
+template <typename InType, typename StreamType>
+class StreamUtil {
+ public:
+  static const InType* Offset(const StreamType& params, const InType* source,
+                              int offset_stride, int offset_advance);
+
+  static int Scratch(const StreamType& params, int lanes);
+};
+
+template <typename InType, typename OutType, typename Kernel,
+          typename OutputStream, int kernel_m, int kernel_n, int pack_size>
+class MulKernel {
+ public:
+  static void Multiply(const InType* lhs, const InType* rhs,
+                       const FusedKernelParams<Kernel, OutputStream>& params,
+                       OutType* result);
+};
+
+template <typename InType_, typename OutType_, typename Kernel_>
+struct Transform1DParams {
+  typedef InType_ InType;
+  typedef OutType_ OutType;
+  typedef Kernel_ Kernel;
+
+  const InType* input;
+  OutType* output;
+  std::uint8_t* scratch;
+
+  Kernel kernel;
+};
+
+template <typename InType, typename OutType, typename Kernel, int kernel_size,
+          int leftovers>
+class Transform1DKernel {
+ public:
+  static void Transform(const InType* input, const Kernel& params,
+                        OutType* output);
+};
+
+template <typename InType, typename OutType, typename Transform>
+class Transform1DUtil {
+ public:
+  static int EstimateComputeCost(const Transform& params);
+
+  static const InType* OffsetInput(const Transform& params, const InType* input,
+                                   int offset);
+
+  static OutType* OffsetOutput(const Transform& params, OutType* output,
+                               int offset);
+};
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#endif  // GEMMLOWP_META_BASE_H_
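
A small usage sketch of the AlignTo helpers defined above (the template is duplicated here only so the snippet compiles on its own); they round a value up to the next multiple of the requested alignment, which is presumably what the padded-stride computations elsewhere in meta/ rely on:

```
#include <cassert>

// Duplicate of the templated AlignTo from meta/base.h, repeated here only so
// this usage sketch compiles on its own.
template <int align>
inline int AlignTo(int value) {
  return ((value + align - 1) / align) * align;
}

int main() {
  assert(AlignTo<8>(13) == 16);  // rounds up to the next multiple of 8
  assert(AlignTo<8>(16) == 16);  // exact multiples are unchanged
  assert(AlignTo<4>(1) == 4);
  return 0;
}
```
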
diff --git a/meta/generators/cc_emitter.py b/meta/generators/cc_emitter.py
index cbb5fdb..8615671 100644
--- a/meta/generators/cc_emitter.py
+++ b/meta/generators/cc_emitter.py
@@ -1,3 +1,16 @@
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """CC code emitter.
 
 Used by generators to programmatically prepare C++ code. Contains some simple
@@ -18,6 +31,10 @@
   """Invalid cc header structure."""
 
 
+class ClassError(Error):
+  """Invalid class syntax."""
+
+
 class CCEmitter(object):
   """Emits c++ code."""
 
@@ -25,6 +42,7 @@
     self.indent = ''
     self.debug = debug
     self.namespaces = []
+    self.classes = []
     self.header_name = None
 
   def PushIndent(self):
@@ -57,7 +75,9 @@
   def EmitBinaryOp(self, operand_1, op, operand_2):
     self.EmitCode('%s %s %s' % (operand_1, op, operand_2))
 
-  def EmitCall(self, function, params=[]):
+  def EmitCall(self, function, params=None):
+    if not params:
+      params = []
     self.EmitCode('%s(%s)' % (function, ', '.join(map(str, params))))
 
   def EmitCode(self, code):
@@ -94,7 +114,22 @@
                            ' // %s' % (self.header_name + '_H_').upper())
     self.header_name = None
 
-  def EmitFunctionBeginA(self, function_name, params, return_type):
+  def EmitMemberFunctionBegin(self, class_name, class_template_params,
+                              class_specializations, function_name,
+                              function_params, return_type):
+    """Emit member function of a template/specialized class."""
+    if class_template_params or class_specializations:
+      self.EmitIndented('template<%s>' % ', '.join(class_template_params))
+
+    if class_specializations:
+      class_name += '<%s>' % ', '.join(map(str, class_specializations))
+
+    self.EmitIndented('%s %s::%s(%s) {' % (
+        return_type, class_name, function_name,
+        ', '.join(['%s %s' % (t, n) for (t, n) in function_params])))
+    self.PushIndent()
+
+  def EmitFunctionBegin(self, function_name, params, return_type):
     self.EmitIndented('%s %s(%s) {' %
                       (return_type, function_name,
                        ', '.join(['%s %s' % (t, n) for (t, n) in params])))
@@ -103,6 +138,37 @@
   def EmitFunctionEnd(self):
     self.PopIndent()
     self.EmitIndented('}')
+    self.EmitNewline()
+
+  def EmitClassBegin(self, class_name, template_params, specializations,
+                     base_classes):
+    """Emit class block header."""
+    self.classes.append(class_name)
+    if template_params or specializations:
+      self.EmitIndented('template<%s>' % ', '.join(template_params))
+
+    class_name_extended = class_name
+    if specializations:
+      class_name_extended += '<%s>' % ', '.join(map(str, specializations))
+    if base_classes:
+      class_name_extended += ' : ' + ', '.join(base_classes)
+    self.EmitIndented('class %s {' % class_name_extended)
+    self.PushIndent()
+
+  def EmitClassEnd(self):
+    if not self.classes:
+      raise ClassError('No class on stack.')
+    self.classes.pop()
+    self.PopIndent()
+    self.EmitIndented('};')
+    self.EmitNewline()
+
+  def EmitAccessModifier(self, modifier):
+    if not self.classes:
+      raise ClassError('No class on stack.')
+    self.PopIndent()
+    self.EmitIndented(' %s:' % modifier)
+    self.PushIndent()
 
   def EmitNamespaceBegin(self, namespace):
     self.EmitCodeNoSemicolon('namespace %s {' % namespace)
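
To make the new class-emission entry points concrete: roughly the C++ (modulo line wrapping) that a sequence such as EmitClassBegin('Stream', ['typename InType'], [], []), EmitAccessModifier('public'), EmitClassEnd() is intended to produce. The class name and member placeholder are illustrative only:

```
template<typename InType>
class Stream {
 public:
  // ... members emitted between EmitClassBegin() and EmitClassEnd() ...
};
```
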
diff --git a/meta/generators/common.py b/meta/generators/common.py
new file mode 100644
index 0000000..d680372
--- /dev/null
+++ b/meta/generators/common.py
@@ -0,0 +1,136 @@
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""."""
+
+_HEADER_COPYRIGHT = (
+    '''// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+''')
+
+
+def GenerateHeader(cc, header_name, preprocessor_directive):
+  cc.EmitCodeNoSemicolon(_HEADER_COPYRIGHT)
+  cc.EmitHeaderBegin(header_name)
+
+  cc.EmitPreprocessor1('ifdef', preprocessor_directive)
+  cc.EmitNewline()
+
+  cc.EmitInclude('<cassert>')
+  cc.EmitInclude('<cstdint>')
+  cc.EmitNewline()
+
+
+def GenerateFooter(cc, message):
+  cc.EmitPreprocessor('else')
+  cc.EmitPreprocessor1('warning', '"%s"' % message)
+  cc.EmitPreprocessor('endif')
+  cc.EmitNewline()
+  cc.EmitHeaderEnd()
+
+
+def GenerateDebugLog(cc, message):
+  cc.EmitPreprocessor1('ifdef', 'DEBUG')
+  cc.EmitPreprocessor1('ifdef', 'DEBUG_METAGEMM_VERBOSE')
+  cc.EmitCode('std::cout << __FILE__ << \"(\" << __LINE__ << \") %s\" '
+              '<< std::endl << std::flush' % message)
+  cc.EmitPreprocessor('endif')
+  cc.EmitPreprocessor('endif')
+
+
+def _TemplateName(base, params):
+  return '%s<%s>' % (base, ', '.join(map(str, params)))
+
+
+class StreamGenerator(object):
+  """."""
+
+  def __init__(self, emitter, name):
+    self.name = name
+    self.emitter = emitter
+
+  def SpecializeStream(self, in_type, lanes_count, pack_size, leftovers):
+    if callable(getattr(self, 'EmitPack', None)):
+      template_params = [in_type, lanes_count, pack_size, leftovers, self.name]
+      self.emitter.EmitMemberFunctionBegin(
+          'Stream', [], template_params, 'Pack',
+          [['const %s*' % in_type, 'in'], ['const %s&' % self.name, 'params'],
+           ['%s*' % in_type, 'out']], 'inline void')
+      GenerateDebugLog(self.emitter,
+                       '%s::Pack()' % _TemplateName(self.name, template_params))
+      self.EmitPack(in_type, lanes_count, pack_size, leftovers)
+      self.emitter.EmitFunctionEnd()
+
+
+class MulKernelGenerator(object):
+  """."""
+
+  def __init__(self, emitter, kernel_name, output_stream_name):
+    self.kernel_name = kernel_name
+    self.output_stream_name = output_stream_name
+    self.emitter = emitter
+
+  def SpecializeMulKernel(self, in_type, out_type, kernel_m, kernel_n,
+                          pack_size):
+    """Generates the kernel wrapped in a MulKernel template specialization."""
+    template_params = [
+        in_type, out_type, self.kernel_name, self.output_stream_name, kernel_m,
+        kernel_n, pack_size
+    ]
+    self.emitter.EmitMemberFunctionBegin(
+        'MulKernel', [], template_params, 'Multiply',
+        [['const %s*' % in_type, 'lhs'], ['const %s*' % in_type, 'rhs'], [
+            'const FusedKernelParams<%s, %s>&' % (self.kernel_name,
+                                                  self.output_stream_name),
+            'params'
+        ], ['%s*' % out_type, 'result']], 'inline void')
+    GenerateDebugLog(self.emitter, '%s::Multiply()' %
+                     _TemplateName(self.kernel_name + self.output_stream_name,
+                                   template_params))
+    self.EmitMultiply(in_type, out_type, kernel_m, kernel_n, pack_size)
+    self.emitter.EmitFunctionEnd()
+
+
+class Transform1DKernelGenerator(object):
+  """."""
+
+  def __init__(self, emitter, kernel_name):
+    self.kernel_name = kernel_name
+    self.emitter = emitter
+
+  def SpecializeTransform1DKernel(self, in_type, out_type, kernel_size,
+                                  leftovers):
+    """Generates the kernel wrapped in a Transform1DKernel specialization."""
+    template_params = [
+        in_type, out_type, self.kernel_name, kernel_size, leftovers
+    ]
+    self.emitter.EmitMemberFunctionBegin(
+        'Transform1DKernel', [], template_params, 'Transform',
+        [['const %s*' % in_type, 'input'],
+         ['const %s&' % self.kernel_name, 'params'],
+         ['%s*' % out_type, 'output']], 'inline void')
+    GenerateDebugLog(self.emitter, '%s::Transform()' %
+                     _TemplateName(self.kernel_name, template_params))
+    self.EmitTransform(in_type, out_type, kernel_size, leftovers)
+    self.emitter.EmitFunctionEnd()
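
The Specialize* helpers above all follow the same pattern: open a fully specialized member function via EmitMemberFunctionBegin, optionally emit the DEBUG trace, then let the subclass emit the body. Roughly the shape of the C++ they produce (the element type, lane counts, and SomeStreamParams name below are illustrative, not taken from the generators):

```
template<>
inline void Stream<std::uint8_t, 4, 8, 3, SomeStreamParams>::Pack(
    const std::uint8_t* in, const SomeStreamParams& params,
    std::uint8_t* out) {
  // (optional std::cout trace from GenerateDebugLog, guarded by
  //  #ifdef DEBUG / #ifdef DEBUG_METAGEMM_VERBOSE)
  // ... body emitted by the subclass's EmitPack() ...
}
```
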
diff --git a/meta/generators/gemm_NxMxK_neon.py b/meta/generators/gemm_NxMxK_neon.py
index 5ba00a1..baa366b 100644
--- a/meta/generators/gemm_NxMxK_neon.py
+++ b/meta/generators/gemm_NxMxK_neon.py
@@ -1,30 +1,9 @@
-"""Generates the whole gemm header.
+"""Generates the specialized gemm functions."""
 
-"""
-
-import cc_emitter
 import mul_Nx8_Mx8_neon
-import neon_emitter
 import qnt_Nx8_neon
 import zip_Nx8_neon
 
-_HEADER_COPYRIGHT = """// Copyright 2015 Google Inc. All Rights Reserved.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-//
-// single_thread_gemm.h: programatically generated GEMM library header.
-"""
-
 _QUANTIZED_8BIT = 'quantized_8bit'
 _FULL_32BIT = 'full_32bit'
 _FULL_FLOAT = 'full_float'
@@ -158,7 +137,7 @@
   GenerateMulRows(emitter, 'temp_result', 'int32', False, True, aligned, 3,
                   cols, leftovers)
   emitter.EmitCall(
-      BuildMultiQuantizeName(aligned, 3),
+      qnt_Nx8_neon.BuildMultiQuantizeName(aligned, 3),
       ['temp_result', 'n', 'mul_result_chunk_stride_bytes',
        'zipped_lhs_3_offsets', 'result_chunk', 'result_stride',
        'multiplicative_offset', 'rounding_offset', '-shift'])
@@ -171,7 +150,7 @@
     GenerateMulRows(emitter, 'temp_result', 'int32', False, True, aligned, rows,
                     cols, leftovers)
     emitter.EmitCall(
-        BuildMultiQuantizeName(aligned, rows),
+        qnt_Nx8_neon.BuildMultiQuantizeName(aligned, rows),
         ['temp_result', 'n', 'mul_result_chunk_stride_bytes',
          'zipped_lhs_%d_offsets' % rows, 'result_chunk', 'result_stride',
          'multiplicative_offset', 'rounding_offset', '-shift'])
@@ -255,40 +234,6 @@
   emitter.EmitFunctionEnd()
 
 
-def BuildMultiQuantizeName(aligned, rows):
-  name = 'multi_qnt_%dx8' % rows
-  if aligned:
-    name = '%s_aligned' % name
-  return name
-
-
-def GenerateMultiQuantize(emitter, aligned, rows):
-  """Emit main quantization code that switches between optimized versions."""
-  name = BuildMultiQuantizeName(aligned, rows)
-  emitter.EmitFunctionBeginA(
-      name,
-      [['const std::int32_t*', 'source'], ['std::int32_t', 'count'],
-       ['std::int32_t', 'stride'], ['const std::int32_t*', 'offsets'],
-       ['std::uint8_t*', 'destination'], ['std::int32_t', 'destination_stride'],
-       ['std::int32_t', 'multiplicative_offset'],
-       ['std::int32_t', 'rounding_offset'], ['std::int32_t', 'shift']], 'void')
-  emitter.EmitSwitch('count % 8')
-
-  for leftovers in range(0, 8):
-    emitter.EmitCase(leftovers)
-    emitter.PushIndent()
-    emitter.EmitCall(
-        qnt_Nx8_neon.BuildName(rows, leftovers, aligned),
-        ['source', 'count', 'stride', 'offsets', 'destination',
-         'destination_stride', 'multiplicative_offset', 'rounding_offset',
-         'shift'])
-    emitter.EmitBreak()
-    emitter.PopIndent()
-
-  emitter.EmitSwitchEnd()
-  emitter.EmitFunctionEnd()
-
-
 def GenerateGemmCall(emitter, output_type, aligned, m_mod, n_mod, leftovers):
   emitter.EmitCall(
       emitter.Scope('internal',
@@ -396,29 +341,6 @@
 
 def GenerateInternalFunctions(emitter):
   """Generate all the functions hidden in the internal namespace."""
-  zip_Nx8_neon.GenerateFunctions(neon_emitter.NeonEmitter())
-  emitter.EmitNewline()
-
-  mul_Nx8_Mx8_neon.GenerateFunctions(neon_emitter.NeonEmitter(), 'int32', False,
-                                     True)
-  emitter.EmitNewline()
-
-  mul_Nx8_Mx8_neon.GenerateFunctions(neon_emitter.NeonEmitter(), 'int32', True,
-                                     True)
-  emitter.EmitNewline()
-
-  mul_Nx8_Mx8_neon.GenerateFunctions(neon_emitter.NeonEmitter(), 'float', True,
-                                     True)
-  emitter.EmitNewline()
-
-  qnt_Nx8_neon.GenerateFunctions(neon_emitter.NeonEmitter())
-  emitter.EmitNewline()
-
-  for aligned in [True, False]:
-    for rows in range(1, 4):
-      GenerateMultiQuantize(emitter, aligned, rows)
-      emitter.EmitNewline()
-
   for output_type in [_QUANTIZED_8BIT, _FULL_32BIT, _FULL_FLOAT]:
     for aligned in [True, False]:
       for rows in range(0, 3):
@@ -428,54 +350,10 @@
             emitter.EmitNewline()
 
 
-def Main():
-  """Generate the single threaded meta gemm library."""
-  emitter = cc_emitter.CCEmitter()
+def GeneratePublicFunctions(emitter):
+  for output_type in [_QUANTIZED_8BIT, _FULL_32BIT, _FULL_FLOAT]:
+    GenerateMainGemmFunction(emitter, output_type)
+    emitter.EmitNewline()
 
-  emitter.EmitCodeNoSemicolon(_HEADER_COPYRIGHT)
-  emitter.EmitHeaderBegin('gemmlowp_meta_single_thread_gemm')
-
-  emitter.EmitPreprocessor1('ifdef', 'GEMMLOWP_NEON_32')
-  emitter.EmitNewline()
-
-  emitter.EmitInclude('<cassert>')
-  emitter.EmitNewline()
-
-  emitter.EmitNamespaceBegin('gemmlowp')
-  emitter.EmitNamespaceBegin('meta')
-  emitter.EmitNamespaceBegin('internal')
-  emitter.EmitNewline()
-
-  GenerateInternalFunctions(emitter)
-
-  emitter.EmitNamespaceEnd()
-  emitter.EmitNewline()
-
-  GenerateMainGemmFunction(emitter, _QUANTIZED_8BIT)
-  emitter.EmitNewline()
-  GenerateMainGemmFunction(emitter, _FULL_32BIT)
-  emitter.EmitNewline()
-  GenerateMainGemmFunction(emitter, _FULL_FLOAT)
-  emitter.EmitNewline()
-  GenerateWrapperGemmFunction(emitter, _QUANTIZED_8BIT)
-  emitter.EmitNewline()
-  GenerateWrapperGemmFunction(emitter, _FULL_32BIT)
-  emitter.EmitNewline()
-  GenerateWrapperGemmFunction(emitter, _FULL_FLOAT)
-  emitter.EmitNewline()
-
-  emitter.EmitNamespaceEnd()
-  emitter.EmitNamespaceEnd()
-  emitter.EmitNewline()
-
-  emitter.EmitPreprocessor('else')
-  emitter.EmitPreprocessor1('warning',
-                            '"Meta gemm fast-path requires GEMMLOWP_NEON_32!"')
-  emitter.EmitPreprocessor('endif')
-  emitter.EmitNewline()
-
-  emitter.EmitHeaderEnd()
-
-
-if __name__ == '__main__':
-  Main()
+    GenerateWrapperGemmFunction(emitter, output_type)
+    emitter.EmitNewline()
diff --git a/meta/generators/gemv_1xMxK_neon.py b/meta/generators/gemv_1xMxK_neon.py
new file mode 100644
index 0000000..aba6983
--- /dev/null
+++ b/meta/generators/gemv_1xMxK_neon.py
@@ -0,0 +1,285 @@
+"""Generates the specialized gemv functions."""
+
+import mul_1x8_Mx8_neon
+import mul_Nx8_Mx8_neon
+import qnt_Nx8_neon
+import zip_Nx8_neon
+
+_QUANTIZED_8BIT = 'quantized_8bit'
+_FULL_32BIT = 'full_32bit'
+_FULL_FLOAT = 'full_float'
+
+
+class Error(Exception):
+  """Module level error."""
+
+
+class ConfigurationError(Error):
+  """Runtime configuration error."""
+
+
+def GenerateCommonTempsCountersAndConsts(emitter):
+  """Generates common gemv boilerplate variables."""
+  emitter.EmitDeclare('const std::int32_t', 'col_chunks', 'n / 8')
+  emitter.EmitDeclare('const std::int32_t', 'padded_k', '((k + 7) / 8) * 8')
+  emitter.EmitDeclare('const std::int32_t', 'chunk_size', 'k * 4')
+  emitter.EmitDeclare('const std::int32_t', 'zipped_chunk_size',
+                      '(padded_k + 16) * 4')
+  emitter.EmitDeclare('const std::uint8_t*', 'rhs_chunk', 'rhs')
+  emitter.EmitDeclare('std::uint8_t*', 'zipped_lhs', 'scratch')
+  emitter.EmitDeclare('std::int32_t*', 'zipped_lhs_offsets',
+                      'reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k)')
+  emitter.EmitDeclare('std::uint8_t*', 'zipped_rhs_1',
+                      'scratch + padded_k + 16')
+  emitter.EmitDeclare('std::uint8_t*', 'zipped_rhs_2',
+                      'zipped_rhs_1 + zipped_chunk_size')
+  emitter.EmitNewline()
+
+
+def GenerateQuantized8BitTempsCountersAndConsts(emitter):
+  """Generates all the boilerplate variables for the q8 gemm function."""
+  GenerateCommonTempsCountersAndConsts(emitter)
+  emitter.EmitDeclare('const std::int32_t', 'const_offset',
+                      'lhs_offset * rhs_offset * k + result_offset')
+  emitter.EmitDeclare('const std::int32_t', 'rounding_offset',
+                      '(1 << (shift - 1))')
+  emitter.EmitDeclare('std::int32_t*', 'temp_result',
+                      'reinterpret_cast<std::int32_t*>('
+                      'zipped_rhs_2 + zipped_chunk_size)')
+  emitter.EmitDeclare('std::int32_t*', 'mul_result_chunk', 'temp_result')
+  emitter.EmitNewline()
+
+
+def GenerateFullTempsCountersAndConsts(emitter, result_type):
+  """Generates all the boilerplate variables for the int32 and float gemms."""
+  GenerateCommonTempsCountersAndConsts(emitter)
+  emitter.EmitDeclare('const std::int32_t', 'const_offset',
+                      'lhs_offset * rhs_offset * k')
+  emitter.EmitDeclare(result_type, 'mul_result_chunk', 'result')
+  emitter.EmitNewline()
+
+
+def GenerateZipVector(emitter, aligned, leftovers):
+  emitter.EmitCall(
+      zip_Nx8_neon.BuildName(1, leftovers, aligned),
+      ['lhs', 'k', 'k', 'zipped_lhs', 'rhs_offset', 0])
+
+
+def GetMul2Params(result_type):
+  params = ['zipped_lhs', 'zipped_rhs_1', 'zipped_rhs_2', 'padded_k',
+            'mul_result_chunk']
+  if result_type is 'float':
+    params.append('result_scale')
+  return params
+
+
+def GetMulParams(result_type):
+  params = ['zipped_lhs', 'zipped_rhs_1', 'padded_k', 'mul_result_chunk', 0]
+  if result_type is 'float':
+    params.append('result_scale')
+  return params
+
+
+def GenerateMulCols(emitter, result_type, lhs_add, rhs_add, aligned, cols,
+                    leftovers):
+  """Emits code responsible for multiplication of one horizontal lhs strip."""
+  emitter.EmitOpenBracket('for (int i = 0; i < col_chunks; ++i)')
+  emitter.EmitCall(
+      zip_Nx8_neon.BuildName(4, leftovers, aligned),
+      ['rhs_chunk', 'k', 'k', 'zipped_rhs_1', 'lhs_offset', 'const_offset'])
+  emitter.EmitAssignIncrement('rhs_chunk', 'chunk_size')
+
+  emitter.EmitCall(
+      zip_Nx8_neon.BuildName(4, leftovers, aligned),
+      ['rhs_chunk', 'k', 'k', 'zipped_rhs_2', 'lhs_offset', 'const_offset'])
+  emitter.EmitAssignIncrement('rhs_chunk', 'chunk_size')
+
+  emitter.EmitCall(
+      mul_1x8_Mx8_neon.BuildName(result_type, lhs_add, rhs_add, 8),
+      GetMul2Params(result_type))
+
+  emitter.EmitAssignIncrement('mul_result_chunk', 8)
+  emitter.EmitCloseBracket()
+
+  if cols > 4:
+    emitter.EmitCall(
+        zip_Nx8_neon.BuildName(4, leftovers, aligned),
+        ['rhs_chunk', 'k', 'k', 'zipped_rhs_1', 'lhs_offset', 'const_offset'])
+    emitter.EmitAssignIncrement('rhs_chunk', 'chunk_size')
+
+    emitter.EmitCall(
+        zip_Nx8_neon.BuildName(cols - 4, leftovers, aligned),
+        ['rhs_chunk', 'k', 'k', 'zipped_rhs_2', 'lhs_offset', 'const_offset'])
+
+    emitter.EmitCall(
+        mul_1x8_Mx8_neon.BuildName(result_type, lhs_add, rhs_add, cols),
+        GetMul2Params(result_type))
+  elif cols > 0:
+    emitter.EmitCall(
+        zip_Nx8_neon.BuildName(cols, leftovers, aligned),
+        ['rhs_chunk', 'k', 'k', 'zipped_rhs_1', 'lhs_offset', 'const_offset'])
+
+    emitter.EmitCall(
+        mul_Nx8_Mx8_neon.BuildName(result_type, lhs_add, rhs_add, 1, cols),
+        GetMulParams(result_type))
+
+
+def GenerateQuantized8BitMul(emitter, aligned, cols, leftovers):
+  """Emits code for all lhs strips & leftover rows. Quantize after mul code."""
+  GenerateMulCols(emitter, 'int32', False, True, aligned, cols, leftovers)
+  emitter.EmitCall(
+      qnt_Nx8_neon.BuildName(1, cols, aligned),
+      ['temp_result', 'n', 0, 'zipped_lhs_offsets', 'result', 0,
+       'multiplicative_offset', 'rounding_offset', '-shift'])
+
+
+def GenerateFullMul(emitter, result_type, aligned, cols, leftovers):
+  GenerateMulCols(emitter, result_type, True, True, aligned, cols, leftovers)
+
+
+def BuildName(output_type, aligned, cols, leftover):
+  name = BuildMainGemvName(output_type) + '_%d_%d' % (cols, leftover)
+  if aligned:
+    name += '_aligned'
+  return name
+
+
+def GetCommonGemvParameters():
+  return [['std::uint8_t*', 'scratch'], ['const std::uint8_t*', 'lhs'],
+          ['const std::uint8_t*', 'rhs'], ['std::int32_t', 'n'],
+          ['std::int32_t', 'k'], ['std::int32_t', 'lhs_offset'],
+          ['std::int32_t', 'rhs_offset']]
+
+
+def GetGemvParameters(output_type):
+  """Prepares a (type, parameter) array for the gemm functions."""
+  params = GetCommonGemvParameters()
+  if output_type is _QUANTIZED_8BIT:
+    params += [['std::int32_t', 'result_offset'],
+               ['std::int32_t', 'multiplicative_offset'],
+               ['std::int32_t', 'shift'], ['std::uint8_t*', 'result']]
+  elif output_type is _FULL_32BIT:
+    params += [['std::int32_t*', 'result']]
+  elif output_type is _FULL_FLOAT:
+    params += [['float', 'result_scale'], ['float*', 'result']]
+  else:
+    raise ConfigurationError('Unsupported output type: %s' % output_type)
+  return params
+
+
+def GenerateGemv(emitter, output_type, aligned, cols, leftovers):
+  """Build one gemm function for given col, and depth leftovers."""
+  emitter.EmitFunctionBeginA(
+      BuildName(output_type, aligned, cols, leftovers),
+      GetGemvParameters(output_type), 'void')
+
+  emitter.EmitAssert('n %% 8 == %d' % cols)
+  emitter.EmitAssert('k %% 8 == %d' % leftovers)
+
+  if output_type is _QUANTIZED_8BIT:
+    GenerateQuantized8BitTempsCountersAndConsts(emitter)
+    GenerateZipVector(emitter, aligned, leftovers)
+    GenerateQuantized8BitMul(emitter, aligned, cols, leftovers)
+  elif output_type is _FULL_32BIT:
+    GenerateFullTempsCountersAndConsts(emitter, 'std::int32_t*')
+    GenerateZipVector(emitter, aligned, leftovers)
+    GenerateFullMul(emitter, 'int32', aligned, cols, leftovers)
+  elif output_type is _FULL_FLOAT:
+    GenerateFullTempsCountersAndConsts(emitter, 'float*')
+    GenerateZipVector(emitter, aligned, leftovers)
+    GenerateFullMul(emitter, 'float', aligned, cols, leftovers)
+  else:
+    raise ConfigurationError('Unknown output type: %s' % output_type)
+
+  emitter.EmitFunctionEnd()
+
+
+def GenerateGemvCall(emitter, output_type, aligned, m_mod, leftovers):
+  emitter.EmitCall(
+      emitter.Scope('internal',
+                    BuildName(output_type, aligned, m_mod, leftovers)),
+      [p for (unused_t, p) in GetGemvParameters(output_type)])
+
+
+def GenerateGemvSwitch2(emitter, output_type, aligned, n_mod):
+  """Second level of main switch, choose optimized version on depth leftover."""
+  emitter.EmitSwitch('k % 8')
+
+  for leftovers in range(0, 8):
+    emitter.EmitCase(leftovers)
+    emitter.PushIndent()
+    GenerateGemvCall(emitter, output_type, aligned, n_mod, leftovers)
+    emitter.EmitBreak()
+    emitter.PopIndent()
+
+  emitter.EmitSwitchEnd()
+
+
+def GenerateGemvSwitch1(emitter, output_type, aligned):
+  """First level of main switch, choose optimized version on cols leftover."""
+  emitter.EmitSwitch('n % 8')
+
+  for n_mod in range(0, 8):
+    emitter.EmitCase(n_mod)
+    emitter.PushIndent()
+    GenerateGemvSwitch2(emitter, output_type, aligned, n_mod)
+    emitter.EmitBreak()
+    emitter.PopIndent()
+
+  emitter.EmitSwitchEnd()
+
+
+def BuildMainGemvName(output_type):
+  if output_type is _QUANTIZED_8BIT:
+    return 'gemv_q8'
+  elif output_type is _FULL_32BIT:
+    return 'gemv_i32'
+  elif output_type is _FULL_FLOAT:
+    return 'gemv_f'
+  else:
+    raise ConfigurationError('Unsupported output type: %s' % output_type)
+
+
+def GenerateMainGemvFunction(emitter, output_type):
+  """Emit high level gemv function that switches between optimized versions."""
+  emitter.EmitFunctionBeginA(
+      BuildMainGemvName(output_type), GetGemvParameters(output_type), 'void')
+
+  emitter.EmitDeclare('const bool', 'lhs_aligned',
+                      '((reinterpret_cast<std::uintptr_t>(lhs) % 8) == 0)')
+  emitter.EmitDeclare('const bool', 'rhs_aligned',
+                      '((reinterpret_cast<std::uintptr_t>(rhs) % 8) == 0)')
+  emitter.EmitDeclare('const bool', 'k_aligned', '((k % 8) == 0)')
+
+  if output_type is _QUANTIZED_8BIT:
+    emitter.EmitDeclare('const bool', 'result_aligned',
+                        '((reinterpret_cast<std::uintptr_t>(result) % 8) == 0)')
+    emitter.EmitDeclare('const bool', 'aligned',
+                        'lhs_aligned && rhs_aligned && result_aligned '
+                        '&& k_aligned')
+  else:
+    emitter.EmitDeclare('const bool', 'aligned',
+                        'lhs_aligned && rhs_aligned && k_aligned')
+
+  emitter.EmitIf('aligned')
+  GenerateGemvSwitch1(emitter, output_type, True)
+  emitter.EmitElse()
+  GenerateGemvSwitch1(emitter, output_type, False)
+  emitter.EmitEndif()
+  emitter.EmitFunctionEnd()
+
+
+def GenerateInternalFunctions(emitter):
+  """Generate all the functions hidden in the internal namespace."""
+  for output_type in [_QUANTIZED_8BIT, _FULL_32BIT, _FULL_FLOAT]:
+    for aligned in [True, False]:
+      for cols in range(0, 8):
+        for leftover in range(0, 8):
+          GenerateGemv(emitter, output_type, aligned, cols, leftover)
+          emitter.EmitNewline()
+
+
+def GeneratePublicFunctions(emitter):
+  for output_type in [_QUANTIZED_8BIT, _FULL_32BIT, _FULL_FLOAT]:
+    GenerateMainGemvFunction(emitter, output_type)
+    emitter.EmitNewline()
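
Putting the pieces above together, the generated public gemv entry point is a two-level dispatcher: it checks pointer and depth alignment, then switches on n % 8 and k % 8 to reach one fully specialized internal function. An abridged sketch of that shape for the quantized 8-bit output type (only one case spelled out; the forward declaration stands in for the generated specializations so the sketch compiles on its own):

```
#include <cstdint>

namespace gemmlowp {
namespace meta {
namespace internal {
// One of the 8 x 8 x {aligned, unaligned} specializations emitted by
// GenerateGemv(); declared here only so the sketch is self-contained.
void gemv_q8_3_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
                         const std::uint8_t* rhs, std::int32_t n,
                         std::int32_t k, std::int32_t lhs_offset,
                         std::int32_t rhs_offset, std::int32_t result_offset,
                         std::int32_t multiplicative_offset, std::int32_t shift,
                         std::uint8_t* result);
}  // namespace internal

void gemv_q8(std::uint8_t* scratch, const std::uint8_t* lhs,
             const std::uint8_t* rhs, std::int32_t n, std::int32_t k,
             std::int32_t lhs_offset, std::int32_t rhs_offset,
             std::int32_t result_offset, std::int32_t multiplicative_offset,
             std::int32_t shift, std::uint8_t* result) {
  const bool lhs_aligned = ((reinterpret_cast<std::uintptr_t>(lhs) % 8) == 0);
  const bool rhs_aligned = ((reinterpret_cast<std::uintptr_t>(rhs) % 8) == 0);
  const bool result_aligned =
      ((reinterpret_cast<std::uintptr_t>(result) % 8) == 0);
  const bool k_aligned = ((k % 8) == 0);
  const bool aligned =
      lhs_aligned && rhs_aligned && result_aligned && k_aligned;
  if (aligned) {
    switch (n % 8) {
      // ... cases 0..7; e.g. for n % 8 == 3:
      case 3:
        switch (k % 8) {
          // ... cases 0..7; e.g. for k % 8 == 5:
          case 5:
            internal::gemv_q8_3_5_aligned(scratch, lhs, rhs, n, k, lhs_offset,
                                          rhs_offset, result_offset,
                                          multiplicative_offset, shift, result);
            break;
        }
        break;
    }
  } else {
    // ... identical two-level switch calling the non-"_aligned" variants ...
  }
}

}  // namespace meta
}  // namespace gemmlowp
```
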
diff --git a/meta/generators/meta_neon.py b/meta/generators/meta_neon.py
new file mode 100644
index 0000000..31b36b9
--- /dev/null
+++ b/meta/generators/meta_neon.py
@@ -0,0 +1,116 @@
+"""Generates the meta gemm/gemv library header."""
+
+import cc_emitter
+import gemm_NxMxK_neon
+import gemv_1xMxK_neon
+import mul_1x8_Mx8_neon
+import mul_Nx8_Mx8_neon
+import neon_emitter
+import qnt_Nx8_neon
+import zip_Nx8_neon
+
+_HEADER_COPYRIGHT = """// Copyright 2015 Google Inc. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+//
+// single_thread_gemm.h: programmatically generated GEMM library header.
+"""
+
+
+def GenerateInternalFunctions(emitter):
+  """Generate all the functions hidden in the internal namespace."""
+  zip_Nx8_neon.GenerateFunctions(neon_emitter.NeonEmitter())
+  emitter.EmitNewline()
+
+  mul_Nx8_Mx8_neon.GenerateFunctions(neon_emitter.NeonEmitter(), 'int32', False,
+                                     True)
+  emitter.EmitNewline()
+
+  mul_Nx8_Mx8_neon.GenerateFunctions(neon_emitter.NeonEmitter(), 'int32', True,
+                                     True)
+  emitter.EmitNewline()
+
+  mul_Nx8_Mx8_neon.GenerateFunctions(neon_emitter.NeonEmitter(), 'float', True,
+                                     True)
+  emitter.EmitNewline()
+
+  mul_1x8_Mx8_neon.GenerateFunctions(neon_emitter.NeonEmitter(), 'int32', False,
+                                     True)
+  emitter.EmitNewline()
+
+  mul_1x8_Mx8_neon.GenerateFunctions(neon_emitter.NeonEmitter(), 'int32', True,
+                                     True)
+  emitter.EmitNewline()
+
+  mul_1x8_Mx8_neon.GenerateFunctions(neon_emitter.NeonEmitter(), 'float', True,
+                                     True)
+  emitter.EmitNewline()
+
+  qnt_Nx8_neon.GenerateFunctions(neon_emitter.NeonEmitter(), emitter)
+  emitter.EmitNewline()
+
+  gemm_NxMxK_neon.GenerateInternalFunctions(emitter)
+  emitter.EmitNewline()
+
+  gemv_1xMxK_neon.GenerateInternalFunctions(emitter)
+  emitter.EmitNewline()
+
+
+def GeneratePublicFunctions(emitter):
+  gemm_NxMxK_neon.GeneratePublicFunctions(emitter)
+  emitter.EmitNewline()
+
+  gemv_1xMxK_neon.GeneratePublicFunctions(emitter)
+  emitter.EmitNewline()
+
+
+def Main():
+  """Generate the single threaded meta gemm library."""
+  emitter = cc_emitter.CCEmitter()
+
+  emitter.EmitCodeNoSemicolon(_HEADER_COPYRIGHT)
+  emitter.EmitHeaderBegin('gemmlowp_meta_single_thread_gemm')
+
+  emitter.EmitPreprocessor1('ifdef', 'GEMMLOWP_NEON_32')
+  emitter.EmitNewline()
+
+  emitter.EmitInclude('<cassert>')
+  emitter.EmitNewline()
+
+  emitter.EmitNamespaceBegin('gemmlowp')
+  emitter.EmitNamespaceBegin('meta')
+  emitter.EmitNamespaceBegin('internal')
+  emitter.EmitNewline()
+
+  GenerateInternalFunctions(emitter)
+
+  emitter.EmitNamespaceEnd()
+  emitter.EmitNewline()
+
+  GeneratePublicFunctions(emitter)
+
+  emitter.EmitNamespaceEnd()
+  emitter.EmitNamespaceEnd()
+  emitter.EmitNewline()
+
+  emitter.EmitPreprocessor('else')
+  emitter.EmitPreprocessor1('warning',
+                            '"Meta gemm fast-path requires GEMMLOWP_NEON_32!"')
+  emitter.EmitPreprocessor('endif')
+  emitter.EmitNewline()
+
+  emitter.EmitHeaderEnd()
+
+
+if __name__ == '__main__':
+  Main()
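
For orientation, roughly the skeleton of the header that Main() above writes to stdout, abridged to the structure implied by the emitter calls shown (guard name from EmitHeaderBegin, NEON gate from EmitPreprocessor1):

```
// (copyright header)
#ifndef GEMMLOWP_META_SINGLE_THREAD_GEMM_H_
#define GEMMLOWP_META_SINGLE_THREAD_GEMM_H_

#ifdef GEMMLOWP_NEON_32

#include <cassert>

namespace gemmlowp {
namespace meta {
namespace internal {

// ... zip / multiply / quantize primitives and the specialized gemm / gemv
//     bodies from GenerateInternalFunctions() ...

}  // namespace internal

// ... public entry points from GeneratePublicFunctions(), e.g.
//     gemv_q8 / gemv_i32 / gemv_f and their gemm counterparts ...

}  // namespace meta
}  // namespace gemmlowp

#else
#warning "Meta gemm fast-path requires GEMMLOWP_NEON_32!"
#endif

#endif  // GEMMLOWP_META_SINGLE_THREAD_GEMM_H_
```
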
diff --git a/meta/generators/metagemm_generate_headers.sh b/meta/generators/metagemm_generate_headers.sh
new file mode 100755
index 0000000..e7e92c6
--- /dev/null
+++ b/meta/generators/metagemm_generate_headers.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+python streams_arm_32.py > ../streams_arm_32.h
+python streams_arm_64.py > ../streams_arm_64.h
+python quantized_mul_kernels_arm_32.py > ../quantized_mul_kernels_arm_32.h
+python quantized_mul_kernels_arm_64.py > ../quantized_mul_kernels_arm_64.h
+python transform_kernels_arm_32.py > ../transform_kernels_arm_32.h
+python transform_kernels_arm_64.py > ../transform_kernels_arm_64.h
+
diff --git a/meta/generators/mul_1x8_Mx8_neon.py b/meta/generators/mul_1x8_Mx8_neon.py
new file mode 100644
index 0000000..9b9b44b
--- /dev/null
+++ b/meta/generators/mul_1x8_Mx8_neon.py
@@ -0,0 +1,285 @@
+"""Multiply primitive optimized for the gemv operation."""
+
+import neon_emitter
+
+
+class Error(Exception):
+  """Module level error."""
+
+
+class ConfigurationError(Error):
+  """Unsupported configuration."""
+
+
+def GenerateLoadMultiplyAggregate(emitter, registers, lanes_count, aggregators,
+                                  count, lhs, rhs_1, rhs_2):
+  """Emit inner loop for 1 row x M cols multiplication."""
+  emitter.EmitComment('General 1xM lanes loop.')
+  emitter.EmitNumericalLabel(1)
+  emitter.EmitNewline()
+  emitter.EmitComment('Subtract counter.')
+  emitter.EmitSubs(count, count, emitter.ImmediateConstant(8))
+  emitter.EmitNewline()
+
+  right_load = [registers.DoubleRegister() for unused_i in range(4)]
+  left_load = registers.DoubleRegister()
+
+  emitter.EmitVLoad('1.8', left_load, emitter.DereferenceIncrement(lhs, 64))
+  emitter.EmitVLoadA('1.8', right_load, emitter.DereferenceIncrement(rhs_1, 64))
+
+  emitter.EmitPldOffset(lhs, emitter.ImmediateConstant(64))
+  emitter.EmitPldOffset(rhs_1, emitter.ImmediateConstant(128))
+
+  multiply_results = [registers.QuadRegister() for unused_i in range(4)]
+
+  for i in range(4):
+    emitter.EmitVMull('u8', multiply_results[i], right_load[i], left_load)
+
+  emitter.EmitVLoadA('1.8', right_load[:lanes_count],
+                     emitter.DereferenceIncrement(rhs_2, 64))
+  emitter.EmitPldOffset(rhs_2, emitter.ImmediateConstant(lanes_count * 32))
+
+  for i in range(4):
+    emitter.EmitVPadal('u16', aggregators[i], multiply_results[i])
+
+  for i in range(lanes_count):
+    emitter.EmitVMull('u8', multiply_results[i], right_load[i], left_load)
+
+  for i in range(lanes_count):
+    emitter.EmitVPadal('u16', aggregators[i + 4], multiply_results[i])
+
+  emitter.EmitNewline()
+  emitter.EmitComment('Loop break.')
+  emitter.EmitBneBack(1)
+  emitter.EmitNewline()
+
+  registers.FreeRegister(left_load)
+  registers.FreeRegisters(right_load)
+  registers.FreeRegisters(multiply_results)
+
+
+def ReadLeft(emitter, registers, lhs):
+  register = registers.QuadRegister()
+  emitter.EmitVLoadA('1.32', [emitter.AllLanes(registers.Low(register)),
+                              emitter.AllLanes(registers.High(register))],
+                     emitter.Dereference(lhs, None))
+  return register
+
+
+def ReadRight(emitter, registers, rhs, count):
+  if count == 1 or count == 2:
+    register = registers.DoubleRegister()
+  elif count == 3 or count == 4:
+    register = registers.QuadRegister()
+  else:
+    raise ConfigurationError('Unsupported elements no: %d' % count)
+  emitter.EmitVLoad('1.32', register, emitter.Dereference(rhs, 64))
+  return register
+
+
+def DuplicateGeneralRegister(emitter, registers, general_register,
+                             min_register):
+  duplicated = registers.QuadRegister(min_register)
+  emitter.EmitVDup('32', duplicated, general_register)
+  return duplicated
+
+
+def GenerateAggregatorReduceStore(emitter, registers, lanes_count, aggregators,
+                                  result_type, lhs_add, rhs_add, lhs, rhs_1,
+                                  rhs_2, results):
+  """Generates assembly responsible for reducing the 4 way aggregators."""
+  if lhs_add:
+    left_offset = ReadLeft(emitter, registers, lhs)
+  else:
+    left_offset = None
+
+  if rhs_add:
+    right_offset_1 = ReadRight(emitter, registers, rhs_1, 4)
+    right_offset_2 = ReadRight(emitter, registers, rhs_2, lanes_count)
+  else:
+    right_offset_1 = None
+    right_offset_2 = None
+
+  if result_type == 'float':
+    result_scale = DuplicateGeneralRegister(
+        emitter, registers, registers.MapParameter('result_scale'), 4)
+  else:
+    result_scale = None
+
+  emitter.EmitNewline()
+  emitter.EmitComment('Horizontal reduce aggregators.')
+  for aggregator in aggregators:
+    emitter.EmitVPadd('u32', registers.Low(aggregator),
+                      registers.Low(aggregator), registers.High(aggregator))
+
+  temp = aggregators[0]
+  emitter.EmitVPadd('u32', registers.Low(temp), registers.Low(aggregators[0]),
+                    registers.Low(aggregators[1]))
+  emitter.EmitVPadd('u32', registers.High(temp), registers.Low(aggregators[2]),
+                    registers.Low(aggregators[3]))
+
+  if lanes_count == 1:
+    temp_2 = registers.Low(aggregators[1])
+    emitter.EmitVPadd('u32', temp_2, registers.Low(aggregators[4]),
+                      registers.Low(aggregators[4]))
+  elif lanes_count == 2:
+    temp_2 = registers.Low(aggregators[1])
+    emitter.EmitVPadd('u32', temp_2, registers.Low(aggregators[4]),
+                      registers.Low(aggregators[5]))
+  elif lanes_count == 3:
+    temp_2 = aggregators[1]
+    emitter.EmitVPadd('u32', registers.Low(temp_2),
+                      registers.Low(aggregators[4]),
+                      registers.Low(aggregators[5]))
+    emitter.EmitVPadd('u32', registers.High(temp_2),
+                      registers.Low(aggregators[6]),
+                      registers.Low(aggregators[6]))
+  elif lanes_count == 4:
+    temp_2 = aggregators[1]
+    emitter.EmitVPadd('u32', registers.Low(temp_2),
+                      registers.Low(aggregators[4]),
+                      registers.Low(aggregators[5]))
+    emitter.EmitVPadd('u32', registers.High(temp_2),
+                      registers.Low(aggregators[6]),
+                      registers.Low(aggregators[7]))
+  else:
+    temp_2 = None
+
+  if lhs_add:
+    emitter.EmitNewline()
+    emitter.EmitComment('Add lhs offsets to aggregated rows.')
+    emitter.EmitVAdd('s32', temp, temp, left_offset)
+    if lanes_count == 1 or lanes_count == 2:
+      emitter.EmitVAdd('s32', temp_2, temp_2, registers.Low(left_offset))
+    elif lanes_count == 3 or lanes_count == 4:
+      emitter.EmitVAdd('s32', temp_2, temp_2, left_offset)
+
+  if rhs_add:
+    emitter.EmitNewline()
+    emitter.EmitComment('Add rhs offset to aggregated rows.')
+    emitter.EmitVAdd('s32', temp, temp, right_offset_1)
+    emitter.EmitVAdd('s32', temp_2, temp_2, right_offset_2)
+
+  if result_type == 'float':
+    emitter.EmitNewline()
+    emitter.EmitComment('Convert to float and scale.')
+    emitter.EmitVCvt('f32', 's32', temp, temp)
+    emitter.EmitVCvt('f32', 's32', temp_2, temp_2)
+    emitter.EmitVMul('f32', temp, temp, result_scale)
+    if lanes_count == 1 or lanes_count == 2:
+      emitter.EmitVMul('f32', temp_2, temp_2, registers.Low(result_scale))
+    elif lanes_count == 3 or lanes_count == 4:
+      emitter.EmitVMul('f32', temp_2, temp_2, result_scale)
+
+  emitter.EmitNewline()
+  emitter.EmitComment('Store results.')
+  if lanes_count == 1:
+    emitter.EmitVStoreA('1.32', [registers.Low(temp), registers.High(temp)],
+                        emitter.DereferenceIncrement(results, None))
+    emitter.EmitVStore('1.32', emitter.Lane(temp_2, 0),
+                       emitter.Dereference(results, None))
+  elif lanes_count == 2:
+    emitter.EmitVStoreA('1.32', [registers.Low(temp), registers.High(temp),
+                                 temp_2], emitter.Dereference(results, None))
+  elif lanes_count == 3:
+    emitter.EmitVStoreA(
+        '1.32',
+        [registers.Low(temp), registers.High(temp), registers.Low(temp_2)],
+        emitter.DereferenceIncrement(results, None))
+    emitter.EmitVStore('1.32', emitter.Lane(
+        registers.High(temp_2), 0), emitter.Dereference(results, None))
+  elif lanes_count == 4:
+    emitter.EmitVStoreA('1.32', [registers.Low(temp), registers.High(temp),
+                                 registers.Low(temp_2), registers.High(temp_2)],
+                        emitter.Dereference(results, None))
+
+
+def BuildName(result_type, lhs_add, rhs_add, lanes):
+  name = 'mul_1x8_%dx8_%s' % (lanes, result_type)
+  if lhs_add:
+    name += '_lhsadd'
+  if rhs_add:
+    name += '_rhsadd'
+  return name
+
+
+def CppResultType(result_type):
+  if result_type == 'int32':
+    return 'std::int32_t*'
+  elif result_type == 'float':
+    return 'float*'
+  else:
+    raise ConfigurationError('Unsupported result type: %s' % result_type)
+
+
+def GetParameters(result_type):
+  params = [['const std::uint8_t*', 'lhs'], ['const std::uint8_t*', 'rhs_1'],
+            ['const std::uint8_t*', 'rhs_2'], ['std::int32_t', 'count'],
+            [CppResultType(result_type), 'result']]
+  if result_type == 'float':
+    params.append(['float', 'result_scale'])
+  return params
+
+
+def GenerateAndClearAggregators(emitter, registers, aggregator_count):
+  """Prepare aggregators and emit aggregator clear code."""
+  emitter.EmitNewline()
+  emitter.EmitComment('Clear aggregators.')
+  aggregators = []
+  for i in range(aggregator_count):
+    aggregator = registers.QuadRegister()
+    aggregators.append(aggregator)
+    if i < 3:
+      emitter.EmitVMov('i32', aggregator, emitter.ImmediateConstant(0))
+    else:
+      emitter.EmitVMov('i32', aggregator, aggregators[i - 3])
+  emitter.EmitNewline()
+  return aggregators
+
+
+def GenerateMul1x8Mx8(emitter, result_type, lhs_add, rhs_add, lanes_count):
+  """Generates the 1xN multiplication primitive."""
+  if lanes_count < 1 or lanes_count > 4:
+    raise ConfigurationError('Lanes should be: 1, 2, 3 or 4.')
+
+  emitter.EmitFunctionBeginA(
+      BuildName(result_type, lhs_add, rhs_add, lanes_count + 4),
+      GetParameters(result_type), 'inline void')
+
+  emitter.EmitAssert('count % 8 == 0')
+  emitter.EmitAssert('count >= 8')
+  emitter.EmitAsmBegin()
+
+  registers = neon_emitter.NeonRegisters()
+
+  count = registers.MapParameter('count')
+
+  lhs = registers.MapParameter('lhs')
+  rhs_1 = registers.MapParameter('rhs_1')
+  rhs_2 = registers.MapParameter('rhs_2')
+
+  emitter.EmitPld(lhs)
+  emitter.EmitPld(rhs_1)
+  emitter.EmitPld(rhs_2)
+
+  aggregators = GenerateAndClearAggregators(emitter, registers, lanes_count + 4)
+
+  GenerateLoadMultiplyAggregate(emitter, registers, lanes_count, aggregators,
+                                count, lhs, rhs_1, rhs_2)
+  GenerateAggregatorReduceStore(emitter, registers, lanes_count, aggregators,
+                                result_type, lhs_add, rhs_add, lhs, rhs_1,
+                                rhs_2, registers.MapParameter('result'))
+
+  emitter.EmitAsmEnd(registers.MappedParameters(), [],
+                     registers.Clobbers() + ['cc', 'memory'])
+  emitter.EmitFunctionEnd()
+
+
+def GenerateFunctions(emitter, result_type, lhs_add, rhs_add):
+  for lanes in range(1, 5):
+    GenerateMul1x8Mx8(emitter, result_type, lhs_add, rhs_add, lanes)
+    emitter.EmitNewline()
+
+
+if __name__ == '__main__':
+  GenerateFunctions(neon_emitter.NeonEmitter(), 'int32', True, True)
diff --git a/meta/generators/mul_Nx8_Mx8_neon.py b/meta/generators/mul_Nx8_Mx8_neon.py
index 9396163..bbdf881 100644
--- a/meta/generators/mul_Nx8_Mx8_neon.py
+++ b/meta/generators/mul_Nx8_Mx8_neon.py
@@ -167,7 +167,7 @@
     register = registers.DoubleRegister(min_reg * 2)
     emitter.EmitVLoad('1.32', register, emitter.Dereference(input_address, 64))
     return register
-  elif elements == 3:
+  elif elements == 3 or elements == 4:
     register = registers.QuadRegister(min_reg)
     emitter.EmitVLoad('1.32', register, emitter.Dereference(input_address, 64))
     return register
@@ -181,7 +181,7 @@
   if cols == 1 or cols == 2:
     for unused_i in range(0, rows):
       duplicated.append(registers.DoubleRegister(min_register))
-  elif cols == 3:
+  elif cols == 3 or cols == 4:
     for unused_i in range(0, rows):
       duplicated.append(registers.QuadRegister(min_register))
   else:
@@ -199,6 +199,15 @@
         registers.Low(values), 1))
     emitter.EmitVDup('32', duplicated[2], emitter.Lane(
         registers.High(values), 0))
+  elif rows == 4:
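+    # Broadcast each of the four 32-bit lanes of 'values' into its own
+    # duplicated register.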
+    emitter.EmitVDup('32', duplicated[0], emitter.Lane(
+        registers.Low(values), 0))
+    emitter.EmitVDup('32', duplicated[1], emitter.Lane(
+        registers.Low(values), 1))
+    emitter.EmitVDup('32', duplicated[2], emitter.Lane(
+        registers.High(values), 0))
+    emitter.EmitVDup('32', duplicated[3], emitter.Lane(
+        registers.High(values), 1))
 
   return duplicated
 
@@ -207,7 +216,7 @@
                              min_register):
   if cols == 1 or cols == 2:
     duplicated = registers.DoubleRegister(min_register)
-  elif cols == 3:
+  elif cols == 3 or cols == 4:
     duplicated = registers.QuadRegister(min_register)
   else:
     raise ConfigurationError('Unsupported duplicate amount: %d' % cols)
@@ -234,6 +243,14 @@
                       registers.Low(aggregators[row * 3 + 2]),
                       registers.Low(aggregators[row * 3 + 2]))
     return register
+  elif cols == 4:
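+    # Pairwise-add the four per-row aggregators into a single quad register.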
+    register = aggregators[row * 3]
+    emitter.EmitVPadd('u32', registers.Low(register), registers.Low(register),
+                      registers.Low(aggregators[row * 3 + 1]))
+    emitter.EmitVPadd('u32', registers.High(register),
+                      registers.Low(aggregators[row * 3 + 2]),
+                      registers.Low(aggregators[row * 3 + 3]))
+    return register
   else:
     raise ConfigurationError('Unsupported columns no: %d' % cols)
 
@@ -255,6 +272,10 @@
         registers.High(aggregator),
         0), emitter.Dereference(result_address, None), result_stride)
     emitter.EmitNewline()
+  elif cols == 4:
+    emitter.EmitVStoreOffsetA(
+        '1.32', [registers.Low(aggregator), registers.High(aggregator)],
+        emitter.Dereference(result_address, None), result_stride)
   else:
     raise ConfigurationError('Unsupported columns no: %d' % cols)
 
@@ -359,10 +380,10 @@
 def GenerateMulNx8Mx8(emitter, result_type, lhs_add, rhs_add, left_lanes_count,
                       right_lanes_count):
   """Emit the multiply code for given rows and cols counts."""
-  if left_lanes_count < 1 or left_lanes_count > 3:
-    raise ConfigurationError('Left_lanes should be: 1, 2 or 3.')
-  if right_lanes_count < 1 or right_lanes_count > 3:
-    raise ConfigurationError('Right_lanes should be: 1, 2 or 3.')
+  if left_lanes_count < 1 or left_lanes_count > 4:
+    raise ConfigurationError('Left_lanes should be: 1, 2, 3 or 4.')
+  if right_lanes_count < 1 or right_lanes_count > 4:
+    raise ConfigurationError('Right_lanes should be: 1, 2, 3 or 4.')
 
   emitter.EmitFunctionBeginA(
       BuildName(result_type, lhs_add, rhs_add, left_lanes_count,
@@ -378,35 +399,29 @@
 
   size = left_lanes_count * right_lanes_count
 
+  lhs = registers.MapParameter('lhs')
+  rhs = registers.MapParameter('rhs')
+
+  emitter.EmitPld(lhs)
+  emitter.EmitPld(rhs)
+
+  aggregators = GenerateAndClearAggregators(emitter, registers, size)
+
   if size < 9:
-    aggregators = GenerateAndClearAggregators(emitter, registers, size)
-
-    left_lanes = GenerateMulLanes(registers, left_lanes_count,
-                                  registers.MapParameter('lhs'))
-    right_lanes = GenerateMulLanes(registers, right_lanes_count,
-                                   registers.MapParameter('rhs'))
-
-    emitter.EmitPld(left_lanes.input_address)
-    emitter.EmitPld(right_lanes.input_address)
+    left_lanes = GenerateMulLanes(registers, left_lanes_count, lhs)
+    right_lanes = GenerateMulLanes(registers, right_lanes_count, rhs)
 
     GenerateNxMLoadMultiplyAggregate(emitter, registers, left_lanes,
                                      right_lanes, aggregators, count)
 
   else:  # left == 3 and right == 3
-    aggregators = GenerateAndClearAggregators(emitter, registers, size)
     backup_register = registers.QuadRegister()
-    left_lanes = Generate3MulLanes(backup_register, registers,
-                                   registers.MapParameter('lhs'))
-    right_lanes = GenerateMulLanes(registers, right_lanes_count,
-                                   registers.MapParameter('rhs'))
-
-    emitter.EmitPld(left_lanes.input_address)
-    emitter.EmitPld(right_lanes.input_address)
+    left_lanes = Generate3MulLanes(backup_register, registers, lhs)
+    right_lanes = GenerateMulLanes(registers, right_lanes_count, rhs)
 
     Generate3x3LoadMultiplyAggregate(emitter, registers, left_lanes,
                                      right_lanes, aggregators, count,
                                      backup_register)
-
   left_lanes.FreeRegisters(registers)
   right_lanes.FreeRegisters(registers)
 
@@ -426,3 +441,10 @@
       GenerateMulNx8Mx8(emitter, result_type, lhs_add, rhs_add, left_lanes,
                         right_lanes)
       emitter.EmitNewline()
+
+  GenerateMulNx8Mx8(emitter, result_type, lhs_add, rhs_add, 1, 4)
+  emitter.EmitNewline()
+
+
+if __name__ == '__main__':
+  GenerateFunctions(neon_emitter.NeonEmitter(), 'int32', True, True)
diff --git a/meta/generators/neon_emitter.py b/meta/generators/neon_emitter.py
index 79cb76d..726766e 100644
--- a/meta/generators/neon_emitter.py
+++ b/meta/generators/neon_emitter.py
@@ -1,4 +1,17 @@
-"""ARM/NEON assembly emitter.
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""32bit ARM/NEON assembly emitter.
 
 Used by code generators to produce ARM assembly with NEON simd code.
 Provides tools for easier register management: named register variable
@@ -27,30 +40,71 @@
   """Wrong lane number."""
 
 
-def Low(register):
+class ArgumentError(Error):
+  """Wrong argument."""
+
+
+def _Low(register):
   assert register[0] == 'q'
   num = int(register[1:])
   return 'd%d' % (num * 2)
 
 
-def High(register):
+def _High(register):
   assert register[0] == 'q'
   num = int(register[1:])
   return 'd%d' % (num * 2 + 1)
 
 
-class NeonRegisters(object):
-  """Utility that keeps track of used ARM/NEON registers."""
+def _ExpandQuads(registers):
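+  # Flatten a register list, replacing every quad (q) register with its low
+  # and high double (d) halves.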
+  doubles = []
+  for register in registers:
+    if register[0] == 'q':
+      doubles.append(_Low(register))
+      doubles.append(_High(register))
+    else:
+      doubles.append(register)
+  return doubles
+
+
+def _MakeCompatible(op1, op2, op3):
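+  # If any operand is a double (d) register, narrow quad (q) operands to their
+  # low d halves so that all three operands have matching widths.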
+  if op1[0] == 'd' or op2[0] == 'd' or op3[0] == 'd':
+    if op1[0] == 'q':
+      op1 = _Low(op1)
+    if op2[0] == 'q':
+      op2 = _Low(op2)
+    if op3[0] == 'q':
+      op3 = _Low(op3)
+  return (op1, op2, op3)
+
+
+class _NeonRegisters32Bit(object):
+  """Utility that keeps track of used 32bit ARM/NEON registers."""
 
   def __init__(self):
     self.double = set()
     self.double_ever = set()
     self.general = set()
     self.general_ever = set()
-    self.parameters = set()
+    self.parameters = dict()
+    self.output_parameters = dict()
 
-  def MapParameter(self, parameter):
-    self.parameters.add(parameter)
+  def MapParameter(self, parameter, parameter_value=None):
+    if not parameter_value:
+      parameter_value = parameter
+    self.parameters[parameter] = (parameter_value, 'r')
+    return '%%[%s]' % parameter
+
+  def MapMemoryParameter(self, parameter, parameter_value=None):
+    if not parameter_value:
+      parameter_value = parameter
+    self.parameters[parameter] = (parameter_value, 'm')
+    return '%%[%s]' % parameter
+
+  def MapOutputParameter(self, parameter, parameter_value=None):
+    if not parameter_value:
+      parameter_value = parameter
+    self.output_parameters[parameter] = (parameter_value, '+r')
     return '%%[%s]' % parameter
 
   def DoubleRegister(self, min_val=0):
@@ -80,24 +134,23 @@
     raise RegisterAllocationError('Not enough general registers.')
 
   def MappedParameters(self):
-    return [x for x in self.parameters]
+    return [(k, v) for (k, v) in self.parameters.items()]
+
+  def MappedOutputParameters(self):
+    return [(k, v) for (k, v) in self.output_parameters.items()]
 
   def Clobbers(self):
-    return (['r%d' % i
-             for i in self.general_ever] + ['d%d' % i
-                                            for i in self.DoubleClobbers()])
+    return (['r%d' % i for i in self.general_ever] +
+            ['d%d' % i for i in self.DoubleClobbers()])
 
   def DoubleClobbers(self):
     return sorted(self.double_ever)
 
-  def Low(self, register):
-    return Low(register)
-
-  def High(self, register):
-    return High(register)
-
   def FreeRegister(self, register):
     assert len(register) > 1
+    if register[0] not in ['r', 'd', 'q']:
+      return
+
     num = int(register[1:])
 
     if register[0] == 'r':
@@ -114,6 +167,10 @@
     else:
       raise RegisterDeallocationError('Register not allocated: %s' % register)
 
+  def FreeRegisters(self, registers):
+    for register in registers:
+      self.FreeRegister(register)
+
 
 class NeonEmitter(object):
   """Emits ARM/NEON assembly opcodes."""
@@ -123,11 +180,11 @@
     self.indent = ''
     self.debug = debug
 
-  def PushIndent(self):
-    self.indent += '  '
+  def PushIndent(self, delta='  '):
+    self.indent += delta
 
-  def PopIndent(self):
-    self.indent = self.indent[:-2]
+  def PopIndent(self, delta=2):
+    self.indent = self.indent[:-delta]
 
   def EmitIndented(self, what):
     print self.indent + what
@@ -189,10 +246,10 @@
     self.EmitIndented('asm volatile(')
     self.PushIndent()
 
-  def EmitAsmMapping(self, elements, modifier):
+  def EmitAsmMapping(self, elements):
     if elements:
-      self.EmitIndented(': ' + ', '.join(['[%s] "%s"(%s)' % (d, modifier, d)
-                                          for d in elements]))
+      self.EmitIndented(': ' + ', '.join(
+          ['[%s] "%s"(%s)' % (d, v[1], v[0]) for (d, v) in elements]))
     else:
       self.EmitIndented(':')
 
@@ -202,10 +259,10 @@
     else:
       self.EmitIndented(':')
 
-  def EmitAsmEnd(self, outputs, inputs, clobbers):
-    self.EmitAsmMapping(outputs, '+r')
-    self.EmitAsmMapping(inputs, 'r')
-    self.EmitClobbers(clobbers)
+  def EmitAsmEnd(self, registers):
+    self.EmitAsmMapping(registers.MappedOutputParameters())
+    self.EmitAsmMapping(registers.MappedParameters())
+    self.EmitClobbers(registers.Clobbers() + ['cc', 'memory'])
     self.PopIndent()
     self.EmitIndented(');')
 
@@ -227,18 +284,6 @@
     self.PushOp(op)
     self.EmitIndented('"%s %s, %s, %s\\n"' % (op, param1, param2, param3))
 
-  def EmitZip(self, size, param1, param2):
-    self.EmitOp2('vzip.%d' % size, param1, param2)
-
-  def EmitZip8(self, param1, param2):
-    self.EmitZip(8, param1, param2)
-
-  def EmitZip16(self, param1, param2):
-    self.EmitZip(16, param1, param2)
-
-  def EmitZip32(self, param1, param2):
-    self.EmitZip(32, param1, param2)
-
   def EmitAdd(self, destination, source, param):
     self.EmitOp3('add', destination, source, param)
 
@@ -254,15 +299,24 @@
   def EmitMov(self, param1, param2):
     self.EmitOp2('mov', param1, param2)
 
-  def EmitSkip(self, register, skip, stride):
-    self.EmitOp3('add', register, register, '#%d' % (skip * stride))
-
   def EmitBeqBack(self, label):
     self.EmitOp1('beq', '%db' % label)
 
   def EmitBeqFront(self, label):
     self.EmitOp1('beq', '%df' % label)
 
+  def EmitBgtBack(self, label):
+    self.EmitOp1('bgt', '%db' % label)
+
+  def EmitBgtFront(self, label):
+    self.EmitOp1('bgt', '%df' % label)
+
+  def EmitBleBack(self, label):
+    self.EmitOp1('ble', '%db' % label)
+
+  def EmitBleFront(self, label):
+    self.EmitOp1('ble', '%df' % label)
+
   def EmitBneBack(self, label):
     self.EmitOp1('bne', '%db' % label)
 
@@ -270,27 +324,66 @@
     self.EmitOp1('bne', '%df' % label)
 
   def EmitVAdd(self, add_type, destination, source_1, source_2):
+    destination, source_1, source_2 = _MakeCompatible(destination, source_1,
+                                                      source_2)
     self.EmitOp3('vadd.%s' % add_type, destination, source_1, source_2)
 
   def EmitVAddw(self, add_type, destination, source_1, source_2):
     self.EmitOp3('vaddw.%s' % add_type, destination, source_1, source_2)
 
+  def EmitVSub(self, sub_type, destination, source_1, source_2):
+    destination, source_1, source_2 = _MakeCompatible(destination, source_1,
+                                                      source_2)
+    self.EmitOp3('vsub.%s' % sub_type, destination, source_1, source_2)
+
   def EmitVCvt(self, cvt_to, cvt_from, destination, source):
     self.EmitOp2('vcvt.%s.%s' % (cvt_to, cvt_from), destination, source)
 
   def EmitVDup(self, dup_type, destination, source):
     self.EmitOp2('vdup.%s' % dup_type, destination, source)
 
+  def EmitVMax(self, size, destination, source_1, source_2):
+    self.EmitOp3('vmax.%s' % size, destination, source_1, source_2)
+
+  def EmitVMin(self, size, destination, source_1, source_2):
+    self.EmitOp3('vmin.%s' % size, destination, source_1, source_2)
+
   def EmitVMov(self, mov_type, destination, source):
     self.EmitOp2('vmov.%s' % mov_type, destination, source)
 
+  def EmitVMovl(self, mov_type, destination, source):
+    if source[0] == 'q':
+      source = _Low(source)
+    self.EmitOp2('vmovl.%s' % mov_type, destination, source)
+
+  def EmitVMovl2(self, mov_type, destination_1, destination_2, source):
+    self.EmitVMovl(mov_type, destination_2, _High(source))
+    self.EmitVMovl(mov_type, destination_1, _Low(source))
+
   def EmitVQmovn(self, mov_type, destination, source):
+    if destination[0] == 'q':
+      destination = _Low(destination)
     self.EmitOp2('vqmovn.%s' % mov_type, destination, source)
 
+  def EmitVQmovn2(self, mov_type, destination, source_1, source_2):
+    self.EmitVQmovn(mov_type, _Low(destination), source_1)
+    self.EmitVQmovn(mov_type, _High(destination), source_2)
+
   def EmitVQmovun(self, mov_type, destination, source):
+    if destination[0] == 'q':
+      destination = _Low(destination)
     self.EmitOp2('vqmovun.%s' % mov_type, destination, source)
 
+  def EmitVQmovun2(self, mov_type, destination, source_1, source_2):
+    self.EmitVQmovun(mov_type, _Low(destination), source_1)
+    self.EmitVQmovun(mov_type, _High(destination), source_2)
+
   def EmitVMul(self, mul_type, destination, source_1, source_2):
+    destination, source_1, source_2 = _MakeCompatible(destination, source_1,
+                                                      source_2)
+    self.EmitOp3('vmul.%s' % mul_type, destination, source_1, source_2)
+
+  def EmitVMulScalar(self, mul_type, destination, source_1, source_2):
     self.EmitOp3('vmul.%s' % mul_type, destination, source_1, source_2)
 
   def EmitVMull(self, mul_type, destination, source_1, source_2):
@@ -305,40 +398,422 @@
   def EmitVPadal(self, add_type, destination, source):
     self.EmitOp2('vpadal.%s' % add_type, destination, source)
 
-  def EmitVLoad(self, load_type, destination, source):
-    self.EmitOp2('vld%s' % load_type, '{%s}' % destination, '%s' % source)
+  def EmitLdr(self, register, value):
+    self.EmitOp2('ldr', register, value)
 
-  def EmitVLoadA(self, load_type, destinations, source):
-    self.EmitVLoad(load_type, ', '.join(destinations), source)
+  def EmitVLoad(self, load_no, load_type, destination, source):
+    self.EmitVLoadA(load_no, load_type, [destination], source)
+
+  def EmitVLoadA(self, load_no, load_type, destinations, source):
+    self.EmitOp2('vld%d.%d' % (load_no, load_type),
+                 '{%s}' % ', '.join(_ExpandQuads(destinations)), source)
+
+  def EmitVLoadAE(self,
+                  load_type,
+                  elem_count,
+                  destinations,
+                  source,
+                  alignment=None):
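+    """Emit loads for elem_count elements of load_type bits from source.
+
+    Full 64-bit chunks are loaded with vld1.32 into successive destination
+    registers; any remaining bits are loaded with single-lane vld1 loads.
+    """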
+    bits_to_load = load_type * elem_count
+    destinations = _ExpandQuads(destinations)
+    if len(destinations) * 64 < bits_to_load:
+      raise ArgumentError('Too few destinations: %d to load %d bits.' %
+                          (len(destinations), bits_to_load))
+
+    while bits_to_load > 0:
+      if bits_to_load >= 256:
+        self.EmitVLoadA(1, 32, destinations[:4],
+                        self.DereferenceIncrement(source, alignment))
+        bits_to_load -= 256
+        destinations = destinations[4:]
+      elif bits_to_load >= 192:
+        self.EmitVLoadA(1, 32, destinations[:3],
+                        self.DereferenceIncrement(source, alignment))
+        bits_to_load -= 192
+        destinations = destinations[3:]
+      elif bits_to_load >= 128:
+        self.EmitVLoadA(1, 32, destinations[:2],
+                        self.DereferenceIncrement(source, alignment))
+        bits_to_load -= 128
+        destinations = destinations[2:]
+      elif bits_to_load >= 64:
+        self.EmitVLoad(1, 32, destinations[0],
+                       self.DereferenceIncrement(source, alignment))
+        bits_to_load -= 64
+        destinations = destinations[1:]
+      else:
+        destination = destinations[0]
+        if bits_to_load == 56:
+          self.EmitVLoad(1, 32,
+                         self.Lane(32, destination, 0),
+                         self.DereferenceIncrement(source))
+          self.EmitVLoad(1, 16,
+                         self.Lane(16, destination, 2),
+                         self.DereferenceIncrement(source))
+          self.EmitVLoad(1, 8,
+                         self.Lane(8, destination, 6),
+                         self.DereferenceIncrement(source))
+        elif bits_to_load == 48:
+          self.EmitVLoad(1, 32,
+                         self.Lane(32, destination, 0),
+                         self.DereferenceIncrement(source))
+          self.EmitVLoad(1, 16,
+                         self.Lane(16, destination, 2),
+                         self.DereferenceIncrement(source))
+        elif bits_to_load == 40:
+          self.EmitVLoad(1, 32,
+                         self.Lane(32, destination, 0),
+                         self.DereferenceIncrement(source))
+          self.EmitVLoad(1, 8,
+                         self.Lane(8, destination, 4),
+                         self.DereferenceIncrement(source))
+        elif bits_to_load == 32:
+          self.EmitVLoad(1, 32,
+                         self.Lane(32, destination, 0),
+                         self.DereferenceIncrement(source))
+        elif bits_to_load == 24:
+          self.EmitVLoad(1, 16,
+                         self.Lane(16, destination, 0),
+                         self.DereferenceIncrement(source))
+          self.EmitVLoad(1, 8,
+                         self.Lane(8, destination, 2),
+                         self.DereferenceIncrement(source))
+        elif bits_to_load == 16:
+          self.EmitVLoad(1, 16,
+                         self.Lane(16, destination, 0),
+                         self.DereferenceIncrement(source))
+        elif bits_to_load == 8:
+          self.EmitVLoad(1, 8,
+                         self.Lane(8, destination, 0),
+                         self.DereferenceIncrement(source))
+        else:
+          raise ArgumentError('Wrong leftover: %d' % bits_to_load)
+        return
+
+  def EmitVLoadE(self, load_type, count, destination, source, alignment=None):
+    self.EmitVLoadAE(load_type, count, [destination], source, alignment)
+
+  def EmitVLoadAllLanes(self, load_no, load_type, destination, source):
+    destinations = []
+    if destination[0] == 'q':
+      destinations.append(self.AllLanes(_Low(destination)))
+      destinations.append(self.AllLanes(_High(destination)))
+    else:
+      destinations.append(self.AllLanes(destination))
+    self.EmitVLoadA(load_no, load_type, destinations, source)
+
+  def EmitVLoadOffset(self, load_no, load_type, destination, source, offset):
+    self.EmitVLoadOffsetA(load_no, load_type, [destination], source, offset)
+
+  def EmitVLoadOffsetA(self, load_no, load_type, destinations, source, offset):
+    assert len(destinations) <= 4
+    self.EmitOp3('vld%d.%d' % (load_no, load_type),
+                 '{%s}' % ', '.join(_ExpandQuads(destinations)), source, offset)
 
   def EmitPld(self, load_address_register):
     self.EmitOp1('pld', '[%s]' % load_address_register)
 
+  def EmitPldw(self, store_address_register):
+    self.EmitOp1('pldw', '[%s]' % store_address_register)
+
   def EmitPldOffset(self, load_address_register, offset):
     self.EmitOp1('pld', '[%s, %s]' % (load_address_register, offset))
 
-  def EmitInstructionPreload(self, label):
-    self.EmitOp1('pli', label)
+  def EmitPldwOffset(self, store_address_register, offset):
+    self.EmitOp1('pldw', '[%s, %s]' % (store_address_register, offset))
 
   def EmitVShl(self, shift_type, destination, source, shift):
     self.EmitOp3('vshl.%s' % shift_type, destination, source, shift)
 
-  def EmitVStore(self, store_type, source, destination):
-    self.EmitOp2('vst%s' % store_type, '{%s}' % source, destination)
+  def EmitVStore(self, store_no, store_type, source, destination):
+    self.EmitVStoreA(store_no, store_type, [source], destination)
 
-  def EmitVStoreA(self, store_type, sources, destination):
-    self.EmitVStore(store_type, ', '.join(sources), destination)
+  def EmitVStoreA(self, store_no, store_type, sources, destination):
+    self.EmitOp2('vst%d.%d' % (store_no, store_type),
+                 '{%s}' % ', '.join(_ExpandQuads(sources)), destination)
 
-  def EmitVStoreOffset(self, store_type, source, destination, offset):
-    self.EmitOp3('vst%s' % store_type, '{%s}' % source, destination, offset)
+  def EmitVStoreAE(self,
+                   store_type,
+                   elem_count,
+                   sources,
+                   destination,
+                   alignment=None):
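+    """Emit stores for elem_count elements of store_type bits to destination.
+
+    Full 64-bit chunks are stored with vst1.32 from successive source
+    registers; any remaining bits are stored with single-lane vst1 stores.
+    """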
+    bits_to_store = store_type * elem_count
+    sources = _ExpandQuads(sources)
+    if len(sources) * 64 < bits_to_store:
+      raise ArgumentError('Too few sources: %d to store %d bits.' %
+                          (len(sources), bits_to_store))
 
-  def Dereference(self, value, alignment):
+    while bits_to_store > 0:
+      if bits_to_store >= 256:
+        self.EmitVStoreA(1, 32, sources[:4],
+                         self.DereferenceIncrement(destination, alignment))
+        bits_to_store -= 256
+        sources = sources[4:]
+      elif bits_to_store >= 192:
+        self.EmitVStoreA(1, 32, sources[:3],
+                         self.DereferenceIncrement(destination, alignment))
+        bits_to_store -= 192
+        sources = sources[3:]
+      elif bits_to_store >= 128:
+        self.EmitVStoreA(1, 32, sources[:2],
+                         self.DereferenceIncrement(destination, alignment))
+        bits_to_store -= 128
+        sources = sources[2:]
+      elif bits_to_store >= 64:
+        self.EmitVStore(1, 32, sources[0],
+                        self.DereferenceIncrement(destination, alignment))
+        bits_to_store -= 64
+        sources = sources[1:]
+      else:
+        source = sources[0]
+        if bits_to_store == 56:
+          self.EmitVStore(1, 32,
+                          self.Lane(32, source, 0),
+                          self.DereferenceIncrement(destination))
+          self.EmitVStore(1, 16,
+                          self.Lane(16, source, 2),
+                          self.DereferenceIncrement(destination))
+          self.EmitVStore(1, 8,
+                          self.Lane(8, source, 6),
+                          self.DereferenceIncrement(destination))
+        elif bits_to_store == 48:
+          self.EmitVStore(1, 32,
+                          self.Lane(32, source, 0),
+                          self.DereferenceIncrement(destination))
+          self.EmitVStore(1, 16,
+                          self.Lane(16, source, 2),
+                          self.DereferenceIncrement(destination))
+        elif bits_to_store == 40:
+          self.EmitVStore(1, 32,
+                          self.Lane(32, source, 0),
+                          self.DereferenceIncrement(destination))
+          self.EmitVStore(1, 8,
+                          self.Lane(8, source, 4),
+                          self.DereferenceIncrement(destination))
+        elif bits_to_store == 32:
+          self.EmitVStore(1, 32,
+                          self.Lane(32, source, 0),
+                          self.DereferenceIncrement(destination))
+        elif bits_to_store == 24:
+          self.EmitVStore(1, 16,
+                          self.Lane(16, source, 0),
+                          self.DereferenceIncrement(destination))
+          self.EmitVStore(1, 8,
+                          self.Lane(8, source, 2),
+                          self.DereferenceIncrement(destination))
+        elif bits_to_store == 16:
+          self.EmitVStore(1, 16,
+                          self.Lane(16, source, 0),
+                          self.DereferenceIncrement(destination))
+        elif bits_to_store == 8:
+          self.EmitVStore(1, 8,
+                          self.Lane(8, source, 0),
+                          self.DereferenceIncrement(destination))
+        else:
+          raise ArgumentError('Wrong leftover: %d' % bits_to_store)
+        return
+
+  def EmitVStoreE(self, store_type, count, source, destination, alignment=None):
+    self.EmitVStoreAE(store_type, count, [source], destination, alignment)
+
+  def EmitVStoreOffset(self, store_no, store_type, source, destination, offset):
+    self.EmitVStoreOffsetA(store_no, store_type, [source], destination, offset)
+
+  def EmitVStoreOffsetA(self, store_no, store_type, sources, destination,
+                        offset):
+    self.EmitOp3('vst%d.%d' % (store_no, store_type),
+                 '{%s}' % ', '.join(_ExpandQuads(sources)), destination, offset)
+
+  def EmitVStoreOffsetE(self, store_type, count, source, destination, offset):
+    """Emit assembly to store a number elements from the source registers."""
+    if store_type is not 32:
+      raise ArgumentError('Unsupported store_type: %d' % store_type)
+
+    sources = []
+    if source[0] == 'q':
+      sources.append(_Low(source))
+      sources.append(_High(source))
+      if count * store_type > 128:
+        raise ArgumentError('Too many %d-bit elements in a q register: %d' %
+                            (store_type, count))
+    else:
+      sources.append(source)
+      if count * store_type > 64:
+        raise ArgumentError('Too many %d-bit elements in a d register: %d' %
+                            (store_type, count))
+
+    if count == 1:
+      self.EmitVStoreOffset(1, store_type,
+                            self.Lane(store_type, sources[0], 0),
+                            self.Dereference(destination, None), offset)
+    elif count == 2:
+      self.EmitVStoreOffset(1, store_type, sources[0],
+                            self.Dereference(destination, None), offset)
+    elif count == 3:
+      self.EmitVStore(1, store_type, sources[0],
+                      self.DereferenceIncrement(destination, None))
+      self.EmitVStoreOffset(1, store_type,
+                            self.Lane(store_type, sources[1], 0),
+                            self.Dereference(destination, None), offset)
+      self.EmitSub(destination, destination, self.ImmediateConstant(8))
+    elif count == 4:
+      self.EmitVStoreOffsetA(1, store_type, sources,
+                             self.Dereference(destination, None), offset)
+    else:
+      raise ArgumentError('Too many elements: %d' % count)
+
+  def EmitVSumReduce(self, reduce_type, elem_count, reduce_count, destinations,
+                     sources):
+    """Emit assembly for n-fold horizontal sum reduction."""
+    if reduce_type != 'u32':
+      raise ArgumentError('Unsupported reduce: %s' % reduce_type)
+
+    sources = _ExpandQuads(sources)
+
+    destinations = _ExpandQuads(destinations)
+
+    if len(destinations) * 2 < elem_count:
+      raise ArgumentError('Not enough space in destination: %d vs %d' %
+                          (len(destinations) * 2, elem_count))
+
+    if len(sources) * 2 != elem_count * reduce_count:
+      raise ArgumentError('Wrong number of sources: %d vs %d' %
+                          (len(sources) * 2, elem_count * reduce_count))
+
+    if reduce_count <= 1:
+      raise ArgumentError('Unsupported reduce_count: %d' % reduce_count)
+
+    while reduce_count > 1:
+      if len(sources) % 2 == 1:
+        sources.append(sources[-1])
+
+      if reduce_count == 2:
+        for i in range(len(sources) / 2):
+          self.EmitVPadd(reduce_type, destinations[i], sources[2 * i],
+                         sources[2 * i + 1])
+        return
+      else:
+        sources_2 = []
+        for i in range(len(sources) / 2):
+          self.EmitVPadd(reduce_type, sources[2 * i], sources[2 * i],
+                         sources[2 * i + 1])
+          sources_2.append(sources[2 * i])
+        reduce_count /= 2
+        sources = sources_2
+
+  def EmitVUzp(self, uzp_type, operand_1, operand_2):
+    self.EmitOp2('vuzp.%d' % uzp_type, operand_1, operand_2)
+
+  def EmitVTrn(self, trn_type, operand_1, operand_2):
+    self.EmitOp2('vtrn.%d' % trn_type, operand_1, operand_2)
+
+  def EmitColBlockStride(self, cols, stride, new_stride):
+    assert cols in [1, 2, 3, 4, 5, 6, 7, 8]
+    if cols in [5, 6, 7]:
+      self.EmitSub(new_stride, stride, self.ImmediateConstant(4))
+
+  def EmitLoadColBlock(self, unused_registers, load_type, cols, elements, block,
+                       input_address, stride):
+    """Load a block of column major data."""
+    assert cols == len(block)
+    assert load_type == 8
+
+    input_deref = self.Dereference(input_address, None)
+    input_deref_increment = self.DereferenceIncrement(input_address, None)
+
+    if cols == 1:
+      for i in range(elements):
+        self.EmitVLoadOffset(1, 8,
+                             self.Lane(8, block[0], i), input_deref, stride)
+      self.EmitPld(input_address)
+    elif cols == 2:
+      for i in range(elements):
+        self.EmitVLoadOffset(1, 16,
+                             self.Lane(16, block[i / 4], i % 4), input_deref,
+                             stride)
+      self.EmitPld(input_address)
+      self.EmitVUzp(8, block[0], block[1])
+    elif cols == 3:
+      for i in range(elements):
+        self.EmitVLoadOffsetA(3, 8, [self.Lane(8, row, i) for row in block],
+                              input_deref, stride)
+    elif cols == 4:
+      for i in range(elements):
+        self.EmitVLoadOffset(1, 32,
+                             self.Lane(32, block[i % 4], i / 4), input_deref,
+                             stride)
+      self.EmitPld(input_address)
+      self.EmitVTrn(16, block[0], block[2])
+      self.EmitVTrn(16, block[1], block[3])
+      self.EmitVTrn(8, block[0], block[1])
+      self.EmitVTrn(8, block[2], block[3])
+    elif cols == 5:
+      for i in range(elements):
+        self.EmitVLoad(1, 32,
+                       self.Lane(32, block[i % 4], i / 4),
+                       input_deref_increment)
+        self.EmitVLoadOffset(1, 8,
+                             self.Lane(8, block[4], i), input_deref, stride)
+      self.EmitPld(input_address)
+      self.EmitVTrn(16, block[0], block[2])
+      self.EmitVTrn(16, block[1], block[3])
+      self.EmitVTrn(8, block[0], block[1])
+      self.EmitVTrn(8, block[2], block[3])
+    elif cols == 6:
+      for i in range(elements):
+        self.EmitVLoad(1, 32,
+                       self.Lane(32, block[i % 4], i / 4),
+                       input_deref_increment)
+        self.EmitVLoadOffset(1, 16,
+                             self.Lane(16, block[4 + i / 4], i % 4),
+                             input_deref, stride)
+      self.EmitPld(input_address)
+      self.EmitVTrn(16, block[0], block[2])
+      self.EmitVTrn(16, block[1], block[3])
+      self.EmitVUzp(8, block[4], block[5])
+      self.EmitVTrn(8, block[0], block[1])
+      self.EmitVTrn(8, block[2], block[3])
+    elif cols == 7:
+      for i in range(elements):
+        self.EmitVLoad(1, 32,
+                       self.Lane(32, block[i % 4], i / 4),
+                       input_deref_increment)
+        self.EmitVLoadOffsetA(3, 8,
+                              [self.Lane(8, row, i) for row in block[4:]],
+                              input_deref, stride)
+      self.EmitPld(input_address)
+      self.EmitVTrn(16, block[0], block[2])
+      self.EmitVTrn(16, block[1], block[3])
+      self.EmitVTrn(8, block[0], block[1])
+      self.EmitVTrn(8, block[2], block[3])
+    elif cols == 8:
+      for i in range(elements):
+        self.EmitVLoadOffset(1, 32, block[i], input_deref, stride)
+      self.EmitPld(input_address)
+      self.EmitVTrn(8, block[0], block[1])
+      self.EmitVTrn(8, block[2], block[3])
+      self.EmitVTrn(8, block[4], block[5])
+      self.EmitVTrn(8, block[6], block[7])
+      self.EmitVTrn(16, block[0], block[2])
+      self.EmitVTrn(16, block[1], block[3])
+      self.EmitVTrn(16, block[4], block[6])
+      self.EmitVTrn(16, block[5], block[7])
+      self.EmitVTrn(32, block[0], block[4])
+      self.EmitVTrn(32, block[1], block[5])
+      self.EmitVTrn(32, block[2], block[6])
+      self.EmitVTrn(32, block[3], block[7])
+    else:
+      assert False
+    return block
+
+  def Dereference(self, value, alignment=None):
     if alignment:
       return '[%s:%d]' % (value, alignment)
     else:
       return '[%s]' % value
 
-  def DereferenceIncrement(self, value, alignment):
+  def DereferenceIncrement(self, value, alignment=None):
     return '%s!' % self.Dereference(value, alignment)
 
   def ImmediateConstant(self, value):
@@ -347,5 +822,20 @@
   def AllLanes(self, value):
     return '%s[]' % value
 
-  def Lane(self, value, lane):
-    return '%s[%d]' % (value, lane)
+  def Lane(self, bits, value, lane):
+    """Get the proper n-bit lane from the given register."""
+    registers = []
+    if value[0] == 'q':
+      registers.append(_Low(value))
+      registers.append(_High(value))
+    else:
+      registers.append(value)
+
+    elems_per_register = 64 / bits
+    register = lane / elems_per_register
+    lane %= elems_per_register
+
+    return '%s[%d]' % (registers[register], lane)
+
+  def CreateRegisters(self):
+    return _NeonRegisters32Bit()
diff --git a/meta/generators/neon_emitter_64.py b/meta/generators/neon_emitter_64.py
new file mode 100644
index 0000000..13a0715
--- /dev/null
+++ b/meta/generators/neon_emitter_64.py
@@ -0,0 +1,1277 @@
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""64bit ARM/NEON assembly emitter.
+
+Used by code generators to produce ARM assembly with NEON simd code.
+Provides tools for easier register management: named register variable
+allocation/deallocation, and offers a more procedural/structured approach
+to generating assembly.
+
+"""
+
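+# Maps an element type to the type twice as wide; used by the widening move
+# emitters (e.g. EmitVMovl, which expands to uxtl/sxtl). _NARROW_TYPES below
+# is the inverse mapping.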
+_WIDE_TYPES = {
+    8: 16,
+    16: 32,
+    32: 64,
+    '8': '16',
+    '16': '32',
+    '32': '64',
+    'i8': 'i16',
+    'i16': 'i32',
+    'i32': 'i64',
+    'u8': 'u16',
+    'u16': 'u32',
+    'u32': 'u64',
+    's8': 's16',
+    's16': 's32',
+    's32': 's64'
+}
+
+_NARROW_TYPES = {
+    64: 32,
+    32: 16,
+    16: 8,
+    '64': '32',
+    '32': '16',
+    '16': '8',
+    'i64': 'i32',
+    'i32': 'i16',
+    'i16': 'i8',
+    'u64': 'u32',
+    'u32': 'u16',
+    'u16': 'u8',
+    's64': 's32',
+    's32': 's16',
+    's16': 's8'
+}
+
+_TYPE_BITS = {
+    8: 8,
+    16: 16,
+    32: 32,
+    64: 64,
+    '8': 8,
+    '16': 16,
+    '32': 32,
+    '64': 64,
+    'i8': 8,
+    'i16': 16,
+    'i32': 32,
+    'i64': 64,
+    'u8': 8,
+    'u16': 16,
+    'u32': 32,
+    'u64': 64,
+    's8': 8,
+    's16': 16,
+    's32': 32,
+    's64': 64,
+    'f32': 32,
+    'f64': 64,
+    'b': 8,
+    'h': 16,
+    's': 32,
+    'd': 64
+}
+
+
+class Error(Exception):
+  """Module level error."""
+
+
+class RegisterAllocationError(Error):
+  """Cannot alocate registers."""
+
+
+class LaneError(Error):
+  """Wrong lane number."""
+
+
+class RegisterSubtypeError(Error):
+  """The register needs to be lane-typed."""
+
+
+class ArgumentError(Error):
+  """Wrong argument."""
+
+
+def _AppendType(type_name, register):
+  """Calculates sizes and attaches the type information to the register."""
+  if register.register_type != 'v':
+    raise ArgumentError('Only vector registers can have type appended.')
+
+  if type_name in set([8, '8', 'i8', 's8', 'u8']):
+    subtype = 'b'
+    subtype_bits = 8
+  elif type_name in set([16, '16', 'i16', 's16', 'u16']):
+    subtype = 'h'
+    subtype_bits = 16
+  elif type_name in set([32, '32', 'i32', 's32', 'u32', 'f32']):
+    subtype = 's'
+    subtype_bits = 32
+  elif type_name in set([64, '64', 'i64', 's64', 'u64', 'f64']):
+    subtype = 'd'
+    subtype_bits = 64
+  else:
+    raise ArgumentError('Unknown type: %s' % type_name)
+
+  new_register = register.Copy()
+  new_register.register_subtype = subtype
+  new_register.register_subtype_count = register.register_bits / subtype_bits
+  return new_register
+
+
+def _UnsignedType(type_name):
+  return type_name in set(['u8', 'u16', 'u32', 'u64'])
+
+
+def _FloatType(type_name):
+  return type_name in set(['f32', 'f64'])
+
+
+def _WideType(type_name):
+  if type_name in _WIDE_TYPES.keys():
+    return _WIDE_TYPES[type_name]
+  else:
+    raise ArgumentError('No wide type for: %s' % type_name)
+
+
+def _NarrowType(type_name):
+  if type_name in _NARROW_TYPES.keys():
+    return _NARROW_TYPES[type_name]
+  else:
+    raise ArgumentError('No narrow type for: %s' % type_name)
+
+
+def _LoadStoreSize(register):
+  if register.lane is None:
+    return register.register_bits
+  else:
+    return register.lane_bits
+
+
+def _MakeCompatibleDown(reg_1, reg_2, reg_3):
+  bits = min([reg_1.register_bits, reg_2.register_bits, reg_3.register_bits])
+  return (_Cast(bits, reg_1), _Cast(bits, reg_2), _Cast(bits, reg_3))
+
+
+def _MakeCompatibleUp(reg_1, reg_2, reg_3):
+  bits = max([reg_1.register_bits, reg_2.register_bits, reg_3.register_bits])
+  return (_Cast(bits, reg_1), _Cast(bits, reg_2), _Cast(bits, reg_3))
+
+
+def _Cast(bits, reg):
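+  # Return the register itself if it already has the requested width,
+  # otherwise a copy reinterpreted at 'bits' bits.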
+  if reg.register_bits == bits:
+    return reg
+  else:
+    new_reg = reg.Copy()
+    new_reg.register_bits = bits
+    return new_reg
+
+
+def _TypeBits(type_name):
+  if type_name in _TYPE_BITS.keys():
+    return _TYPE_BITS[type_name]
+  else:
+    raise ArgumentError('Unknown type: %s' % type_name)
+
+
+def _RegisterList(list_type, registers):
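+  """Format a register-list operand.
+
+  Renders e.g. '{v0.4s, v1.4s}', or '{v0.s, v1.s}[2]' when the registers are
+  lane-indexed.
+  """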
+  lanes = list(set([register.lane for register in registers]))
+  if len(lanes) > 1:
+    raise ArgumentError('Cannot mix lanes on a register list.')
+  typed_registers = [_AppendType(list_type, register) for register in registers]
+
+  if lanes[0] is None:
+    return '{%s}' % ', '.join(map(str, typed_registers))
+  elif lanes[0] == -1:
+    raise ArgumentError('Cannot construct a list with all lane indexing.')
+  else:
+    typed_registers_nolane = [register.Copy() for register in typed_registers]
+    for register in typed_registers_nolane:
+      register.lane = None
+      register.register_subtype_count = None
+    return '{%s}[%d]' % (', '.join(map(str, typed_registers_nolane)), lanes[0])
+
+
+class _GeneralRegister(object):
+  """Arm v8 general register: (x|w)n."""
+
+  def __init__(self,
+               register_bits,
+               number,
+               dereference=False,
+               dereference_increment=False):
+    self.register_type = 'r'
+    self.register_bits = register_bits
+    self.number = number
+    self.dereference = dereference
+    self.dereference_increment = dereference_increment
+
+  def Copy(self):
+    return _GeneralRegister(self.register_bits, self.number, self.dereference,
+                            self.dereference_increment)
+
+  def __repr__(self):
+    if self.register_bits == 64:
+      text = 'x%d' % self.number
+    elif self.register_bits <= 32:
+      text = 'w%d' % self.number
+    else:
+      raise RegisterSubtypeError('Wrong bits (%d) for general register: %d' %
+                                 (self.register_bits, self.number))
+    if self.dereference:
+      return '[%s]' % text
+    else:
+      return text
+
+
+class _MappedParameter(object):
+  """Object representing a C variable mapped to a register."""
+
+  def __init__(self,
+               name,
+               register_bits=64,
+               dereference=False,
+               dereference_increment=False):
+    self.name = name
+    self.register_bits = register_bits
+    self.dereference = dereference
+    self.dereference_increment = dereference_increment
+
+  def Copy(self):
+    return _MappedParameter(self.name, self.register_bits, self.dereference,
+                            self.dereference_increment)
+
+  def __repr__(self):
+    if self.register_bits is None:
+      text = '%%[%s]' % self.name
+    elif self.register_bits == 64:
+      text = '%%x[%s]' % self.name
+    elif self.register_bits <= 32:
+      text = '%%w[%s]' % self.name
+    else:
+      raise RegisterSubtypeError('Wrong bits (%d) for mapped parameter: %s' %
+                                 (self.register_bits, self.name))
+    if self.dereference:
+      return '[%s]' % text
+    else:
+      return text
+
+
+class _VectorRegister(object):
+  """Arm v8 vector register Vn.TT."""
+
+  def __init__(self,
+               register_bits,
+               number,
+               register_subtype=None,
+               register_subtype_count=None,
+               lane=None,
+               lane_bits=None):
+    self.register_type = 'v'
+    self.register_bits = register_bits
+    self.number = number
+    self.register_subtype = register_subtype
+    self.register_subtype_count = register_subtype_count
+    self.lane = lane
+    self.lane_bits = lane_bits
+
+  def Copy(self):
+    return _VectorRegister(self.register_bits, self.number,
+                           self.register_subtype, self.register_subtype_count,
+                           self.lane, self.lane_bits)
+
+  def __repr__(self):
+    if self.register_subtype is None:
+      raise RegisterSubtypeError('Register: %s%d has no lane types defined.' %
+                                 (self.register_type, self.number))
+    if (self.register_subtype_count is None or (self.lane is not None and
+                                                self.lane != -1)):
+      typed_name = '%s%d.%s' % (self.register_type, self.number,
+                                self.register_subtype)
+    else:
+      typed_name = '%s%d.%d%s' % (self.register_type, self.number,
+                                  self.register_subtype_count,
+                                  self.register_subtype)
+
+    if self.lane is None or self.lane == -1:
+      return typed_name
+    elif self.lane >= 0 and self.lane < self.register_subtype_count:
+      return '%s[%d]' % (typed_name, self.lane)
+    else:
+      raise LaneError('Wrong lane: %d for: %s' % (self.lane, typed_name))
+
+
+class _ImmediateConstant(object):
+
+  def __init__(self, value):
+    self.register_type = 'i'
+    self.value = value
+
+  def Copy(self):
+    return _ImmediateConstant(self.value)
+
+  def __repr__(self):
+    return '#%d' % self.value
+
+
+class _NeonRegisters64Bit(object):
+  """Utility that keeps track of used 32bit ARM/NEON registers."""
+
+  def __init__(self):
+    self.vector = set()
+    self.vector_ever = set()
+    self.general = set()
+    self.general_ever = set()
+    self.parameters = dict()
+    self.output_parameters = dict()
+
+  def MapParameter(self, parameter, parameter_value=None):
+    if not parameter_value:
+      parameter_value = parameter
+    self.parameters[parameter] = (parameter_value, 'r')
+    return _MappedParameter(parameter)
+
+  def MapMemoryParameter(self, parameter, parameter_value=None):
+    if not parameter_value:
+      parameter_value = parameter
+    self.parameters[parameter] = (parameter_value, 'm')
+    return _MappedParameter(parameter)
+
+  def MapOutputParameter(self, parameter, parameter_value=None):
+    if not parameter_value:
+      parameter_value = parameter
+    self.output_parameters[parameter] = (parameter_value, '+r')
+    return _MappedParameter(parameter)
+
+  def _VectorRegisterNum(self, min_val=0):
+    for i in range(min_val, 32):
+      if i not in self.vector:
+        self.vector.add(i)
+        self.vector_ever.add(i)
+        return i
+    raise RegisterAllocationError('Not enough vector registers.')
+
+  def DoubleRegister(self, min_val=0):
+    return _VectorRegister(64, self._VectorRegisterNum(min_val))
+
+  def QuadRegister(self, min_val=0):
+    return _VectorRegister(128, self._VectorRegisterNum(min_val))
+
+  def GeneralRegister(self):
+    for i in range(0, 30):
+      if i not in self.general:
+        self.general.add(i)
+        self.general_ever.add(i)
+        return _GeneralRegister(64, i)
+    raise RegisterAllocationError('Not enough general registers.')
+
+  def MappedParameters(self):
+    return [x for x in self.parameters.items()]
+
+  def MappedOutputParameters(self):
+    return [x for x in self.output_parameters.items()]
+
+  def Clobbers(self):
+    return (
+        ['x%d' % i
+         for i in self.general_ever] + ['v%d' % i for i in self.vector_ever])
+
+  def FreeRegister(self, register):
+    if isinstance(register, _MappedParameter):
+      return
+
+    if register.register_type == 'v':
+      assert register.number in self.vector
+      self.vector.remove(register.number)
+    elif register.register_type == 'r':
+      assert register.number in self.general
+      self.general.remove(register.number)
+    else:
+      raise RegisterAllocationError('Register not allocated: %s%d' %
+                                    (register.register_type, register.number))
+
+  def FreeRegisters(self, registers):
+    for register in registers:
+      self.FreeRegister(register)
+
+
+class NeonEmitter64(object):
+  """Emits ARM/NEON 64bit assembly opcodes."""
+
+  def __init__(self, debug=False):
+    self.ops = {}
+    self.indent = ''
+    self.debug = debug
+
+  def PushIndent(self, delta_indent='  '):
+    self.indent += delta_indent
+
+  def PopIndent(self, delta=2):
+    self.indent = self.indent[:-delta]
+
+  def EmitIndented(self, what):
+    print self.indent + what
+
+  def PushOp(self, op):
+    if op in self.ops.keys():
+      self.ops[op] += 1
+    else:
+      self.ops[op] = 1
+
+  def ClearCounters(self):
+    self.ops.clear()
+
+  def EmitNewline(self):
+    print ''
+
+  def EmitPreprocessor1(self, op, param):
+    print '#%s %s' % (op, param)
+
+  def EmitPreprocessor(self, op):
+    print '#%s' % op
+
+  def EmitInclude(self, include):
+    self.EmitPreprocessor1('include', include)
+
+  def EmitCall1(self, function, param):
+    self.EmitIndented('%s(%s);' % (function, param))
+
+  def EmitAssert(self, assert_expression):
+    if self.debug:
+      self.EmitCall1('assert', assert_expression)
+
+  def EmitHeaderBegin(self, header_name, includes):
+    self.EmitPreprocessor1('ifndef', (header_name + '_H_').upper())
+    self.EmitPreprocessor1('define', (header_name + '_H_').upper())
+    self.EmitNewline()
+    if includes:
+      for include in includes:
+        self.EmitInclude(include)
+      self.EmitNewline()
+
+  def EmitHeaderEnd(self):
+    self.EmitPreprocessor('endif')
+
+  def EmitCode(self, code):
+    self.EmitIndented('%s;' % code)
+
+  def EmitFunctionBeginA(self, function_name, params, return_type):
+    self.EmitIndented('%s %s(%s) {' %
+                      (return_type, function_name,
+                       ', '.join(['%s %s' % (t, n) for (t, n) in params])))
+    self.PushIndent()
+
+  def EmitFunctionEnd(self):
+    self.PopIndent()
+    self.EmitIndented('}')
+
+  def EmitAsmBegin(self):
+    self.EmitIndented('asm volatile(')
+    self.PushIndent()
+
+  def EmitAsmMapping(self, elements):
+    if elements:
+      self.EmitIndented(': ' + ', '.join(
+          ['[%s] "%s"(%s)' % (k, v[1], v[0]) for (k, v) in elements]))
+    else:
+      self.EmitIndented(':')
+
+  def EmitClobbers(self, elements):
+    if elements:
+      self.EmitIndented(': ' + ', '.join(['"%s"' % c for c in elements]))
+    else:
+      self.EmitIndented(':')
+
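+  # Closes the asm volatile(...) statement: output operands first, then input
+  # operands, then the clobber list, in GCC extended-asm order.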
+  def EmitAsmEnd(self, registers):
+    self.EmitAsmMapping(registers.MappedOutputParameters())
+    self.EmitAsmMapping(registers.MappedParameters())
+    self.EmitClobbers(registers.Clobbers() + ['cc', 'memory'])
+    self.PopIndent()
+    self.EmitIndented(');')
+
+  def EmitComment(self, comment):
+    self.EmitIndented('// ' + comment)
+
+  def EmitNumericalLabel(self, label):
+    self.EmitIndented('"%d:"' % label)
+
+  def EmitOp1(self, op, param1):
+    self.PushOp(op)
+    self.EmitIndented('"%s %s\\n"' % (op, param1))
+
+  def EmitOp2(self, op, param1, param2):
+    self.PushOp(op)
+    self.EmitIndented('"%s %s, %s\\n"' % (op, param1, param2))
+
+  def EmitOp3(self, op, param1, param2, param3):
+    self.PushOp(op)
+    self.EmitIndented('"%s %s, %s, %s\\n"' % (op, param1, param2, param3))
+
+  def EmitAdd(self, destination, source, param):
+    self.EmitOp3('add', destination, source, param)
+
+  def EmitSubs(self, destination, source, param):
+    self.EmitOp3('subs', destination, source, param)
+
+  def EmitSub(self, destination, source, param):
+    self.EmitOp3('sub', destination, source, param)
+
+  def EmitMul(self, destination, source, param):
+    self.EmitOp3('mul', destination, source, param)
+
+  def EmitMov(self, param1, param2):
+    self.EmitOp2('mov', param1, param2)
+
+  def EmitVMovl(self, mov_type, destination, source):
+    wide_type = _WideType(mov_type)
+    destination = _AppendType(wide_type, destination)
+    source = _AppendType(mov_type, _Cast(source.register_bits / 2, source))
+    if _UnsignedType(mov_type):
+      self.EmitOp2('uxtl', destination, source)
+    else:
+      self.EmitOp2('sxtl', destination, source)
+
+  def EmitVMovl2(self, mov_type, destination_1, destination_2, source):
+    wide_type = _WideType(mov_type)
+    if (destination_1.register_bits != source.register_bits or
+        destination_2.register_bits != source.register_bits):
+      raise ArgumentError('Register sizes do not match.')
+    if _UnsignedType(mov_type):
+      self.EmitOp2('uxtl2',
+                   _AppendType(wide_type, destination_2),
+                   _AppendType(mov_type, source))
+      self.EmitOp2('uxtl',
+                   _AppendType(wide_type, destination_1),
+                   _AppendType(mov_type,
+                               _Cast(source.register_bits / 2, source)))
+    else:
+      self.EmitOp2('sxtl2',
+                   _AppendType(wide_type, destination_2),
+                   _AppendType(mov_type, source))
+      self.EmitOp2('sxtl',
+                   _AppendType(wide_type, destination_1),
+                   _AppendType(mov_type,
+                               _Cast(source.register_bits / 2, source)))
+
+  def EmitVMax(self, max_type, destination, source_1, source_2):
+    if _UnsignedType(max_type):
+      self.EmitOp3('umax',
+                   _AppendType(max_type, destination),
+                   _AppendType(max_type, source_1),
+                   _AppendType(max_type, source_2))
+    else:
+      self.EmitOp3('smax',
+                   _AppendType(max_type, destination),
+                   _AppendType(max_type, source_1),
+                   _AppendType(max_type, source_2))
+
+  def EmitVMin(self, min_type, destination, source_1, source_2):
+    if _UnsignedType(min_type):
+      self.EmitOp3('umin',
+                   _AppendType(min_type, destination),
+                   _AppendType(min_type, source_1),
+                   _AppendType(min_type, source_2))
+    else:
+      self.EmitOp3('smin',
+                   _AppendType(min_type, destination),
+                   _AppendType(min_type, source_1),
+                   _AppendType(min_type, source_2))
+
+  def EmitBeqBack(self, label):
+    self.EmitOp1('beq', '%db' % label)
+
+  def EmitBeqFront(self, label):
+    self.EmitOp1('beq', '%df' % label)
+
+  def EmitBgtBack(self, label):
+    self.EmitOp1('bgt', '%db' % label)
+
+  def EmitBgtFront(self, label):
+    self.EmitOp1('bgt', '%df' % label)
+
+  def EmitBleBack(self, label):
+    self.EmitOp1('ble', '%db' % label)
+
+  def EmitBleFront(self, label):
+    self.EmitOp1('ble', '%df' % label)
+
+  def EmitBneBack(self, label):
+    self.EmitOp1('bne', '%db' % label)
+
+  def EmitBneFront(self, label):
+    self.EmitOp1('bne', '%df' % label)
+
+  def EmitVAdd(self, add_type, destination, source_1, source_2):
+    destination, source_1, source_2 = _MakeCompatibleDown(destination, source_1,
+                                                          source_2)
+    if _FloatType(add_type):
+      self.EmitOp3('fadd',
+                   _AppendType(add_type, destination),
+                   _AppendType(add_type, source_1),
+                   _AppendType(add_type, source_2))
+    else:
+      self.EmitOp3('add',
+                   _AppendType(add_type, destination),
+                   _AppendType(add_type, source_1),
+                   _AppendType(add_type, source_2))
+
+  def EmitVAddw(self, add_type, destination, source_1, source_2):
+    wide_type = _WideType(add_type)
+    destination = _AppendType(wide_type, destination)
+    source_1 = _AppendType(wide_type, source_1)
+    source_2 = _AppendType(add_type, source_2)
+    if _UnsignedType(add_type):
+      self.EmitOp3('uaddw', destination, source_1, source_2)
+    else:
+      self.EmitOp3('saddw', destination, source_1, source_2)
+
+  def EmitVSub(self, sub_type, destination, source_1, source_2):
+    destination, source_1, source_2 = _MakeCompatibleDown(destination, source_1,
+                                                          source_2)
+    if _FloatType(sub_type):
+      self.EmitOp3('fsub',
+                   _AppendType(sub_type, destination),
+                   _AppendType(sub_type, source_1),
+                   _AppendType(sub_type, source_2))
+    else:
+      self.EmitOp3('sub',
+                   _AppendType(sub_type, destination),
+                   _AppendType(sub_type, source_1),
+                   _AppendType(sub_type, source_2))
+
+  def EmitVCvt(self, cvt_to, cvt_from, destination, source):
+    if cvt_to == 'f32' and cvt_from == 's32':
+      self.EmitOp2('scvtf',
+                   _AppendType('f32', destination), _AppendType('s32', source))
+    elif cvt_to == 'f32' and cvt_from == 'u32':
+      self.EmitOp2('ucvtf',
+                   _AppendType('f32', destination), _AppendType('u32', source))
+    elif cvt_to == 's32' and cvt_from == 'f32':
+      self.EmitOp2('fcvtzs',
+                   _AppendType('s32', destination), _AppendType('f32', source))
+    else:
+      raise ArgumentError('Convert not supported, to: %s from: %s' % (cvt_to,
+                                                                      cvt_from))
+
+  def EmitVDup(self, dup_type, destination, source):
+    if (isinstance(source, _GeneralRegister) or
+        isinstance(source, _MappedParameter)):
+      self.EmitOp2('dup',
+                   _AppendType(dup_type, destination),
+                   _Cast(_TypeBits(dup_type), source))
+    else:
+      self.EmitOp2('dup',
+                   _AppendType(dup_type, destination),
+                   _AppendType(dup_type, source))
+
+  def EmitVMov(self, mov_type, destination, source):
+    if isinstance(source, _ImmediateConstant):
+      self.EmitOp2('movi', _AppendType(mov_type, destination), source)
+    elif (isinstance(source, _GeneralRegister) or
+          isinstance(source, _MappedParameter)):
+      self.EmitOp2('mov',
+                   _AppendType(mov_type, destination),
+                   _Cast(_TypeBits(mov_type), source))
+    else:
+      self.EmitOp2('mov', _AppendType(8, destination), _AppendType(8, source))
+
+  def EmitVQmovn(self, mov_type, destination, source):
+    narrow_type = _NarrowType(mov_type)
+    if destination.register_bits * 2 == source.register_bits:
+      self.EmitOp2('sqxtn',
+                   _AppendType(narrow_type, destination),
+                   _AppendType(mov_type, source))
+    elif destination.register_bits == source.register_bits:
+      self.EmitOp2('sqxtn',
+                   _AppendType(narrow_type,
+                               _Cast(destination.register_bits / 2,
+                                     destination)),
+                   _AppendType(mov_type, source))
+
+  def EmitVQmovn2(self, mov_type, destination, source_1, source_2):
+    narrow_type = _NarrowType(mov_type)
+    if (destination.register_bits != source_1.register_bits or
+        destination.register_bits != source_2.register_bits):
+      raise ArgumentError('Register sizes do not match.')
+    self.EmitOp2('sqxtn',
+                 _AppendType(narrow_type,
+                             _Cast(destination.register_bits / 2, destination)),
+                 _AppendType(mov_type, source_1))
+    self.EmitOp2('sqxtn2',
+                 _AppendType(narrow_type, destination),
+                 _AppendType(mov_type, source_2))
+
+  def EmitVQmovun(self, mov_type, destination, source):
+    narrow_type = _NarrowType(mov_type)
+    if destination.register_bits * 2 == source.register_bits:
+      self.EmitOp2('sqxtun',
+                   _AppendType(narrow_type, destination),
+                   _AppendType(mov_type, source))
+    elif destination.register_bits == source.register_bits:
+      self.EmitOp2('sqxtun',
+                   _AppendType(narrow_type,
+                               _Cast(destination.register_bits / 2,
+                                     destination)),
+                   _AppendType(mov_type, source))
+
+  def EmitVQmovun2(self, mov_type, destination, source_1, source_2):
+    narrow_type = _NarrowType(mov_type)
+    if (destination.register_bits != source_1.register_bits or
+        destination.register_bits != source_2.register_bits):
+      raise ArgumentError('Register sizes do not match.')
+    self.EmitOp2('sqxtun',
+                 _AppendType(narrow_type,
+                             _Cast(destination.register_bits / 2, destination)),
+                 _AppendType(mov_type, source_1))
+    self.EmitOp2('sqxtun2',
+                 _AppendType(narrow_type, destination),
+                 _AppendType(mov_type, source_2))
+
+  def EmitVMul(self, mul_type, destination, source_1, source_2):
+    destination, source_1, source_2 = _MakeCompatibleDown(destination, source_1,
+                                                          source_2)
+    if _FloatType(mul_type):
+      self.EmitOp3('fmul',
+                   _AppendType(mul_type, destination),
+                   _AppendType(mul_type, source_1),
+                   _AppendType(mul_type, source_2))
+    else:
+      self.EmitOp3('mul',
+                   _AppendType(mul_type, destination),
+                   _AppendType(mul_type, source_1),
+                   _AppendType(mul_type, source_2))
+
+  def EmitVMulScalar(self, mul_type, destination, source_1, source_2):
+    self.EmitOp3('mul',
+                 _AppendType(mul_type, destination),
+                 _AppendType(mul_type, source_1),
+                 _AppendType(mul_type, source_2))
+
+  def EmitVMull(self, mul_type, destination, source_1, source_2):
+    wide_type = _WideType(mul_type)
+    if _UnsignedType(mul_type):
+      self.EmitOp3('umull',
+                   _AppendType(wide_type, destination),
+                   _AppendType(mul_type, source_1),
+                   _AppendType(mul_type, source_2))
+    else:
+      self.EmitOp3('smull',
+                   _AppendType(wide_type, destination),
+                   _AppendType(mul_type, source_1),
+                   _AppendType(mul_type, source_2))
+
+  def EmitVPadd(self, add_type, destination, source_1, source_2):
+    self.EmitOp3('addp',
+                 _AppendType(add_type, destination),
+                 _AppendType(add_type, source_1),
+                 _AppendType(add_type, source_2))
+
+  def EmitVPaddl(self, add_type, destination, source):
+    wide_type = _WideType(add_type)
+    if _UnsignedType(add_type):
+      self.EmitOp2('uaddlp',
+                   _AppendType(wide_type, destination),
+                   _AppendType(add_type, source))
+    else:
+      self.EmitOp2('saddlp',
+                   _AppendType(wide_type, destination),
+                   _AppendType(add_type, source))
+
+  def EmitVPadal(self, add_type, destination, source):
+    wide_type = _WideType(add_type)
+    if _UnsignedType(add_type):
+      self.EmitOp2('uadalp',
+                   _AppendType(wide_type, destination),
+                   _AppendType(add_type, source))
+    else:
+      self.EmitOp2('sadalp',
+                   _AppendType(wide_type, destination),
+                   _AppendType(add_type, source))
+
+  def EmitLdr(self, register, value):
+    self.EmitOp2('ldr', _Cast(32, register), _Cast(None, value))
+
+  def EmitVLoad(self, load_no, load_type, destination, source):
+    self.EmitVLoadA(load_no, load_type, [destination], source)
+
+  def EmitVLoadA(self, load_no, load_type, destinations, source):
+    if source.dereference_increment:
+      increment = sum(
+          [_LoadStoreSize(destination) for destination in destinations]) / 8
+      self.EmitVLoadAPostIncrement(load_no, load_type, destinations, source,
+                                   self.ImmediateConstant(increment))
+    else:
+      self.EmitVLoadAPostIncrement(load_no, load_type, destinations, source,
+                                   None)
+
+  def EmitVLoadAPostIncrement(self, load_no, load_type, destinations, source,
+                              increment):
+    """Generate assembly to load memory to registers and increment source."""
+    if len(destinations) == 1 and destinations[0].lane is -1:
+      destination = '{%s}' % _AppendType(load_type, destinations[0])
+      if increment:
+        self.EmitOp3('ld%dr' % load_no, destination, source, increment)
+      else:
+        self.EmitOp2('ld%dr' % load_no, destination, source)
+      return
+
+    destination_list = _RegisterList(load_type, destinations)
+    if increment:
+      self.EmitOp3('ld%d' % load_no, destination_list, source, increment)
+    else:
+      self.EmitOp2('ld%d' % load_no, destination_list, source)
+
+  def EmitVLoadAE(self,
+                  load_type,
+                  elem_count,
+                  destinations,
+                  source,
+                  alignment=None):
+    """Generate assembly to load an array of elements of given size."""
+    bits_to_load = load_type * elem_count
+    min_bits = min([destination.register_bits for destination in destinations])
+    max_bits = max([destination.register_bits for destination in destinations])
+
+    if min_bits is not max_bits:
+      raise ArgumentError('Cannot mix double and quad loads.')
+
+    if len(destinations) * min_bits < bits_to_load:
+      raise ArgumentError('Too few destinations: %d to load %d bits.' %
+                          (len(destinations), bits_to_load))
+
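+    # Consume whole registers in groups of four, three, two or one; any
+    # remainder smaller than one register is loaded 64, 32, 16 or 8 bits at a
+    # time into successive lanes of the first remaining destination.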
+    leftover_loaded = 0
+    while bits_to_load > 0:
+      if bits_to_load >= 4 * min_bits:
+        self.EmitVLoadA(1, 32, destinations[:4],
+                        self.DereferenceIncrement(source, alignment))
+        bits_to_load -= 4 * min_bits
+        destinations = destinations[4:]
+      elif bits_to_load >= 3 * min_bits:
+        self.EmitVLoadA(1, 32, destinations[:3],
+                        self.DereferenceIncrement(source, alignment))
+        bits_to_load -= 3 * min_bits
+        destinations = destinations[3:]
+      elif bits_to_load >= 2 * min_bits:
+        self.EmitVLoadA(1, 32, destinations[:2],
+                        self.DereferenceIncrement(source, alignment))
+        bits_to_load -= 2 * min_bits
+        destinations = destinations[2:]
+      elif bits_to_load >= min_bits:
+        self.EmitVLoad(1, 32, destinations[0],
+                       self.DereferenceIncrement(source, alignment))
+        bits_to_load -= min_bits
+        destinations = destinations[1:]
+      elif bits_to_load >= 64:
+        self.EmitVLoad(1, 32,
+                       _Cast(64, destinations[0]),
+                       self.DereferenceIncrement(source))
+        bits_to_load -= 64
+        leftover_loaded += 64
+      elif bits_to_load >= 32:
+        self.EmitVLoad(1, 32,
+                       self.Lane(32, destinations[0], leftover_loaded / 32),
+                       self.DereferenceIncrement(source))
+        bits_to_load -= 32
+        leftover_loaded += 32
+      elif bits_to_load >= 16:
+        self.EmitVLoad(1, 16,
+                       self.Lane(16, destinations[0], leftover_loaded / 16),
+                       self.DereferenceIncrement(source))
+        bits_to_load -= 16
+        leftover_loaded += 16
+      elif bits_to_load is 8:
+        self.EmitVLoad(1, 8,
+                       self.Lane(8, destinations[0], leftover_loaded / 8),
+                       self.DereferenceIncrement(source))
+        bits_to_load -= 8
+        leftover_loaded += 8
+      else:
+        raise ArgumentError('Wrong leftover: %d' % bits_to_load)
+
+  def EmitVLoadE(self, load_type, count, destination, source, alignment=None):
+    self.EmitVLoadAE(load_type, count, [destination], source, alignment)
+
+  def EmitVLoadAllLanes(self, load_no, load_type, destination, source):
+    new_destination = destination.Copy()
+    new_destination.lane = -1
+    new_destination.lane_bits = load_type
+    self.EmitVLoad(load_no, load_type, new_destination, source)
+
+  def EmitVLoadOffset(self, load_no, load_type, destination, source, offset):
+    self.EmitVLoadOffsetA(load_no, load_type, [destination], source, offset)
+
+  def EmitVLoadOffsetA(self, load_no, load_type, destinations, source, offset):
+    assert len(destinations) <= 4
+    self.EmitOp3('ld%d' % load_no,
+                 _RegisterList(load_type, destinations), source, offset)
+
+  def EmitPld(self, load_address_register):
+    self.EmitOp2('prfm', 'pldl1keep', '[%s]' % load_address_register)
+
+  def EmitPldOffset(self, load_address_register, offset):
+    self.EmitOp2('prfm', 'pldl1keep',
+                 '[%s, %s]' % (load_address_register, offset))
+
+  def EmitVShl(self, shift_type, destination, source, shift):
+    self.EmitOp3('sshl',
+                 _AppendType(shift_type, destination),
+                 _AppendType(shift_type, source), _AppendType('i32', shift))
+
+  def EmitVStore(self, store_no, store_type, source, destination):
+    self.EmitVStoreA(store_no, store_type, [source], destination)
+
+  def EmitVStoreA(self, store_no, store_type, sources, destination):
+    if destination.dereference_increment:
+      increment = sum([_LoadStoreSize(source) for source in sources]) / 8
+      self.EmitVStoreAPostIncrement(store_no, store_type, sources, destination,
+                                    self.ImmediateConstant(increment))
+    else:
+      self.EmitVStoreAPostIncrement(store_no, store_type, sources, destination,
+                                    None)
+
+  def EmitVStoreAPostIncrement(self, store_no, store_type, sources, destination,
+                               increment):
+    source_list = _RegisterList(store_type, sources)
+    if increment:
+      self.EmitOp3('st%d' % store_no, source_list, destination, increment)
+    else:
+      self.EmitOp2('st%d' % store_no, source_list, destination)
+
+  def EmitVStoreAE(self,
+                   store_type,
+                   elem_count,
+                   sources,
+                   destination,
+                   alignment=None):
+    """Generate assembly to store an array of elements of given size."""
+    bits_to_store = store_type * elem_count
+    min_bits = min([source.register_bits for source in sources])
+    max_bits = max([source.register_bits for source in sources])
+
+    if min_bits is not max_bits:
+      raise ArgumentError('Cannot mix double and quad stores.')
+
+    if len(sources) * min_bits < bits_to_store:
+      raise ArgumentError('Too few sources: %d to store %d bits.' %
+                          (len(sources), bits_to_store))
+
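+    # Mirrors EmitVLoadAE: store whole registers in the largest groups first,
+    # then store any sub-register remainder lane by lane.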
+    leftover_stored = 0
+    while bits_to_store > 0:
+      if bits_to_store >= 4 * min_bits:
+        self.EmitVStoreA(1, 32, sources[:4],
+                         self.DereferenceIncrement(destination, alignment))
+        bits_to_store -= 4 * min_bits
+        sources = sources[4:]
+      elif bits_to_store >= 3 * min_bits:
+        self.EmitVStoreA(1, 32, sources[:3],
+                         self.DereferenceIncrement(destination, alignment))
+        bits_to_store -= 3 * min_bits
+        sources = sources[3:]
+      elif bits_to_store >= 2 * min_bits:
+        self.EmitVStoreA(1, 32, sources[:2],
+                         self.DereferenceIncrement(destination, alignment))
+        bits_to_store -= 2 * min_bits
+        sources = sources[2:]
+      elif bits_to_store >= min_bits:
+        self.EmitVStore(1, 32, sources[0],
+                        self.DereferenceIncrement(destination, alignment))
+        bits_to_store -= min_bits
+        sources = sources[1:]
+      elif bits_to_store >= 64:
+        self.EmitVStore(1, 32,
+                        _Cast(64, sources[0]),
+                        self.DereferenceIncrement(destination, alignment))
+        bits_to_store -= 64
+        leftover_stored += 64
+      elif bits_to_store >= 32:
+        self.EmitVStore(1, 32,
+                        self.Lane(32, sources[0], leftover_stored / 32),
+                        self.DereferenceIncrement(destination))
+        bits_to_store -= 32
+        leftover_stored += 32
+      elif bits_to_store >= 16:
+        self.EmitVStore(1, 16,
+                        self.Lane(16, sources[0], leftover_stored / 16),
+                        self.DereferenceIncrement(destination))
+        bits_to_store -= 16
+        leftover_stored += 16
+      elif bits_to_store >= 8:
+        self.EmitVStore(1, 8,
+                        self.Lane(8, sources[0], leftover_stored / 8),
+                        self.DereferenceIncrement(destination))
+        bits_to_store -= 8
+        leftover_stored += 8
+      else:
+        raise ArgumentError('Wrong leftover: %d' % bits_to_store)
+
+  def EmitVStoreE(self, store_type, count, source, destination, alignment=None):
+    self.EmitVStoreAE(store_type, count, [source], destination, alignment)
+
+  def EmitVStoreOffset(self, store_no, store_type, source, destination, offset):
+    self.EmitVStoreOffsetA(store_no, store_type, [source], destination, offset)
+
+  def EmitVStoreOffsetA(self, store_no, store_type, sources, destination,
+                        offset):
+    self.EmitOp3('st%d' % store_no,
+                 _RegisterList(store_type, sources), destination, offset)
+
+  def EmitVStoreOffsetE(self, store_type, count, source, destination, offset):
+    if store_type is not 32:
+      raise ArgumentError('Unsupported store_type: %d' % store_type)
+
+    if count == 1:
+      self.EmitVStoreOffset(1, 32,
+                            self.Lane(32, source, 0),
+                            self.Dereference(destination, None), offset)
+    elif count == 2:
+      self.EmitVStoreOffset(1, 32,
+                            _Cast(64, source),
+                            self.Dereference(destination, None), offset)
+    elif count == 3:
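+      # Store two elements with an 8-byte post-increment, the third via its
+      # lane using the caller's offset, then subtract 8 so the destination
+      # advances by 'offset' overall.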
+      self.EmitVStore(1, 32,
+                      _Cast(64, source),
+                      self.DereferenceIncrement(destination, None))
+      self.EmitVStoreOffset(1, 32,
+                            self.Lane(32, source, 2),
+                            self.Dereference(destination, None), offset)
+      self.EmitSub(destination, destination, self.ImmediateConstant(8))
+    elif count == 4:
+      self.EmitVStoreOffset(1, 32, source,
+                            self.Dereference(destination, None), offset)
+    else:
+      raise ArgumentError('Too many elements: %d' % count)
+
+  def EmitVSumReduce(self, reduce_type, elem_count, reduce_count, destinations,
+                     sources):
+    """Generate assembly to perform n-fold horizontal sum reduction."""
+    if reduce_type != 'u32':
+      raise ArgumentError('Unsupported reduce: %s' % reduce_type)
+
+    if (elem_count + 3) / 4 > len(destinations):
+      raise ArgumentError('Too few destinations: %d (%d needed)' %
+                          (len(destinations), (elem_count + 3) / 4))
+
+    if elem_count * reduce_count > len(sources) * 4:
+      raise ArgumentError('Too few sources: %d' % len(sources))
+
+    if reduce_count <= 1:
+      raise ArgumentError('Unsupported reduce_count: %d' % reduce_count)
+
+    sources = [_Cast(128, source) for source in sources]
+    destinations = [_Cast(128, destination) for destination in destinations]
+
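+    # Each pairwise add (addp) halves reduce_count; sources are folded in
+    # place until the final fold, which writes into the destinations.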
+    while reduce_count > 1:
+      if len(sources) % 2 == 1:
+        sources.append(sources[-1])
+
+      if reduce_count == 2:
+        for i in range(len(destinations)):
+          self.EmitVPadd(reduce_type, destinations[i], sources[2 * i],
+                         sources[2 * i + 1])
+        return
+      else:
+        sources_2 = []
+        for i in range(len(sources) / 2):
+          self.EmitVPadd(reduce_type, sources[2 * i], sources[2 * i],
+                         sources[2 * i + 1])
+          sources_2.append(sources[2 * i])
+        reduce_count /= 2
+        sources = sources_2
+
+  def EmitVUzp1(self, uzp_type, destination, source_1, source_2):
+    self.EmitOp3('uzp1',
+                 _AppendType(uzp_type, destination),
+                 _AppendType(uzp_type, source_1),
+                 _AppendType(uzp_type, source_2))
+
+  def EmitVUzp2(self, uzp_type, destination, source_1, source_2):
+    self.EmitOp3('uzp2',
+                 _AppendType(uzp_type, destination),
+                 _AppendType(uzp_type, source_1),
+                 _AppendType(uzp_type, source_2))
+
+  def EmitVUzp(self, uzp_type, destination_1, destination_2, source_1,
+               source_2):
+    self.EmitVUzp1(uzp_type, destination_1, source_1, source_2)
+    self.EmitVUzp2(uzp_type, destination_2, source_1, source_2)
+
+  def EmitVTrn1(self, trn_type, destination, source_1, source_2):
+    self.EmitOp3('trn1',
+                 _AppendType(trn_type, destination),
+                 _AppendType(trn_type, source_1),
+                 _AppendType(trn_type, source_2))
+
+  def EmitVTrn2(self, trn_type, destination, source_1, source_2):
+    self.EmitOp3('trn2',
+                 _AppendType(trn_type, destination),
+                 _AppendType(trn_type, source_1),
+                 _AppendType(trn_type, source_2))
+
+  def EmitVTrn(self, trn_type, destination_1, destination_2, source_1,
+               source_2):
+    self.EmitVTrn1(trn_type, destination_1, source_1, source_2)
+    self.EmitVTrn2(trn_type, destination_2, source_1, source_2)
+
+  def EmitColBlockStride(self, cols, stride, new_stride):
+    assert cols in [1, 2, 3, 4, 5, 6, 7, 8]
+    if cols in [5, 6, 7]:
+      self.EmitSub(new_stride, stride, self.ImmediateConstant(4))
+
+  def EmitLoadColBlock(self, registers, load_type, cols, elements, block,
+                       input_address, stride):
+    assert cols is len(block)
+    assert load_type is 8
+
+    input_deref = self.Dereference(input_address, None)
+    input_deref_increment = self.DereferenceIncrement(input_address, None)
+
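+    # Each column count has its own load sequence: narrow blocks are loaded
+    # lane by lane, while wider blocks load contiguous rows and rearrange
+    # them in registers with trn1/trn2/uzp.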
+    if cols is 1:
+      for i in range(elements):
+        self.EmitVLoadOffset(1, 8,
+                             self.Lane(8, block[0], i), input_deref, stride)
+      self.EmitPld(input_address)
+      return block
+    elif cols is 2:
+      temp = [registers.DoubleRegister() for unused_i in range(2)]
+      for i in range(elements):
+        self.EmitVLoadOffset(1, 16,
+                             self.Lane(16, block[i / 4], i % 4), input_deref,
+                             stride)
+      self.EmitPld(input_address)
+      self.EmitVUzp(8, temp[0], temp[1], block[0], block[1])
+      registers.FreeRegisters(block)
+      return temp
+    elif cols is 3:
+      for i in range(elements):
+        self.EmitVLoadOffsetA(3, 8, [self.Lane(8, row, i) for row in block],
+                              input_deref, stride)
+      self.EmitPld(input_address)
+      return block
+    elif cols is 4:
+      temp = [registers.DoubleRegister() for unused_i in range(4)]
+      for i in range(elements):
+        self.EmitVLoadOffset(1, 32,
+                             self.Lane(32, block[i % 4], i / 4), input_deref,
+                             stride)
+      self.EmitPld(input_address)
+      self.EmitVTrn(16, temp[0], temp[2], block[0], block[2])
+      self.EmitVTrn(16, temp[1], temp[3], block[1], block[3])
+      self.EmitVTrn(8, block[0], block[1], temp[0], temp[1])
+      self.EmitVTrn(8, block[2], block[3], temp[2], temp[3])
+      registers.FreeRegisters(temp)
+      return block
+    elif cols is 5:
+      temp = [registers.DoubleRegister() for unused_i in range(4)]
+      for i in range(elements):
+        self.EmitVLoad(1, 32,
+                       self.Lane(32, block[i % 4], i / 4),
+                       input_deref_increment)
+        self.EmitVLoadOffset(1, 8,
+                             self.Lane(8, block[4], i), input_deref, stride)
+      self.EmitPld(input_address)
+      self.EmitVTrn(16, temp[0], temp[2], block[0], block[2])
+      self.EmitVTrn(16, temp[1], temp[3], block[1], block[3])
+      self.EmitVTrn(8, block[0], block[1], temp[0], temp[1])
+      self.EmitVTrn(8, block[2], block[3], temp[2], temp[3])
+      registers.FreeRegisters(temp)
+      return block
+    elif cols is 6:
+      temp = [registers.DoubleRegister() for unused_i in range(6)]
+      for i in range(elements):
+        self.EmitVLoad(1, 32,
+                       self.Lane(32, block[i % 4], i / 4),
+                       input_deref_increment)
+        self.EmitVLoadOffset(1, 16,
+                             self.Lane(16, block[4 + i / 4], i % 4),
+                             input_deref, stride)
+      self.EmitPld(input_address)
+      self.EmitVTrn(16, temp[0], temp[2], block[0], block[2])
+      self.EmitVTrn(16, temp[1], temp[3], block[1], block[3])
+      self.EmitVUzp(8, temp[4], temp[5], block[4], block[5])
+      self.EmitVTrn(8, block[0], block[1], temp[0], temp[1])
+      self.EmitVTrn(8, block[2], block[3], temp[2], temp[3])
+      registers.FreeRegisters(
+          [block[4], block[5], temp[0], temp[1], temp[2], temp[3]])
+      return [block[0], block[1], block[2], block[3], temp[4], temp[5]]
+    elif cols is 7:
+      temp = [registers.DoubleRegister() for unused_i in range(4)]
+      for i in range(elements):
+        self.EmitVLoad(1, 32,
+                       self.Lane(32, block[i % 4], i / 4),
+                       input_deref_increment)
+        self.EmitVLoadOffsetA(3, 8,
+                              [self.Lane(8, row, i) for row in block[4:]],
+                              input_deref, stride)
+      self.EmitPld(input_address)
+      self.EmitVTrn1(16, temp[0], block[0], block[2])
+      self.EmitVTrn2(16, temp[2], block[0], block[2])
+      self.EmitVTrn1(16, temp[1], block[1], block[3])
+      self.EmitVTrn2(16, temp[3], block[1], block[3])
+      self.EmitVTrn1(8, block[0], temp[0], temp[1])
+      self.EmitVTrn2(8, block[1], temp[0], temp[1])
+      self.EmitVTrn1(8, block[2], temp[2], temp[3])
+      self.EmitVTrn2(8, block[3], temp[2], temp[3])
+      registers.FreeRegisters(temp)
+      return block
+    elif cols is 8:
+      temp = [registers.DoubleRegister() for unused_i in range(8)]
+      for i in range(elements):
+        self.EmitVLoadOffset(1, 32, block[i], input_deref, stride)
+      self.EmitPld(input_address)
+      self.EmitVTrn(8, temp[0], temp[1], block[0], block[1])
+      self.EmitVTrn(8, temp[2], temp[3], block[2], block[3])
+      self.EmitVTrn(8, temp[4], temp[5], block[4], block[5])
+      self.EmitVTrn(8, temp[6], temp[7], block[6], block[7])
+      self.EmitVTrn(16, block[0], block[2], temp[0], temp[2])
+      self.EmitVTrn(16, block[1], block[3], temp[1], temp[3])
+      self.EmitVTrn(16, block[4], block[6], temp[4], temp[6])
+      self.EmitVTrn(16, block[5], block[7], temp[5], temp[7])
+      self.EmitVTrn(32, temp[0], temp[4], block[0], block[4])
+      self.EmitVTrn(32, temp[1], temp[5], block[1], block[5])
+      self.EmitVTrn(32, temp[2], temp[6], block[2], block[6])
+      self.EmitVTrn(32, temp[3], temp[7], block[3], block[7])
+      registers.FreeRegisters(block)
+      return temp
+    else:
+      assert False
+
+  def Dereference(self, value, unused_alignment=None):
+    new_value = value.Copy()
+    new_value.dereference = True
+    return new_value
+
+  def DereferenceIncrement(self, value, alignment=None):
+    new_value = self.Dereference(value, alignment).Copy()
+    new_value.dereference_increment = True
+    return new_value
+
+  def ImmediateConstant(self, value):
+    return _ImmediateConstant(value)
+
+  def AllLanes(self, value):
+    return '%s[]' % value
+
+  def Lane(self, bits, value, lane):
+    new_value = value.Copy()
+    if bits * (lane + 1) > new_value.register_bits:
+      raise ArgumentError('Lane too big: (%d + 1) x %d > %d' %
+                          (lane, bits, new_value.register_bits))
+    new_value.lane = lane
+    new_value.lane_bits = bits
+    return new_value
+
+  def CreateRegisters(self):
+    return _NeonRegisters64Bit()
diff --git a/meta/generators/qnt_Nx8_neon.py b/meta/generators/qnt_Nx8_neon.py
index 1cc6f25..5d983dc 100644
--- a/meta/generators/qnt_Nx8_neon.py
+++ b/meta/generators/qnt_Nx8_neon.py
@@ -37,9 +37,8 @@
     offset_registers = []
     for unused_i in range(0, lanes):
       register = registers.QuadRegister()
-      emitter.EmitVLoadA('1.32',
-                         [emitter.AllLanes(registers.Low(register)),
-                          emitter.AllLanes(registers.High(register))],
+      emitter.EmitVLoadA('1.32', [emitter.AllLanes(registers.Low(register)),
+                                  emitter.AllLanes(registers.High(register))],
                          emitter.DereferenceIncrement(offsets, 32))
       offset_registers.append(register)
     return offset_registers
@@ -47,17 +46,11 @@
     raise ConfigurationError('Unsupported number of lanes: %d' % lanes)
 
 
-def GenerateQntLanes(emitter,
-                     registers,
-                     qnt_lanes,
-                     source,
-                     stride,
-                     destination,
-                     destination_stride,
-                     offsets):
+def GenerateQntLanes(emitter, registers, qnt_lanes, source, stride, destination,
+                     destination_stride, offsets):
   """Prepare lanes for reading unquantized multiplication results."""
-  offset_registers = LoadAndDuplicateOffsets(
-      emitter, registers, qnt_lanes, offsets)
+  offset_registers = LoadAndDuplicateOffsets(emitter, registers, qnt_lanes,
+                                             offsets)
 
   lanes = []
   last_input_register = source
@@ -90,13 +83,8 @@
   return register
 
 
-def GenerateQuantize(emitter,
-                     registers,
-                     lanes,
-                     lane_temps,
-                     multiplicative_offset,
-                     rounding_offset,
-                     shift):
+def GenerateQuantize(emitter, registers, lanes, lane_temps,
+                     multiplicative_offset, rounding_offset, shift):
   """Inner loop for quantization: add offsets, multiply, round, shift."""
   for lane in lanes:
     emitter.EmitVAdd('i32', lane[0], lane[0], lane[1])
@@ -117,25 +105,18 @@
     emitter.EmitVQmovun('s16', registers.Low(lane_temp), lane_temp)
 
 
-def GenerateLoadQuantizeStore(emitter,
-                              registers,
-                              lanes,
-                              multiplicative_offset,
-                              rounding_offset,
-                              shift,
-                              alignment):
+def GenerateLoadQuantizeStore(emitter, registers, lanes, multiplicative_offset,
+                              rounding_offset, shift, alignment):
   """Load unquantized data from lanes, quantize, store final result."""
   lane_temps = []
   for lane in lanes:
     lane_temps.append(registers.QuadRegister())
 
   for lane in lanes:
-    emitter.EmitVLoadA('1.32',
-                       [registers.Low(lane.load_1),
-                        registers.High(lane.load_1),
-                        registers.Low(lane.load_2),
-                        registers.High(lane.load_2)],
-                       emitter.DereferenceIncrement(lane.source, 64))
+    emitter.EmitVLoadA(
+        '1.32', [registers.Low(lane.load_1), registers.High(lane.load_1),
+                 registers.Low(lane.load_2), registers.High(lane.load_2)],
+        emitter.DereferenceIncrement(lane.source, 64))
 
   for lane in lanes:
     emitter.EmitPld(lane.source)
@@ -145,17 +126,11 @@
     quantize_setup.append([lane.load_1, lane.offset, registers.Low(lane_temp)])
     quantize_setup.append([lane.load_2, lane.offset, registers.High(lane_temp)])
 
-  GenerateQuantize(emitter,
-                   registers,
-                   quantize_setup,
-                   lane_temps,
-                   multiplicative_offset,
-                   rounding_offset,
-                   shift)
+  GenerateQuantize(emitter, registers, quantize_setup, lane_temps,
+                   multiplicative_offset, rounding_offset, shift)
 
   for (lane_temp, lane) in zip(lane_temps, lanes):
-    emitter.EmitVStore('1.8',
-                       registers.Low(lane_temp),
+    emitter.EmitVStore('1.8', registers.Low(lane_temp),
                        emitter.DereferenceIncrement(lane.output, alignment))
 
   for lane_temp in lane_temps:
@@ -166,56 +141,50 @@
   """Handle non multiply of 8 leftover loading."""
   if leftovers == 1:
     for lane in lanes:
-      emitter.EmitVLoad('1.32',
-                        emitter.Lane(registers.Low(lane.load_1), 0),
+      emitter.EmitVLoad('1.32', emitter.Lane(
+          registers.Low(lane.load_1), 0),
                         emitter.Dereference(lane.source, None))
   elif leftovers == 2:
     for lane in lanes:
-      emitter.EmitVLoad('1.32',
-                        registers.Low(lane.load_1),
+      emitter.EmitVLoad('1.32', registers.Low(lane.load_1),
                         emitter.Dereference(lane.source, 64))
   elif leftovers == 3:
     for lane in lanes:
-      emitter.EmitVLoad('1.32',
-                        registers.Low(lane.load_1),
+      emitter.EmitVLoad('1.32', registers.Low(lane.load_1),
                         emitter.DereferenceIncrement(lane.source, 64))
     for lane in lanes:
-      emitter.EmitVLoad('1.32',
-                        emitter.Lane(registers.High(lane.load_1), 0),
+      emitter.EmitVLoad('1.32', emitter.Lane(
+          registers.High(lane.load_1), 0),
                         emitter.Dereference(lane.source, None))
   elif leftovers == 4:
     for lane in lanes:
-      emitter.EmitVLoadA('1.32',
-                         [registers.Low(lane.load_1),
-                          registers.High(lane.load_1)],
+      emitter.EmitVLoadA('1.32', [registers.Low(lane.load_1),
+                                  registers.High(lane.load_1)],
                          emitter.Dereference(lane.source, 64))
   elif leftovers == 5:
     for lane in lanes:
-      emitter.EmitVLoadA('1.32',
-                         [registers.Low(lane.load_1),
-                          registers.High(lane.load_1)],
+      emitter.EmitVLoadA('1.32', [registers.Low(lane.load_1),
+                                  registers.High(lane.load_1)],
                          emitter.DereferenceIncrement(lane.source, 64))
     for lane in lanes:
-      emitter.EmitVLoad('1.32',
-                        emitter.Lane(registers.Low(lane.load_2), 0),
+      emitter.EmitVLoad('1.32', emitter.Lane(
+          registers.Low(lane.load_2), 0),
                         emitter.Dereference(lane.source, None))
   elif leftovers == 6:
     for lane in lanes:
-      emitter.EmitVLoadA('1.32',
-                         [registers.Low(lane.load_1),
-                          registers.High(lane.load_1),
-                          registers.Low(lane.load_2)],
+      emitter.EmitVLoadA('1.32', [registers.Low(lane.load_1),
+                                  registers.High(lane.load_1),
+                                  registers.Low(lane.load_2)],
                          emitter.Dereference(lane.source, 64))
   elif leftovers == 7:
     for lane in lanes:
-      emitter.EmitVLoadA('1.32',
-                         [registers.Low(lane.load_1),
-                          registers.High(lane.load_1),
-                          registers.Low(lane.load_2)],
+      emitter.EmitVLoadA('1.32', [registers.Low(lane.load_1),
+                                  registers.High(lane.load_1),
+                                  registers.Low(lane.load_2)],
                          emitter.DereferenceIncrement(lane.source, 64))
     for lane in lanes:
-      emitter.EmitVLoad('1.32',
-                        emitter.Lane(registers.High(lane.load_2), 0),
+      emitter.EmitVLoad('1.32', emitter.Lane(
+          registers.High(lane.load_2), 0),
                         emitter.Dereference(lane.source, None))
   else:
     raise ConfigurationError('Unsupported leftover count: %d' % leftovers)
@@ -274,12 +243,8 @@
     raise ConfigurationError('Unsupported leftovers count: %d' % leftovers)
 
 
-def GenerateLeftoverLoadQuantizeStore(emitter,
-                                      registers,
-                                      leftovers,
-                                      lanes,
-                                      multiplicative_offset,
-                                      rounding_offset,
+def GenerateLeftoverLoadQuantizeStore(emitter, registers, leftovers, lanes,
+                                      multiplicative_offset, rounding_offset,
                                       shift):
   """Handle leftovers if row size not a multiply of 8."""
   lane_temps = []
@@ -292,16 +257,11 @@
   for (lane_temp, lane) in zip(lane_temps, lanes):
     quantize_setup.append([lane.load_1, lane.offset, registers.Low(lane_temp)])
     if leftovers > 4:
-      quantize_setup.append(
-          [lane.load_2, lane.offset, registers.High(lane_temp)])
+      quantize_setup.append([lane.load_2, lane.offset, registers.High(lane_temp)
+                            ])
 
-  GenerateQuantize(emitter,
-                   registers,
-                   quantize_setup,
-                   lane_temps,
-                   multiplicative_offset,
-                   rounding_offset,
-                   shift)
+  GenerateQuantize(emitter, registers, quantize_setup, lane_temps,
+                   multiplicative_offset, rounding_offset, shift)
 
   GenerateStoreLeftovers(emitter, registers, leftovers, lane_temps, lanes)
 
@@ -315,25 +275,20 @@
 
   name = BuildName(qnt_lanes, leftovers, aligned)
 
-  emitter.EmitFunctionBeginA(name,
-                             [['const std::int32_t*', 'source'],
-                              ['std::int32_t', 'count'],
-                              ['std::int32_t', 'stride'],
-                              ['const std::int32_t*', 'offsets'],
-                              ['std::uint8_t*', 'destination'],
-                              ['std::int32_t', 'destination_stride'],
-                              ['std::int32_t', 'multiplicative_offset'],
-                              ['std::int32_t', 'rounding_offset'],
-                              ['std::int32_t', 'shift']],
-                             'void')
+  emitter.EmitFunctionBeginA(
+      name,
+      [['const std::int32_t*', 'source'], ['std::int32_t', 'count'],
+       ['std::int32_t', 'stride'], ['const std::int32_t*', 'offsets'],
+       ['std::uint8_t*', 'destination'], ['std::int32_t', 'destination_stride'],
+       ['std::int32_t', 'multiplicative_offset'],
+       ['std::int32_t', 'rounding_offset'], ['std::int32_t', 'shift']], 'void')
   emitter.EmitAssert('count %% 8 == %d' % leftovers)
   emitter.EmitAssert('count >= 8')
   emitter.EmitAssert('reinterpret_cast<std::uintptr_t>(source) % 8 == 0')
   if aligned:
     emitter.EmitAssert('reinterpret_cast<std::uintptr_t>(destination) % 8 == 0')
     if qnt_lanes > 1:
-      emitter.EmitAssert(
-          'destination_stride % 8 == 0')
+      emitter.EmitAssert('destination_stride % 8 == 0')
   emitter.EmitAsmBegin()
 
   registers = neon_emitter.NeonRegisters()
@@ -342,15 +297,13 @@
 
   multiplicative_offset = DuplicateRegister(
       emitter, registers, registers.MapParameter('multiplicative_offset'))
-  rounding_offset = DuplicateRegister(
-      emitter, registers, registers.MapParameter('rounding_offset'))
+  rounding_offset = DuplicateRegister(emitter, registers,
+                                      registers.MapParameter('rounding_offset'))
   shift = DuplicateRegister(emitter, registers, registers.MapParameter('shift'))
 
   lanes = GenerateQntLanes(
-      emitter, registers, qnt_lanes,
-      registers.MapParameter('source'),
-      registers.MapParameter('stride'),
-      registers.MapParameter('destination'),
+      emitter, registers, qnt_lanes, registers.MapParameter('source'),
+      registers.MapParameter('stride'), registers.MapParameter('destination'),
       registers.MapParameter('destination_stride'),
       registers.MapParameter('offsets'))
 
@@ -362,36 +315,65 @@
   emitter.EmitNumericalLabel(1)
   emitter.EmitSubs(count, count, emitter.ImmediateConstant(8))
 
-  GenerateLoadQuantizeStore(emitter,
-                            registers,
-                            lanes,
-                            multiplicative_offset,
-                            rounding_offset,
-                            shift,
-                            64 if aligned else None)
+  GenerateLoadQuantizeStore(emitter, registers, lanes, multiplicative_offset,
+                            rounding_offset, shift, 64 if aligned else None)
 
   emitter.EmitNewline()
   emitter.EmitBneBack(1)
 
   if leftovers:
     emitter.EmitNumericalLabel(2)
-    GenerateLeftoverLoadQuantizeStore(emitter,
-                                      registers,
-                                      leftovers,
-                                      lanes,
-                                      multiplicative_offset,
-                                      rounding_offset,
+    GenerateLeftoverLoadQuantizeStore(emitter, registers, leftovers, lanes,
+                                      multiplicative_offset, rounding_offset,
                                       shift)
 
-  emitter.EmitAsmEnd(registers.MappedParameters(),
-                     [],
+  emitter.EmitAsmEnd(registers.MappedParameters(), [],
                      registers.Clobbers() + ['cc', 'memory'])
   emitter.EmitFunctionEnd()
 
 
-def GenerateFunctions(emitter):
+def BuildMultiQuantizeName(aligned, rows):
+  name = 'multi_qnt_%dx8' % rows
+  if aligned:
+    name = '%s_aligned' % name
+  return name
+
+
+def GenerateMultiQuantize(emitter, aligned, rows):
+  """Emit main quantization code that switches between optimized versions."""
+  name = BuildMultiQuantizeName(aligned, rows)
+  emitter.EmitFunctionBeginA(
+      name,
+      [['const std::int32_t*', 'source'], ['std::int32_t', 'count'],
+       ['std::int32_t', 'stride'], ['const std::int32_t*', 'offsets'],
+       ['std::uint8_t*', 'destination'], ['std::int32_t', 'destination_stride'],
+       ['std::int32_t', 'multiplicative_offset'],
+       ['std::int32_t', 'rounding_offset'], ['std::int32_t', 'shift']], 'void')
+  emitter.EmitSwitch('count % 8')
+
+  for leftovers in range(0, 8):
+    emitter.EmitCase(leftovers)
+    emitter.PushIndent()
+    emitter.EmitCall(
+        BuildName(rows, leftovers, aligned),
+        ['source', 'count', 'stride', 'offsets', 'destination',
+         'destination_stride', 'multiplicative_offset', 'rounding_offset',
+         'shift'])
+    emitter.EmitBreak()
+    emitter.PopIndent()
+
+  emitter.EmitSwitchEnd()
+  emitter.EmitFunctionEnd()
+
+
+def GenerateFunctions(neon, cc):
   for aligned in [True, False]:
     for lanes in range(1, 4):
       for leftovers in range(0, 8):
-        GenerateQntNx8(emitter, lanes, leftovers, aligned)
-        emitter.EmitNewline()
+        GenerateQntNx8(neon, lanes, leftovers, aligned)
+        neon.EmitNewline()
+
+  for aligned in [True, False]:
+    for rows in range(1, 4):
+      GenerateMultiQuantize(cc, aligned, rows)
+      cc.EmitNewline()
diff --git a/meta/generators/quantized_mul_kernels_arm_32.py b/meta/generators/quantized_mul_kernels_arm_32.py
new file mode 100644
index 0000000..df6274e
--- /dev/null
+++ b/meta/generators/quantized_mul_kernels_arm_32.py
@@ -0,0 +1,47 @@
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Generates the arm32 headers used by the gemm/gemv lib."""
+
+import cc_emitter
+import common
+import neon_emitter
+import quantized_mul_kernels_common
+
+
+def Main():
+  """."""
+  cc = cc_emitter.CCEmitter()
+  common.GenerateHeader(cc, 'gemmlowp_meta_quantized_mul_kernels_arm_32',
+                        'GEMMLOWP_NEON_32')
+
+  cc.EmitNamespaceBegin('gemmlowp')
+  cc.EmitNamespaceBegin('meta')
+  cc.EmitNewline()
+
+  shapes = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8),
+            (2, 1), (2, 2), (2, 3), (2, 4), (3, 1), (3, 2), (3, 3)]
+
+  quantized_mul_kernels_common.GenerateKernels(cc,
+                                               neon_emitter.NeonEmitter(),
+                                               shapes)
+
+  cc.EmitNamespaceEnd()
+  cc.EmitNamespaceEnd()
+  cc.EmitNewline()
+
+  common.GenerateFooter(cc, 'Meta gemm for arm32 requires: GEMMLOWP_NEON_32!')
+
+
+if __name__ == '__main__':
+  Main()
diff --git a/meta/generators/quantized_mul_kernels_arm_64.py b/meta/generators/quantized_mul_kernels_arm_64.py
new file mode 100644
index 0000000..0bfbd0f
--- /dev/null
+++ b/meta/generators/quantized_mul_kernels_arm_64.py
@@ -0,0 +1,47 @@
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Generates the arm32 headers used by the gemm/gemv lib."""
+
+import cc_emitter
+import common
+import neon_emitter_64
+import quantized_mul_kernels_common
+
+
+def Main():
+  """."""
+  cc = cc_emitter.CCEmitter()
+  common.GenerateHeader(cc, 'gemmlowp_meta_quantized_mul_kernels_arm_64',
+                        'GEMMLOWP_NEON_64')
+
+  cc.EmitNamespaceBegin('gemmlowp')
+  cc.EmitNamespaceBegin('meta')
+  cc.EmitNewline()
+
+  shapes = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8),
+            (2, 1), (2, 2), (2, 3), (2, 4), (3, 1), (3, 2), (3, 3)]
+
+  quantized_mul_kernels_common.GenerateKernels(cc,
+                                               neon_emitter_64.NeonEmitter64(),
+                                               shapes)
+
+  cc.EmitNamespaceEnd()
+  cc.EmitNamespaceEnd()
+  cc.EmitNewline()
+
+  common.GenerateFooter(cc, 'Meta gemm for arm64 requires: GEMMLOWP_NEON_64!')
+
+
+if __name__ == '__main__':
+  Main()
diff --git a/meta/generators/quantized_mul_kernels_common.py b/meta/generators/quantized_mul_kernels_common.py
new file mode 100644
index 0000000..ef69eb5
--- /dev/null
+++ b/meta/generators/quantized_mul_kernels_common.py
@@ -0,0 +1,641 @@
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""."""
+
+import common
+
+
+def _ReadParams(emitter, registers, input_address, elements, min_register):
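+  # Loads 'elements' 32-bit parameters into ceil(elements / 4) quad registers
+  # allocated at or above min_register.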
+  registers_count = (elements + 3) / 4
+  registers = [
+      registers.QuadRegister(min_register)
+      for unused_i in range(registers_count)
+  ]
+  emitter.EmitVLoadAE(registers_count * 4, 32, registers, input_address, 64)
+  return registers
+
+
+def _Duplicate(emitter, registers, rows, values):
+  """Populate a grid of registers duplicating provided values."""
+  duplicated = []
+  for i in range(rows):
+    if i is rows - 1:
+      duplicated.append(values[0])
+    else:
+      duplicated.append(registers.QuadRegister())
+
+    emitter.EmitVDup('32', duplicated[i],
+                     emitter.Lane(32, values[i / 4], i % 4))
+
+  return duplicated
+
+
+def _DuplicateGeneralRegister(emitter, registers, value, min_register):
+  register = registers.QuadRegister(min_register)
+  emitter.EmitVDup('32', register, value)
+  return register
+
+
+class _StaticQuantizationUInt8Transformation(object):
+  """Calculate quantized values and cast back to uint8."""
+
+  def Prepare(self, emitter, registers, kernel_m, kernel_n, lhs, rhs):
+    """Load parameters and prepare duplicated registers."""
+    emitter.EmitNewline()
+    emitter.EmitComment('StaticQuantization::Prepare')
+
+    lhs_offset = _ReadParams(emitter, registers, lhs, kernel_m, 4)
+    self.rhs_offsets = _ReadParams(emitter, registers, rhs, kernel_n, 4)
+    self.multiplicative_offset = _DuplicateGeneralRegister(
+        emitter, registers,
+        registers.MapParameter('multiplicative_offset',
+                               'params.kernel.multiplicative_offset'), 4)
+    self.rounding_offset = _DuplicateGeneralRegister(
+        emitter, registers,
+        registers.MapParameter('rounding_offset',
+                               'params.kernel.rounding_offset'), 4)
+    self.shift = _DuplicateGeneralRegister(
+        emitter, registers,
+        registers.MapParameter('shift', 'params.kernel.shift'), 4)
+    self.lhs_offsets = _Duplicate(emitter, registers, kernel_m, lhs_offset)
+
+  def Transform(self, emitter, registers, data, unused_kernel_m,
+                unused_kernel_n):
+    """Quantize the data."""
+    emitter.EmitNewline()
+    emitter.EmitComment('StaticQuantization::Transform')
+
+    for (row, lhs_offset) in zip(data, self.lhs_offsets):
+      for row_register in row:
+        emitter.EmitVAdd('s32', row_register, row_register, lhs_offset)
+
+    for row in data:
+      for (row_register, rhs_offset_register) in zip(row, self.rhs_offsets):
+        emitter.EmitVAdd('s32', row_register, row_register, rhs_offset_register)
+
+    for row in data:
+      for row_register in row:
+        emitter.EmitVMul('i32', row_register, row_register,
+                         self.multiplicative_offset)
+
+    for row in data:
+      for row_register in row:
+        emitter.EmitVAdd('i32', row_register, row_register,
+                         self.rounding_offset)
+
+    for row in data:
+      for row_register in row:
+        emitter.EmitVShl('s32', row_register, row_register, self.shift)
+
+    if len(data[0]) == 1:
+      for row in data:
+        emitter.EmitVQmovn('s32', row[0], row[0])
+
+      for row in data:
+        emitter.EmitVQmovun('s16', row[0], row[0])
+
+      return data
+    elif len(data[0]) == 2:
+      results = []
+      for row in data:
+        emitter.EmitVQmovn2('s32', row[0], row[0], row[1])
+        registers.FreeRegister(row[1])
+        results.append([row[0]])
+
+      for row in results:
+        emitter.EmitVQmovun('s16', row[0], row[0])
+
+      return results
+    else:
+      assert False
+
+  def Type(self):
+    return 8
+
+
+class _StaticQuantizationInt32Transformation(object):
+  """."""
+
+  def Prepare(self, emitter, registers, kernel_m, kernel_n, lhs, rhs):
+    emitter.EmitNewline()
+    emitter.EmitComment('StaticQuantizationInt32::Prepare')
+
+    lhs_offset = _ReadParams(emitter, registers, lhs, kernel_m, 4)
+    self.rhs_offsets = _ReadParams(emitter, registers, rhs, kernel_n, 4)
+    self.lhs_offsets = _Duplicate(emitter, registers, kernel_m, lhs_offset)
+
+  def Transform(self, emitter, unused_registers, data, unused_kernel_m,
+                unused_kernel_n):
+    """Quantize data and output as int32."""
+    emitter.EmitNewline()
+    emitter.EmitComment('StaticQuantizationInt32::Transform')
+
+    for (row, lhs_offset) in zip(data, self.lhs_offsets):
+      for row_register in row:
+        emitter.EmitVAdd('s32', row_register, row_register, lhs_offset)
+
+    for row in data:
+      for (row_register, rhs_offsets_register) in zip(row, self.rhs_offsets):
+        emitter.EmitVAdd('s32', row_register, row_register,
+                         rhs_offsets_register)
+
+    return data
+
+  def Type(self):
+    return 32
+
+
+class _StaticQuantizationFloatTransformation(object):
+  """."""
+
+  def Prepare(self, emitter, registers, kernel_m, kernel_n, lhs, rhs):
+    emitter.EmitNewline()
+    emitter.EmitComment('StaticQuantizationFloat::Prepare')
+
+    lhs_offset = _ReadParams(emitter, registers, lhs, kernel_m, 4)
+    self.rhs_offsets = _ReadParams(emitter, registers, rhs, kernel_n, 4)
+    self.scale = _DuplicateGeneralRegister(
+        emitter, registers,
+        registers.MapParameter('scale', 'params.kernel.scale'), 4)
+    self.lhs_offsets = _Duplicate(emitter, registers, kernel_m, lhs_offset)
+
+  def Transform(self, emitter, unused_registers, data, unused_kernel_m,
+                unused_kernel_n):
+    """Quantize data and output as float."""
+    emitter.EmitNewline()
+    emitter.EmitComment('StaticQuantizationFloat::Transform')
+
+    for (row, lhs_offset) in zip(data, self.lhs_offsets):
+      for row_register in row:
+        emitter.EmitVAdd('s32', row_register, row_register, lhs_offset)
+
+    for row in data:
+      for (row_register, rhs_offsets_register) in zip(row, self.rhs_offsets):
+        emitter.EmitVAdd('s32', row_register, row_register,
+                         rhs_offsets_register)
+
+    for row in data:
+      for row_register in row:
+        emitter.EmitVCvt('f32', 's32', row_register, row_register)
+
+    for row in data:
+      for row_register in row:
+        emitter.EmitVMul('f32', row_register, row_register, self.scale)
+
+    return data
+
+  def Type(self):
+    return 32
+
+
+class _RowMajorOutput(object):
+  """Output data in row major layout."""
+
+  def Prepare(self, emitter, registers, kernel_m, unused_kernel_n,
+              unused_data_type):
+    """Prepare strided load addresses."""
+    emitter.EmitNewline()
+    emitter.EmitComment('RowMajorOutput::Prepare')
+
+    stride = registers.MapParameter('stride', 'params.output_stream.stride')
+
+    self.outputs = []
+    self.outputs.append(registers.MapOutputParameter('result'))
+
+    for unused_i in range(kernel_m - 1):
+      register = registers.GeneralRegister()
+      emitter.EmitAdd(register, self.outputs[-1], stride)
+      self.outputs.append(register)
+
+  def Output(self, emitter, unused_registers, data, data_type, unused_kernel_m,
+             kernel_n):
+    emitter.EmitNewline()
+    emitter.EmitComment('RowMajorOutput::Output')
+
+    for (datum, output) in zip(data, self.outputs):
+      emitter.EmitVStoreAE(data_type, kernel_n, datum, output, None)
+
+
+def _GenerateAndClearAggregators(emitter, registers, count):
+  """Prepare aggregators and emit aggregator clear code."""
+  emitter.EmitNewline()
+  emitter.EmitComment('Clear aggregators.')
+  aggregators = [registers.QuadRegister() for unused_i in range(count)]
+  for i in range(count):
+    if i < 3:
+      emitter.EmitVMov('i32', aggregators[i], emitter.ImmediateConstant(0))
+    else:
+      emitter.EmitVMov('i32', aggregators[i], aggregators[i - 3])
+  return aggregators
+
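A note on the clearing pattern above: the first three aggregators are zeroed with an immediate, and every later one is cleared by copying from the aggregator three slots earlier, which is already zero; presumably this spreads the work across independent instructions instead of repeating the immediate move. The copy sources can be modelled as follows (illustrative Python):

```python
def clear_sources(count):
  # For each aggregator i: either an immediate zero or the index it copies from.
  return ['imm0' if i < 3 else i - 3 for i in range(count)]

# Nine aggregators (the 3x3 kernel): three immediate clears, six register copies.
assert clear_sources(9) == ['imm0', 'imm0', 'imm0', 0, 1, 2, 3, 4, 5]
```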
+
+def _Generate3x3LoadMultiplyAggregate(emitter, registers, aggregators, lhs, rhs,
+                                      count):
+  """Emit inner loop for 3 rows x 3 cols multiplication."""
+  emitter.EmitNewline()
+  emitter.EmitComment('3x3 lanes loop.')
+  emitter.EmitNumericalLabel(1)
+  emitter.EmitNewline()
+
+  lhs_load = [registers.DoubleRegister() for unused_i in range(3)]
+  rhs_load = [registers.DoubleRegister() for unused_i in range(3)]
+  temp = [registers.QuadRegister() for unused_i in range(4)]
+
+  emitter.EmitVLoadA(1, 8, rhs_load, emitter.DereferenceIncrement(rhs, 64))
+  emitter.EmitVLoad(1, 8, lhs_load[0], emitter.DereferenceIncrement(lhs, 64))
+
+  emitter.EmitVMull('u8', temp[0], lhs_load[0], rhs_load[0])
+  emitter.EmitVLoad(1, 8, lhs_load[1], emitter.DereferenceIncrement(lhs, 64))
+
+  emitter.EmitVMull('u8', temp[1], lhs_load[0], rhs_load[1])
+  emitter.EmitVLoad(1, 8, lhs_load[2], emitter.DereferenceIncrement(lhs, 64))
+
+  emitter.EmitVMull('u8', temp[2], lhs_load[0], rhs_load[2])
+  emitter.EmitPldOffset(lhs, emitter.ImmediateConstant(64))
+
+  emitter.EmitVMull('u8', temp[3], lhs_load[1], rhs_load[0])
+  emitter.EmitPldOffset(rhs, emitter.ImmediateConstant(64))
+
+  emitter.EmitVPadal('u16', aggregators[0], temp[0])
+  emitter.EmitVPadal('u16', aggregators[1], temp[1])
+  emitter.EmitVPadal('u16', aggregators[2], temp[2])
+  emitter.EmitVPadal('u16', aggregators[3], temp[3])
+
+  emitter.EmitVMull('u8', temp[0], lhs_load[1], rhs_load[1])
+  emitter.EmitVMull('u8', temp[1], lhs_load[1], rhs_load[2])
+
+  registers.FreeRegisters([lhs_load[0], lhs_load[1]])
+  temp.append(registers.QuadRegister())
+
+  emitter.EmitVMull('u8', temp[2], lhs_load[2], rhs_load[0])
+  emitter.EmitVMull('u8', temp[3], lhs_load[2], rhs_load[1])
+
+  emitter.EmitNewline()
+  emitter.EmitComment('Subtract counter.')
+  emitter.EmitSubs(count, count, emitter.ImmediateConstant(8))
+  emitter.EmitNewline()
+
+  emitter.EmitVMull('u8', temp[4], lhs_load[2], rhs_load[2])
+
+  emitter.EmitVPadal('u16', aggregators[4], temp[0])
+  emitter.EmitVPadal('u16', aggregators[5], temp[1])
+  emitter.EmitVPadal('u16', aggregators[6], temp[2])
+  emitter.EmitVPadal('u16', aggregators[7], temp[3])
+  emitter.EmitVPadal('u16', aggregators[8], temp[4])
+
+  emitter.EmitNewline()
+  emitter.EmitComment('Loop break.')
+  emitter.EmitBgtBack(1)
+
+  registers.FreeRegisters(temp + [lhs_load[2]] + rhs_load)
+
+
+def _Generate2x4LoadMultiplyAggregate(emitter, registers, aggregators, lhs, rhs,
+                                      count):
+  """Emit inner loop for 2 rows x 4 cols multiplication."""
+  emitter.EmitNewline()
+  emitter.EmitComment('2x4 lanes loop.')
+  emitter.EmitNumericalLabel(1)
+  emitter.EmitNewline()
+
+  lhs_load = [registers.DoubleRegister() for unused_i in range(2)]
+  rhs_load = [registers.DoubleRegister() for unused_i in range(4)]
+  temp = [registers.QuadRegister() for unused_i in range(5)]
+
+  emitter.EmitVLoadA(1, 8, rhs_load, emitter.DereferenceIncrement(rhs, 256))
+  emitter.EmitVLoad(1, 8, lhs_load[0], emitter.DereferenceIncrement(lhs, 64))
+
+  emitter.EmitVMull('u8', temp[0], lhs_load[0], rhs_load[0])
+  emitter.EmitVLoad(1, 8, lhs_load[1], emitter.DereferenceIncrement(lhs, 64))
+
+  emitter.EmitVMull('u8', temp[1], lhs_load[0], rhs_load[1])
+  emitter.EmitPldOffset(rhs, emitter.ImmediateConstant(64))
+
+  emitter.EmitVMull('u8', temp[2], lhs_load[0], rhs_load[2])
+  emitter.EmitPldOffset(lhs, emitter.ImmediateConstant(64))
+
+  emitter.EmitVMull('u8', temp[3], lhs_load[0], rhs_load[3])
+  emitter.EmitVMull('u8', temp[4], lhs_load[1], rhs_load[0])
+
+  emitter.EmitVPadal('u16', aggregators[0], temp[0])
+  emitter.EmitVPadal('u16', aggregators[1], temp[1])
+  emitter.EmitVPadal('u16', aggregators[2], temp[2])
+
+  emitter.EmitVMull('u8', temp[0], lhs_load[1], rhs_load[1])
+  emitter.EmitVMull('u8', temp[1], lhs_load[1], rhs_load[2])
+  emitter.EmitVMull('u8', temp[2], lhs_load[1], rhs_load[3])
+
+  emitter.EmitNewline()
+  emitter.EmitComment('Subtract counter.')
+  emitter.EmitSubs(count, count, emitter.ImmediateConstant(8))
+
+  emitter.EmitNewline()
+  emitter.EmitVPadal('u16', aggregators[3], temp[3])
+  emitter.EmitVPadal('u16', aggregators[4], temp[4])
+  emitter.EmitVPadal('u16', aggregators[5], temp[0])
+  emitter.EmitVPadal('u16', aggregators[6], temp[1])
+  emitter.EmitVPadal('u16', aggregators[7], temp[2])
+
+  emitter.EmitNewline()
+  emitter.EmitComment('Loop break.')
+  emitter.EmitBgtBack(1)
+
+  registers.FreeRegisters(temp + lhs_load + rhs_load)
+
+
+def _Generate1x8LoadMultiplyAggregate(emitter, registers, aggregators, lhs, rhs,
+                                      count):
+  """Emit inner loop for 1 rows x 8 cols multiplication."""
+  emitter.EmitNewline()
+  emitter.EmitComment('1x8 lanes loop.')
+  emitter.EmitNumericalLabel(1)
+  emitter.EmitNewline()
+
+  lhs_load = registers.DoubleRegister()
+  rhs_load = [registers.DoubleRegister() for unused_i in range(4)]
+  temp = [registers.QuadRegister() for unused_i in range(5)]
+
+  emitter.EmitVLoadAE(4 * 8, 8, rhs_load, rhs, 256)
+  emitter.EmitVLoadE(8, 8, lhs_load, lhs, 64)
+
+  emitter.EmitVMull('u8', temp[0], lhs_load, rhs_load[0])
+  emitter.EmitVMull('u8', temp[1], lhs_load, rhs_load[1])
+  emitter.EmitVMull('u8', temp[2], lhs_load, rhs_load[2])
+  emitter.EmitVMull('u8', temp[3], lhs_load, rhs_load[3])
+
+  emitter.EmitVLoadAE(4 * 8, 8, rhs_load, rhs, 256)
+
+  emitter.EmitVPadal('u16', aggregators[0], temp[0])
+  emitter.EmitVPadal('u16', aggregators[1], temp[1])
+  emitter.EmitVPadal('u16', aggregators[2], temp[2])
+  emitter.EmitVPadal('u16', aggregators[3], temp[3])
+
+  emitter.EmitPldOffset(rhs, emitter.ImmediateConstant(256))
+
+  emitter.EmitVMull('u8', temp[4], lhs_load, rhs_load[0])
+  emitter.EmitVMull('u8', temp[0], lhs_load, rhs_load[1])
+  emitter.EmitVMull('u8', temp[1], lhs_load, rhs_load[2])
+  emitter.EmitVMull('u8', temp[2], lhs_load, rhs_load[3])
+
+  emitter.EmitPldOffset(lhs, emitter.ImmediateConstant(32))
+
+  emitter.EmitNewline()
+  emitter.EmitComment('Subtract counter.')
+  emitter.EmitSubs(count, count, emitter.ImmediateConstant(8))
+
+  emitter.EmitNewline()
+  emitter.EmitVPadal('u16', aggregators[4], temp[4])
+  emitter.EmitVPadal('u16', aggregators[5], temp[0])
+  emitter.EmitVPadal('u16', aggregators[6], temp[1])
+  emitter.EmitVPadal('u16', aggregators[7], temp[2])
+
+  emitter.EmitNewline()
+  emitter.EmitComment('Loop break.')
+  emitter.EmitBgtBack(1)
+
+  registers.FreeRegisters(temp + [lhs_load] + rhs_load)
+
+
+def _GenerateNxMLoadMultiplyAggregate(emitter, registers, kernel_m, kernel_n,
+                                      aggregators, lhs, rhs, count):
+  """Emit inner loop for N rows x M cols multiplication."""
+  emitter.EmitNewline()
+  emitter.EmitComment('General NxM lanes loop.')
+  emitter.EmitNumericalLabel(1)
+  emitter.EmitNewline()
+  emitter.EmitComment('Subtract counter.')
+  emitter.EmitSubs(count, count, emitter.ImmediateConstant(8))
+  emitter.EmitNewline()
+
+  lhs_load = [registers.DoubleRegister() for unused_i in range(kernel_m)]
+  rhs_load = [registers.DoubleRegister() for unused_i in range(kernel_n)]
+
+  emitter.EmitVLoadAE(8 * kernel_m, 8, lhs_load, lhs, 64)
+  emitter.EmitVLoadAE(8 * kernel_n, 8, rhs_load, rhs, 64)
+
+  emitter.EmitPldOffset(lhs, emitter.ImmediateConstant(64))
+  emitter.EmitPldOffset(rhs, emitter.ImmediateConstant(64))
+
+  results = [
+      registers.QuadRegister() for unused_i in range(kernel_m * kernel_n)
+  ]
+
+  for row in range(kernel_m):
+    for col in range(kernel_n):
+      index = row * kernel_n + col
+      emitter.EmitVMull('u8', results[index], rhs_load[col], lhs_load[row])
+
+  for i in range(kernel_m * kernel_n):
+    emitter.EmitVPadal('u16', aggregators[i], results[i])
+
+  emitter.EmitNewline()
+  emitter.EmitComment('Loop break.')
+  emitter.EmitBgtBack(1)
+
+  registers.FreeRegisters(lhs_load + rhs_load + results)
+
+
+def _Generate1xNLoadMultiplyAggregate(emitter, registers, kernel_n, aggregators,
+                                      lhs, rhs, count):
+  """Emit inner loop for 1 row x M cols multiplication."""
+  assert kernel_n in [5, 6, 7, 8]
+  emitter.EmitNewline()
+  emitter.EmitComment('General 1xM lanes loop.')
+  emitter.EmitNumericalLabel(1)
+  emitter.EmitNewline()
+  emitter.EmitComment('Subtract counter.')
+  emitter.EmitSubs(count, count, emitter.ImmediateConstant(8))
+  emitter.EmitNewline()
+
+  leftover = kernel_n - 4
+
+  rhs_load = [registers.DoubleRegister() for unused_i in range(4)]
+  lhs_load = registers.DoubleRegister()
+
+  emitter.EmitVLoadAE(8 * 4, 8, rhs_load, rhs, 64)
+  emitter.EmitVLoadE(8, 8, lhs_load, lhs, 64)
+
+  emitter.EmitPldOffset(lhs, emitter.ImmediateConstant(64))
+
+  results = [registers.QuadRegister() for unused_i in range(4)]
+
+  for i in range(4):
+    emitter.EmitVMull('u8', results[i], rhs_load[i], lhs_load)
+
+  emitter.EmitVLoadAE(8 * leftover, 8, rhs_load, rhs, 64)
+  emitter.EmitPldOffset(rhs, emitter.ImmediateConstant(128))
+
+  for i in range(4):
+    emitter.EmitVPadal('u16', aggregators[i], results[i])
+
+  for i in range(leftover):
+    emitter.EmitVMull('u8', results[i], rhs_load[i], lhs_load)
+
+  for i in range(leftover):
+    emitter.EmitVPadal('u16', aggregators[i + 4], results[i])
+
+  emitter.EmitNewline()
+  emitter.EmitComment('Loop break.')
+  emitter.EmitBgtBack(1)
+
+  registers.FreeRegisters([lhs_load] + rhs_load + results)
+
+
+def _GenerateMultiplyKernel(emitter, registers, kernel_m, kernel_n, lhs, rhs):
+  """Main muliply loop. Pick best implementation for given kernel shape."""
+  count = registers.MapParameter('count', 'params.kernel.count')
+
+  aggregators = _GenerateAndClearAggregators(emitter, registers,
+                                             kernel_m * kernel_n)
+  if kernel_m == 3 and kernel_n == 3:
+    _Generate3x3LoadMultiplyAggregate(emitter, registers, aggregators, lhs, rhs,
+                                      count)
+  elif kernel_m == 2 and kernel_n == 4:
+    _Generate2x4LoadMultiplyAggregate(emitter, registers, aggregators, lhs, rhs,
+                                      count)
+  elif kernel_m == 1 and kernel_n == 8:
+    _Generate1x8LoadMultiplyAggregate(emitter, registers, aggregators, lhs, rhs,
+                                      count)
+  elif kernel_m == 1 and kernel_n > 4:
+    _Generate1xNLoadMultiplyAggregate(emitter, registers, kernel_n, aggregators,
+                                      lhs, rhs, count)
+  else:
+    _GenerateNxMLoadMultiplyAggregate(emitter, registers, kernel_m, kernel_n,
+                                      aggregators, lhs, rhs, count)
+  return aggregators
+
+
+def _ReduceAggregators(emitter, aggregators):
+  reduced_count = (len(aggregators) + 3) // 4
+  reduced = aggregators[:reduced_count]
+  emitter.EmitVSumReduce('u32', len(aggregators), 4, reduced, aggregators)
+  return reduced
+
+
+def _GenerateAggregatorReduce(emitter, aggregators, kernel_m, kernel_n):
+  emitter.EmitNewline()
+  emitter.EmitComment('Reduce aggregators.')
+  row_temps = []
+  for i in range(kernel_m):
+    row_temps.append(
+        _ReduceAggregators(emitter, aggregators[i * kernel_n:(i + 1) *
+                                                kernel_n]))
+  return row_temps
+
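Each aggregator holds four 32-bit partial sums belonging to one cell of the kernel_m x kernel_n result block; _ReduceAggregators collapses every aggregator to a single value and repacks four results per register, one group per result row. In scalar terms this is equivalent to the following reference model (not the emitted code):

```python
def sum_reduce(aggregators):
  # aggregators: 4-lane register values, e.g. [[a0, a1, a2, a3], ...].
  # Each register reduces to one scalar; scalars are packed four per register.
  scalars = [sum(lanes) for lanes in aggregators]
  return [scalars[i:i + 4] for i in range(0, len(scalars), 4)]

assert sum_reduce([[1, 2, 3, 4], [10, 0, 0, 0]]) == [[10, 10]]
```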
+
+class QuantizedMulKernel(common.MulKernelGenerator):
+  """."""
+
+  def __init__(self, cc_emitter, kernel_name, output_stream_name, asm_emitter,
+               fused_transformation, output_strategy):
+    common.MulKernelGenerator.__init__(self, cc_emitter, kernel_name,
+                                       output_stream_name)
+    self.asm_emitter = asm_emitter
+    self.fused_transformation = fused_transformation
+    self.output_strategy = output_strategy
+
+  def EmitMultiply(self, in_type, out_type, kernel_m, kernel_n, pack_size):
+    assert in_type == 'uint8_t'
+    assert pack_size == 8
+    assert kernel_m * kernel_n <= 9
+
+    registers = self.asm_emitter.CreateRegisters()
+
+    self.asm_emitter.PushIndent(self.emitter.indent)
+    self.asm_emitter.EmitAsmBegin()
+
+    lhs = registers.MapOutputParameter('lhs')
+    rhs = registers.MapOutputParameter('rhs')
+    self.asm_emitter.EmitPld(lhs)
+    self.asm_emitter.EmitPld(rhs)
+
+    aggregators = _GenerateMultiplyKernel(self.asm_emitter, registers, kernel_m,
+                                          kernel_n, lhs, rhs)
+
+    self.fused_transformation.Prepare(self.asm_emitter, registers, kernel_m,
+                                      kernel_n, lhs, rhs)
+
+    self.output_strategy.Prepare(self.asm_emitter, registers, kernel_m,
+                                 kernel_n, self.fused_transformation.Type())
+
+    reduced = _GenerateAggregatorReduce(self.asm_emitter, aggregators, kernel_m,
+                                        kernel_n)
+
+    transformed = self.fused_transformation.Transform(self.asm_emitter,
+                                                      registers, reduced,
+                                                      kernel_m, kernel_n)
+
+    self.output_strategy.Output(self.asm_emitter, registers, transformed,
+                                self.fused_transformation.Type(), kernel_m,
+                                kernel_n)
+
+    self.asm_emitter.EmitAsmEnd(registers)
+    self.asm_emitter.PopIndent(len(self.emitter.indent))
+
+
+class QuantizedMulStaticRowMajor(QuantizedMulKernel):
+  """."""
+
+  def __init__(self, cc_emitter, asm_emitter):
+    QuantizedMulKernel.__init__(self, cc_emitter, 'QuantizedStaticPreprocessed',
+                                'RowMajor', asm_emitter,
+                                _StaticQuantizationUInt8Transformation(),
+                                _RowMajorOutput())
+
+
+class QuantizedMulStaticAsInt32RowMajor(QuantizedMulKernel):
+  """."""
+
+  def __init__(self, cc_emitter, asm_emitter):
+    QuantizedMulKernel.__init__(self, cc_emitter,
+                                'QuantizedStaticPreprocessedAsInt32',
+                                'RowMajor', asm_emitter,
+                                _StaticQuantizationInt32Transformation(),
+                                _RowMajorOutput())
+
+
+class QuantizedMulStaticAsFloatRowMajor(QuantizedMulKernel):
+  """."""
+
+  def __init__(self, cc_emitter, asm_emitter):
+    QuantizedMulKernel.__init__(self, cc_emitter,
+                                'QuantizedStaticPreprocessedAsFloat',
+                                'RowMajor', asm_emitter,
+                                _StaticQuantizationFloatTransformation(),
+                                _RowMajorOutput())
+
+
+def GenerateKernels(cc_emitter, asm_emitter, shapes):
+  """Generate the quantized multiplication kernels for uint8 operands."""
+  quantized_mul_static_row_major = QuantizedMulStaticRowMajor(cc_emitter,
+                                                              asm_emitter)
+  quantized_mul_static_int32_row_major = QuantizedMulStaticAsInt32RowMajor(
+      cc_emitter, asm_emitter)
+
+  quantized_mul_static_float_row_major = QuantizedMulStaticAsFloatRowMajor(
+      cc_emitter, asm_emitter)
+
+  for shape in shapes:
+    quantized_mul_static_row_major.SpecializeMulKernel('uint8_t', 'uint8_t',
+                                                       shape[0], shape[1], 8)
+  for shape in shapes:
+    quantized_mul_static_int32_row_major.SpecializeMulKernel('uint8_t',
+                                                             'int32_t',
+                                                             shape[0], shape[1],
+                                                             8)
+
+  for shape in shapes:
+    quantized_mul_static_float_row_major.SpecializeMulKernel('uint8_t', 'float',
+                                                             shape[0], shape[1],
+                                                             8)
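All shapes passed to GenerateKernels by the arm32/arm64 drivers keep kernel_m * kernel_n at nine or below, matching the assert in EmitMultiply: every result cell needs its own quad aggregator on top of the load and temporary registers used by the inner loops. A quick restatement of that constraint against the shape list used above:

```python
shapes = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8),
          (2, 1), (2, 2), (2, 3), (2, 4), (3, 1), (3, 2), (3, 3)]

# Every generated specialization fits the aggregator budget asserted in
# EmitMultiply (kernel_m * kernel_n <= 9).
assert all(m * n <= 9 for (m, n) in shapes)
```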
diff --git a/meta/generators/streams_arm_32.py b/meta/generators/streams_arm_32.py
new file mode 100644
index 0000000..56c1d35
--- /dev/null
+++ b/meta/generators/streams_arm_32.py
@@ -0,0 +1,41 @@
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Generates the arm32 headers used by the gemm/gemv lib."""
+
+import cc_emitter
+import common
+import neon_emitter
+import streams_common
+
+
+def Main():
+  """."""
+  cc = cc_emitter.CCEmitter()
+  common.GenerateHeader(cc, 'gemmlowp_meta_streams_arm_32', 'GEMMLOWP_NEON_32')
+
+  cc.EmitNamespaceBegin('gemmlowp')
+  cc.EmitNamespaceBegin('meta')
+  cc.EmitNewline()
+
+  streams_common.GenerateUInt8x8Streams(cc, neon_emitter.NeonEmitter(), 8)
+
+  cc.EmitNamespaceEnd()
+  cc.EmitNamespaceEnd()
+  cc.EmitNewline()
+
+  common.GenerateFooter(cc, 'Meta gemm for arm32 requires: GEMMLOWP_NEON_32!')
+
+
+if __name__ == '__main__':
+  Main()
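The checked-in meta/*.h headers are the output of these scripts; regenerating one is just a matter of capturing the script's output, assuming (as the CCEmitter usage suggests) that the generated text goes to stdout. A hypothetical regeneration step, with an assumed output path:

```python
import subprocess

# Hypothetical: regenerate the arm32 streams header next to the generators.
# The target path is an assumption, not something this patch prescribes.
with open('../streams_arm_32.h', 'w') as header:
  subprocess.check_call(['python', 'streams_arm_32.py'], stdout=header)
```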
diff --git a/meta/generators/streams_arm_64.py b/meta/generators/streams_arm_64.py
new file mode 100644
index 0000000..ee7bda9
--- /dev/null
+++ b/meta/generators/streams_arm_64.py
@@ -0,0 +1,41 @@
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Generates the arm32 headers used by the gemm/gemv lib."""
+
+import cc_emitter
+import common
+import neon_emitter_64
+import streams_common
+
+
+def Main():
+  """."""
+  cc = cc_emitter.CCEmitter()
+  common.GenerateHeader(cc, 'gemmlowp_meta_streams_arm_64', 'GEMMLOWP_NEON_64')
+
+  cc.EmitNamespaceBegin('gemmlowp')
+  cc.EmitNamespaceBegin('meta')
+  cc.EmitNewline()
+
+  streams_common.GenerateUInt8x8Streams(cc, neon_emitter_64.NeonEmitter64(), 8)
+
+  cc.EmitNamespaceEnd()
+  cc.EmitNamespaceEnd()
+  cc.EmitNewline()
+
+  common.GenerateFooter(cc, 'Meta gemm for arm64 requires: GEMMLOWP_NEON_64!')
+
+
+if __name__ == '__main__':
+  Main()
diff --git a/meta/generators/streams_common.py b/meta/generators/streams_common.py
new file mode 100644
index 0000000..720d3e1
--- /dev/null
+++ b/meta/generators/streams_common.py
@@ -0,0 +1,304 @@
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""."""
+
+import common
+
+
+def _AlignForLanes(lanes_count):
+  if lanes_count == 8 or lanes_count == 4:
+    return 256
+  elif lanes_count == 6 or lanes_count == 2:
+    return 128
+  else:
+    return 64
+
+
+def _AlignForSums(lanes_count):
+  if lanes_count == 8:
+    return 256
+  elif lanes_count in [2, 4, 6]:
+    return 128
+  else:
+    return 64
+
+
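These helpers pick the strongest NEON address-alignment hint (64, 128 or 256 bits) that the emitted stores can rely on: a packed slice is lanes_count rows of 8 bytes, and the hint is the largest supported value dividing that size, capped at 256. A sketch for _AlignForLanes (illustrative; _AlignForSums applies the same idea to the 32-bit sums block):

```python
def align_for_lanes(lanes_count):
  block_bits = lanes_count * 8 * 8  # lanes_count rows of eight uint8 values.
  for hint in (256, 128, 64):       # strongest supported alignment dividing
    if block_bits % hint == 0:      # the packed slice size wins.
      return hint

assert [align_for_lanes(n) for n in range(1, 9)] == [
    64, 128, 64, 256, 64, 128, 64, 256]
```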
+def _GenerateInputs(emitter, registers, lanes_count, input_address, stride):
+  """."""
+  inputs = []
+  last_address_register = input_address
+  for i in range(lanes_count):
+    if not i:
+      inputs.append(input_address)
+    else:
+      address_register = registers.GeneralRegister()
+      inputs.append(address_register)
+      emitter.EmitAdd(address_register, last_address_register, stride)
+      last_address_register = address_register
+  return inputs
+
+
+def _GenerateClear(emitter, clear_type, block):
+  for row in block:
+    emitter.EmitVMov(clear_type, row, emitter.ImmediateConstant(0))
+
+
+def _GenerateLoadAggregateStore(emitter, registers, lanes_count, elements_count,
+                                aggregators, inputs, output):
+  """Emit inner loop code for reading N lanes and interweaving them."""
+  emitter.EmitNewline()
+  emitter.EmitComment('Load Aggregate Store: %dx%d.' % (lanes_count,
+                                                        elements_count))
+
+  block = [registers.DoubleRegister() for unused_i in range(lanes_count)]
+
+  if elements_count != 8:
+    _GenerateClear(emitter, 'i8', block)
+
+  for (row, input_address) in zip(block, inputs):
+    emitter.EmitVLoadE(8, elements_count, row, input_address, None)
+
+  for (aggregator, row) in zip(aggregators, block):
+    emitter.EmitVAddw('u8', aggregator, aggregator, row)
+
+  emitter.EmitVStoreAE(8, 8 * lanes_count, block, output,
+                       _AlignForLanes(lanes_count))
+
+  registers.FreeRegisters(block)
+
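The loop body reads up to eight uint8 values from each of the lanes_count input pointers, widens them into the per-lane uint16 aggregators, and appends the (zero-padded) 8-byte chunks back to back in the packed output. End to end that matches this scalar reference (illustrative only; the real code keeps the sums element-wise in vector lanes until the final reduction):

```python
def pack_slice_with_sums(rows, running_sums):
  """rows: lanes_count rows of at most 8 uint8 values; updates running_sums."""
  packed = []
  for lane, row in enumerate(rows):
    padded = list(row) + [0] * (8 - len(row))  # leftovers are zero-padded
    running_sums[lane] += sum(padded)          # modelled per-lane running sum
    packed.extend(padded)
  return packed

sums = [0, 0]
out = pack_slice_with_sums([[1, 2, 3], [4, 5, 6]], sums)
assert sums == [6, 15] and len(out) == 16
```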
+
+def _LoadMemoryParameter(emitter, registers, name, source):
+  register = registers.GeneralRegister()
+  emitter.EmitLdr(register, registers.MapMemoryParameter(name, source))
+  return register
+
+
+def _GenerateAggregatorReductionLowRegisters(emitter, registers,
+                                             aggregators, output_address):
+  emitter.EmitNewline()
+  emitter.EmitComment('Aggregator Reduction.')
+  _GenerateAggregatorReduction(
+      emitter, registers, aggregators, output_address,
+      _LoadMemoryParameter(emitter, registers, 'multiplicative_sum_offset',
+                           'params.multiplicative_sum_offset'),
+      _LoadMemoryParameter(emitter, registers, 'additive_sum_offset',
+                           'params.additive_sum_offset'))
+
+
+def _GenerateAggregatorReductionHighRegisters(emitter, registers,
+                                              aggregators, output_address):
+  emitter.EmitNewline()
+  emitter.EmitComment('Aggregator Reduction.')
+  _GenerateAggregatorReduction(
+      emitter, registers, aggregators, output_address,
+      registers.MapParameter('multiplicative_sum_offset',
+                             'params.multiplicative_sum_offset'),
+      registers.MapParameter('additive_sum_offset',
+                             'params.additive_sum_offset'))
+
+
+def _GenerateAggregatorReduction(emitter, registers, aggregators,
+                                 output_address, multiplicative_sum_offset,
+                                 additive_sum_offset):
+  """Reduce 4 lane sum aggregators to 1 value and store the sums."""
+  multiplier = registers.DoubleRegister()
+  emitter.EmitVMov('32',
+                   emitter.Lane(32, multiplier, 0), multiplicative_sum_offset)
+
+  offset = registers.QuadRegister()
+  emitter.EmitVDup('32', offset, additive_sum_offset)
+
+  for aggregator in aggregators:
+    emitter.EmitVPaddl('u16', aggregator, aggregator)
+
+  reduced_count = (len(aggregators) + 3) // 4
+  reduced = aggregators[:reduced_count]
+
+  emitter.EmitVSumReduce('u32', len(aggregators), 4, reduced, aggregators)
+
+  for temp in reduced:
+    emitter.EmitVMulScalar('i32', temp, temp, emitter.Lane(32, multiplier, 0))
+
+  for temp in reduced:
+    emitter.EmitVAdd('i32', temp, temp, offset)
+
+  emitter.EmitVStoreA(1, 32, reduced,
+                      emitter.Dereference(output_address,
+                                          _AlignForSums(len(aggregators))))
+
+
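Once every slice has been packed, the per-lane byte sums are collapsed to single 32-bit values, scaled by multiplicative_sum_offset and shifted by additive_sum_offset, then stored through the same output pointer, which by now points just past the packed data. These are the per-row/column correction terms of the low-precision GEMM; as a scalar reference:

```python
def reduced_sums(per_lane_sums, multiplicative_sum_offset, additive_sum_offset):
  return [s * multiplicative_sum_offset + additive_sum_offset
          for s in per_lane_sums]

assert reduced_sums([6, 15], 2, 100) == [112, 130]
```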
+class RowMajorWithSumUInt8x8(common.StreamGenerator):
+  """."""
+
+  def __init__(self, emitter, asm_emitter):
+    common.StreamGenerator.__init__(self, emitter, 'RowMajorWithSum')
+    self.asm_emitter = asm_emitter
+
+  def EmitPack(self, in_type, lanes_count, pack_size, leftovers):
+    assert pack_size == 8
+    assert in_type == 'uint8_t'
+
+    registers = self.asm_emitter.CreateRegisters()
+
+    self.emitter.EmitDeclare('int', 'params_count_copy', 'params.count')
+
+    self.asm_emitter.PushIndent(self.emitter.indent)
+    self.asm_emitter.EmitAsmBegin()
+
+    count = registers.MapOutputParameter('count', 'params_count_copy')
+    output = registers.MapOutputParameter('out')
+    inputs = _GenerateInputs(self.asm_emitter, registers, lanes_count,
+                             registers.MapOutputParameter('in'),
+                             registers.MapParameter('stride', 'params.stride'))
+    aggregators = [registers.QuadRegister(8) for unused_i in range(lanes_count)]
+
+    _GenerateClear(self.asm_emitter, 'i16', aggregators)
+
+    if leftovers:
+      self.asm_emitter.EmitNewline()
+      self.asm_emitter.EmitComment('Reduce count by leftovers.')
+      self.asm_emitter.EmitSubs(count, count,
+                                self.asm_emitter.ImmediateConstant(leftovers))
+      self.asm_emitter.EmitBeqFront(2)
+
+    self.asm_emitter.EmitNewline()
+    self.asm_emitter.EmitNumericalLabel(1)
+    self.asm_emitter.EmitSubs(count, count,
+                              self.asm_emitter.ImmediateConstant(8))
+
+    _GenerateLoadAggregateStore(self.asm_emitter, registers, lanes_count, 8,
+                                aggregators, inputs, output)
+
+    self.asm_emitter.EmitNewline()
+    self.asm_emitter.EmitBneBack(1)
+
+    if leftovers:
+      self.asm_emitter.EmitNewline()
+      self.asm_emitter.EmitNumericalLabel(2)
+      _GenerateLoadAggregateStore(self.asm_emitter, registers, lanes_count,
+                                  leftovers, aggregators, inputs, output)
+
+    registers.FreeRegisters(inputs)
+
+    if len(inputs) <= 6:
+      _GenerateAggregatorReductionHighRegisters(
+          self.asm_emitter, registers, aggregators, output)
+    else:
+      _GenerateAggregatorReductionLowRegisters(
+          self.asm_emitter, registers, aggregators, output)
+
+    self.asm_emitter.EmitAsmEnd(registers)
+    self.asm_emitter.PopIndent(len(self.emitter.indent))
+
+
+def _GenerateColLoadAggregateStore(emitter, registers, lanes_count,
+                                   elements_count, aggregators, input_address,
+                                   stride, output):
+  """Emit inner loop code for reading N col lanes and interweaving them."""
+  emitter.EmitNewline()
+  emitter.EmitComment('Load Aggregate Store - column major %dx%d' %
+                      (lanes_count, elements_count))
+
+  block = [registers.DoubleRegister() for unused_i in range(lanes_count)]
+
+  if elements_count != 8:
+    _GenerateClear(emitter, 'i8', block)
+
+  block = emitter.EmitLoadColBlock(registers, 8, lanes_count, elements_count,
+                                   block, input_address, stride)
+
+  for (aggregator, row) in zip(aggregators, block):
+    emitter.EmitVAddw('u8', aggregator, aggregator, row)
+
+  emitter.EmitVStoreAE(8, 8 * lanes_count, block, output,
+                       _AlignForLanes(lanes_count))
+
+  registers.FreeRegisters(block)
+
+
+class ColumnMajorWithSumUInt8x8(common.StreamGenerator):
+  """."""
+
+  def __init__(self, emitter, asm_emitter):
+    common.StreamGenerator.__init__(self, emitter, 'ColumnMajorWithSum')
+    self.asm_emitter = asm_emitter
+
+  def EmitPack(self, in_type, lanes_count, pack_size, leftovers):
+    assert pack_size == 8
+    assert in_type == 'uint8_t'
+
+    registers = self.asm_emitter.CreateRegisters()
+
+    self.emitter.EmitDeclare('int', 'params_count_copy', 'params.count')
+    self.emitter.EmitDeclare('int', 'params_stride_copy', 'params.stride')
+
+    self.asm_emitter.PushIndent(self.emitter.indent)
+    self.asm_emitter.EmitAsmBegin()
+
+    count = registers.MapOutputParameter('count', 'params_count_copy')
+    input_address = registers.MapOutputParameter('in')
+    output_address = registers.MapOutputParameter('out')
+    aggregators = [registers.QuadRegister(8) for unused_i in range(lanes_count)]
+    stride = registers.MapOutputParameter('stride', 'params_stride_copy')
+
+    self.asm_emitter.EmitColBlockStride(lanes_count, stride, stride)
+
+    _GenerateClear(self.asm_emitter, 'i16', aggregators)
+
+    if leftovers:
+      self.asm_emitter.EmitNewline()
+      self.asm_emitter.EmitComment('Reduce count by leftovers.')
+      self.asm_emitter.EmitSubs(count, count,
+                                self.asm_emitter.ImmediateConstant(leftovers))
+      self.asm_emitter.EmitBeqFront(2)
+
+    self.asm_emitter.EmitNewline()
+    self.asm_emitter.EmitNumericalLabel(1)
+    self.asm_emitter.EmitSubs(count, count,
+                              self.asm_emitter.ImmediateConstant(8))
+
+    _GenerateColLoadAggregateStore(self.asm_emitter, registers, lanes_count, 8,
+                                   aggregators, input_address, stride,
+                                   output_address)
+
+    self.asm_emitter.EmitNewline()
+    self.asm_emitter.EmitBneBack(1)
+
+    if leftovers:
+      self.asm_emitter.EmitNewline()
+      self.asm_emitter.EmitNumericalLabel(2)
+      _GenerateColLoadAggregateStore(self.asm_emitter, registers, lanes_count,
+                                     leftovers, aggregators, input_address,
+                                     stride, output_address)
+
+
+    _GenerateAggregatorReductionHighRegisters(
+        self.asm_emitter, registers, aggregators, output_address)
+
+    self.asm_emitter.EmitAsmEnd(registers)
+    self.asm_emitter.PopIndent(len(self.emitter.indent))
+
+
+def GenerateUInt8x8Streams(cc_emitter, asm_emitter, max_lanes_count):
+  """Generate row- and column-major packing streams for all lane counts."""
+  row_major_with_sum = RowMajorWithSumUInt8x8(cc_emitter, asm_emitter)
+  column_major_with_sum = ColumnMajorWithSumUInt8x8(cc_emitter, asm_emitter)
+
+  for lanes_count in range(1, 1 + max_lanes_count):
+    for leftovers in range(8):
+      row_major_with_sum.SpecializeStream('uint8_t', lanes_count, 8, leftovers)
+
+  for lanes_count in range(1, 1 + max_lanes_count):
+    for leftovers in range(8):
+      column_major_with_sum.SpecializeStream('uint8_t', lanes_count, 8,
+                                             leftovers)
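GenerateUInt8x8Streams therefore emits an 8 x 8 grid of specializations per layout: lane counts 1..8 times leftovers 0..7, where leftovers is, in effect, count % 8 (EmitPack peels exactly that many trailing elements after the main 8-wide loop). A hypothetical selector, only to make the indexing concrete:

```python
def stream_specialization(lanes_count, count, pack_size=8):
  # Hypothetical helper: the (lanes, pack_size, leftovers) variant that would
  # handle a stream of `count` elements.
  return (lanes_count, pack_size, count % pack_size)

assert stream_specialization(3, 100) == (3, 8, 4)
```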
diff --git a/meta/generators/transform_kernels_arm_32.py b/meta/generators/transform_kernels_arm_32.py
new file mode 100644
index 0000000..97ca276
--- /dev/null
+++ b/meta/generators/transform_kernels_arm_32.py
@@ -0,0 +1,44 @@
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Generates the arm32 headers used by the gemm/gemv lib."""
+
+import cc_emitter
+import common
+import neon_emitter
+import transform_kernels_common
+
+
+def Main():
+  """."""
+  cc = cc_emitter.CCEmitter()
+  common.GenerateHeader(cc, 'gemmlowp_meta_transform_kernels_arm_32',
+                        'GEMMLOWP_NEON_32')
+
+  cc.EmitNamespaceBegin('gemmlowp')
+  cc.EmitNamespaceBegin('meta')
+  cc.EmitNewline()
+
+  transform_kernels_common.GenerateKernels(cc,
+                                           neon_emitter.NeonEmitter(),
+                                           [(16, x) for x in range(16)])
+
+  cc.EmitNamespaceEnd()
+  cc.EmitNamespaceEnd()
+  cc.EmitNewline()
+
+  common.GenerateFooter(cc, 'Meta gemm for arm32 requires: GEMMLOWP_NEON_32!')
+
+
+if __name__ == '__main__':
+  Main()
diff --git a/meta/generators/transform_kernels_arm_64.py b/meta/generators/transform_kernels_arm_64.py
new file mode 100644
index 0000000..8245f33
--- /dev/null
+++ b/meta/generators/transform_kernels_arm_64.py
@@ -0,0 +1,44 @@
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Generates the arm32 headers used by the gemm/gemv lib."""
+
+import cc_emitter
+import common
+import neon_emitter_64
+import transform_kernels_common
+
+
+def Main():
+  """."""
+  cc = cc_emitter.CCEmitter()
+  common.GenerateHeader(cc, 'gemmlowp_meta_transform_kernels_arm_64',
+                        'GEMMLOWP_NEON_64')
+
+  cc.EmitNamespaceBegin('gemmlowp')
+  cc.EmitNamespaceBegin('meta')
+  cc.EmitNewline()
+
+  transform_kernels_common.GenerateKernels(cc,
+                                           neon_emitter_64.NeonEmitter64(),
+                                           [(16, x) for x in range(16)])
+
+  cc.EmitNamespaceEnd()
+  cc.EmitNamespaceEnd()
+  cc.EmitNewline()
+
+  common.GenerateFooter(cc, 'Meta gemm for arm64 requires: GEMMLOWP_NEON_64!')
+
+
+if __name__ == '__main__':
+  Main()
diff --git a/meta/generators/transform_kernels_common.py b/meta/generators/transform_kernels_common.py
new file mode 100644
index 0000000..436b40c
--- /dev/null
+++ b/meta/generators/transform_kernels_common.py
@@ -0,0 +1,590 @@
+# Copyright 2016 The Gemmlowp Authors. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""."""
+
+import common
+
+
+def _DuplicateGeneralRegister(size, emitter, registers, value, min_register):
+  register = registers.QuadRegister(min_register)
+  emitter.EmitVDup(size, register, value)
+  return register
+
+
+def _DuplicateGeneralMemoryRegister(size, emitter, registers, value,
+                                    min_register):
+  register = registers.QuadRegister(min_register)
+  general = registers.GeneralRegister()
+  emitter.EmitLdr(general, value)
+  emitter.EmitVDup(size, register, general)
+  registers.FreeRegister(general)
+  return register
+
+
+class MinMaxTransformation(object):
+  """."""
+
+  def Check(self, in_type, out_type, kernel_size, leftovers):
+    assert in_type == 'uint8_t'
+    assert out_type == 'uint8_t'
+    assert kernel_size == 16
+    assert leftovers < 16
+
+  def Prepare(self, emitter, registers, unused_kernel_size):
+    emitter.EmitNewline()
+    emitter.EmitComment('MinMax::Prepare')
+
+    self.min = _DuplicateGeneralRegister(8, emitter, registers,
+                                         registers.MapParameter('min',
+                                                                'params.min'),
+                                         4)
+    self.max = _DuplicateGeneralRegister(8, emitter, registers,
+                                         registers.MapParameter('max',
+                                                                'params.max'),
+                                         4)
+
+  def Transform(self, emitter, registers, input_address, elements,
+                output_address):
+    """Generate the MinMax transform inner loop code."""
+    emitter.EmitNewline()
+    emitter.EmitComment('MinMax::Transform')
+    register_count = (elements + 15) // 16
+    load = [registers.QuadRegister() for unused_i in range(register_count)]
+    emitter.EmitVLoadAE(8, elements, load, input_address, None)
+    emitter.EmitPldOffset(input_address, emitter.ImmediateConstant(16))
+
+    for register in load:
+      emitter.EmitVMax('u8', register, register, self.min)
+
+    for register in load:
+      emitter.EmitVMin('u8', register, register, self.max)
+
+    emitter.EmitNewline()
+    emitter.EmitVStoreAE(8, elements, load, output_address, None)
+    emitter.EmitPld(output_address)
+    registers.FreeRegisters(load)
+
+
+class DequantizeTransformation(object):
+  """."""
+
+  def Check(self, in_type, out_type, kernel_size, leftovers):
+    assert in_type == 'uint8_t'
+    assert out_type == 'float'
+    assert kernel_size == 16
+    assert leftovers < 16
+
+  def Prepare(self, emitter, registers, unused_kernel_size):
+    """Duplicate quantization offsets to vector registers."""
+    emitter.EmitNewline()
+    emitter.EmitComment('Dequantize::Prepare')
+
+    self.range_min = _DuplicateGeneralRegister(
+        32, emitter, registers,
+        registers.MapParameter('range_min', 'params.range_min'), 4)
+    self.range_offset = _DuplicateGeneralRegister(
+        32, emitter, registers,
+        registers.MapParameter('range_offset', 'params.range_offset'), 4)
+    self.range_scale = _DuplicateGeneralRegister(
+        32, emitter, registers,
+        registers.MapParameter('range_scale', 'params.range_scale'), 4)
+
+  def Transform(self, emitter, registers, input_address, elements,
+                output_address):
+    """Emit the dequantization inner loop."""
+    emitter.EmitNewline()
+    emitter.EmitComment('Dequantize::Transform')
+    register_count = (elements + 3) // 4
+    load = [registers.QuadRegister() for unused_i in range(register_count)]
+    emitter.EmitVLoadAE(8, elements, load, input_address, None)
+    emitter.EmitPldOffset(input_address, emitter.ImmediateConstant(32))
+
+    if len(load) == 1:
+      emitter.EmitVMovl('u8', load[0], load[0])
+      emitter.EmitVMovl('s16', load[0], load[0])
+    elif len(load) == 2:
+      emitter.EmitVMovl('u8', load[0], load[0])
+      emitter.EmitVMovl2('s16', load[0], load[1], load[0])
+    elif len(load) == 3:
+      emitter.EmitVMovl2('u8', load[0], load[1], load[0])
+      emitter.EmitVMovl('s16', load[2], load[1])
+      emitter.EmitVMovl2('s16', load[0], load[1], load[0])
+    elif len(load) == 4:
+      emitter.EmitVMovl2('u8', load[0], load[1], load[0])
+      emitter.EmitVMovl2('s16', load[2], load[3], load[1])
+      emitter.EmitVMovl2('s16', load[0], load[1], load[0])
+    else:
+      assert False
+
+    for register in load:
+      emitter.EmitVCvt('f32', 's32', register, register)
+
+    for register in load:
+      emitter.EmitVSub('f32', register, register, self.range_offset)
+
+    for register in load:
+      emitter.EmitVMul('f32', register, register, self.range_scale)
+
+    for register in load:
+      emitter.EmitVAdd('f32', register, register, self.range_min)
+
+    emitter.EmitNewline()
+    emitter.EmitVStoreAE(32, elements, load, output_address, None)
+    emitter.EmitPld(output_address)
+    registers.FreeRegisters(load)
+
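Numerically, Dequantize recovers a float as (x - range_offset) * range_scale + range_min, with x the uint8 value widened to int32 and converted to float. A scalar reference of the vector arithmetic above:

```python
def dequantize_reference(values, range_min, range_offset, range_scale):
  return [(float(x) - range_offset) * range_scale + range_min for x in values]

# With offset 0 and scale 1 the values simply shift by range_min.
assert dequantize_reference([0, 255], -1.0, 0.0, 1.0) == [-1.0, 254.0]
```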
+
+class QuantizeTransformation(object):
+  """."""
+
+  def Check(self, in_type, out_type, kernel_size, leftovers):
+    assert in_type == 'float'
+    assert out_type == 'uint8_t'
+    assert kernel_size == 16
+    assert leftovers < 16
+
+  def Prepare(self, emitter, registers, unused_kernel_size):
+    """Duplicate quantization offsets to vector registers."""
+    emitter.EmitNewline()
+    emitter.EmitComment('Quantize::Prepare')
+
+    self.range_min = _DuplicateGeneralRegister(
+        32, emitter, registers,
+        registers.MapParameter('range_min', 'params.range_min'), 4)
+    self.range_offset = _DuplicateGeneralRegister(
+        32, emitter, registers,
+        registers.MapParameter('range_offset', 'params.range_offset'), 4)
+    self.range_scale = _DuplicateGeneralRegister(
+        32, emitter, registers,
+        registers.MapParameter('range_scale', 'params.range_scale'), 4)
+
+  def Transform(self, emitter, registers, input_address, elements,
+                output_address):
+    """Emit quantization inner loop code."""
+    emitter.EmitNewline()
+    emitter.EmitComment('Quantize::Transform')
+    register_count = (elements + 3) // 4
+    load = [registers.QuadRegister() for unused_i in range(register_count)]
+    emitter.EmitVLoadAE(32, elements, load, input_address, None)
+    emitter.EmitPldOffset(input_address, emitter.ImmediateConstant(64))
+
+    for register in load:
+      emitter.EmitVSub('f32', register, register, self.range_min)
+
+    for register in load:
+      emitter.EmitVMul('f32', register, register, self.range_scale)
+
+    for register in load:
+      emitter.EmitVAdd('f32', register, register, self.range_offset)
+
+    for register in load:
+      emitter.EmitVCvt('s32', 'f32', register, register)
+
+    if len(load) == 1:
+      emitter.EmitVQmovn('s32', load[0], load[0])
+      emitter.EmitVQmovun('s16', load[0], load[0])
+    elif len(load) == 2:
+      emitter.EmitVQmovn2('s32', load[0], load[0], load[1])
+      emitter.EmitVQmovun('s16', load[0], load[0])
+    elif len(load) == 3:
+      emitter.EmitVQmovn2('s32', load[0], load[0], load[1])
+      emitter.EmitVQmovn('s32', load[2], load[2])
+      emitter.EmitVQmovun2('s16', load[0], load[0], load[2])
+    elif len(load) == 4:
+      emitter.EmitVQmovn2('s32', load[0], load[0], load[1])
+      emitter.EmitVQmovn2('s32', load[2], load[2], load[3])
+      emitter.EmitVQmovun2('s16', load[0], load[0], load[2])
+    else:
+      assert False
+
+    emitter.EmitNewline()
+    emitter.EmitVStoreAE(8, elements, load, output_address, None)
+    emitter.EmitPld(output_address)
+    registers.FreeRegisters(load)
+
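Quantize is the inverse mapping: (x - range_min) * range_scale + range_offset, converted to int32 (the plain VCVT rounds toward zero) and saturated into uint8 by the narrowing moves. A scalar reference:

```python
def quantize_reference(values, range_min, range_offset, range_scale):
  out = []
  for x in values:
    q = int((x - range_min) * range_scale + range_offset)  # truncation, as VCVT
    out.append(max(0, min(255, q)))                        # saturating narrow
  return out

assert quantize_reference([-1.0, 254.0, 1000.0], -1.0, 0.0, 1.0) == [0, 255, 255]
```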
+
+class RequantizeTransformation(object):
+  """."""
+
+  def Check(self, in_type, out_type, kernel_size, leftovers):
+    assert in_type == 'int32_t'
+    assert out_type == 'uint8_t'
+    assert kernel_size == 16
+    assert leftovers < 16
+
+  def Prepare(self, emitter, registers, unused_kernel_size):
+    """Duplicate quantization parameters to vector registers."""
+    emitter.EmitNewline()
+    emitter.EmitComment('Requantize::Prepare')
+
+    self.range_min_delta = _DuplicateGeneralRegister(
+        32, emitter, registers,
+        registers.MapParameter('input_range_min', 'params.input_range_min'), 4)
+    self.output_range_min = _DuplicateGeneralRegister(
+        32, emitter, registers,
+        registers.MapParameter('output_range_min', 'params.output_range_min'),
+        4)
+    self.input_range_offset = _DuplicateGeneralRegister(
+        32, emitter, registers,
+        registers.MapParameter('input_range_offset',
+                               'params.input_range_offset'), 4)
+    self.input_range_scale = _DuplicateGeneralRegister(
+        32, emitter, registers,
+        registers.MapParameter('input_range_scale', 'params.input_range_scale'),
+        4)
+    self.one_over_output_range_scale = _DuplicateGeneralRegister(
+        32, emitter, registers,
+        registers.MapParameter('one_over_output_range_scale',
+                               'params.one_over_output_range_scale'), 4)
+    emitter.EmitVSub('f32', self.range_min_delta, self.range_min_delta,
+                     self.output_range_min)
+
+  def Transform(self, emitter, registers, input_address, elements,
+                output_address):
+    """Emit requantization inner loop code."""
+    emitter.EmitNewline()
+    emitter.EmitComment('Requantize::Transform')
+    register_count = (elements + 3) // 4
+    load = [registers.QuadRegister() for unused_i in range(register_count)]
+    emitter.EmitVLoadAE(32, elements, load, input_address, None)
+    emitter.EmitPldOffset(input_address, emitter.ImmediateConstant(64))
+
+    for register in load:
+      emitter.EmitVCvt('f32', 's32', register, register)
+
+    for register in load:
+      emitter.EmitVSub('f32', register, register, self.input_range_offset)
+
+    for register in load:
+      emitter.EmitVMul('f32', register, register, self.input_range_scale)
+
+    for register in load:
+      emitter.EmitVAdd('f32', register, register, self.range_min_delta)
+
+    for register in load:
+      emitter.EmitVMul('f32', register, register,
+                       self.one_over_output_range_scale)
+
+    for register in load:
+      emitter.EmitVCvt('s32', 'f32', register, register)
+
+    if len(load) == 1:
+      emitter.EmitVQmovn('s32', load[0], load[0])
+      emitter.EmitVQmovun('s16', load[0], load[0])
+    elif len(load) == 2:
+      emitter.EmitVQmovn2('s32', load[0], load[0], load[1])
+      emitter.EmitVQmovun('s16', load[0], load[0])
+    elif len(load) == 3:
+      emitter.EmitVQmovn2('s32', load[0], load[0], load[1])
+      emitter.EmitVQmovn('s32', load[2], load[2])
+      emitter.EmitVQmovun2('s16', load[0], load[0], load[2])
+    elif len(load) == 4:
+      emitter.EmitVQmovn2('s32', load[0], load[0], load[1])
+      emitter.EmitVQmovn2('s32', load[2], load[2], load[3])
+      emitter.EmitVQmovun2('s16', load[0], load[0], load[2])
+    else:
+      assert False
+
+    emitter.EmitNewline()
+    emitter.EmitVStoreAE(8, elements, load, output_address, None)
+    emitter.EmitPld(output_address)
+    registers.FreeRegisters(load)
+
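Requantize maps int32 values from the input quantization range to uint8 in the output range. With delta = input_range_min - output_range_min (folded into range_min_delta in Prepare), each element becomes ((x - input_range_offset) * input_range_scale + delta) * one_over_output_range_scale, truncated and saturated exactly as in Quantize. A scalar reference:

```python
def requantize_reference(values, input_range_offset, input_range_scale,
                         range_min_delta, one_over_output_range_scale):
  out = []
  for x in values:
    f = (float(x) - input_range_offset) * input_range_scale + range_min_delta
    q = int(f * one_over_output_range_scale)  # truncation, then uint8 saturation
    out.append(max(0, min(255, q)))
  return out

assert requantize_reference([0, 1 << 20], 0.0, 1.0 / 4096.0, 0.0, 1.0) == [0, 255]
```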
+
+class BaseTransform(common.Transform1DKernelGenerator):
+  """."""
+
+  def __init__(self, cc_emitter, kernel_name, asm_emitter, transformation):
+    common.Transform1DKernelGenerator.__init__(self, cc_emitter, kernel_name)
+    self.asm_emitter = asm_emitter
+    self.transformation = transformation
+
+  def EmitTransform(self, in_type, out_type, kernel_size, leftovers):
+    """."""
+    self.transformation.Check(in_type, out_type, kernel_size, leftovers)
+
+    registers = self.asm_emitter.CreateRegisters()
+
+    self.emitter.EmitDeclare('int', 'params_count_copy', 'params.count')
+
+    self.asm_emitter.PushIndent(self.emitter.indent)
+    self.asm_emitter.EmitAsmBegin()
+
+    count = registers.MapOutputParameter('count', 'params_count_copy')
+    input_address = registers.MapOutputParameter('input')
+    output_address = registers.MapOutputParameter('output')
+
+    self.transformation.Prepare(self.asm_emitter, registers, kernel_size)
+
+    if leftovers:
+      self.asm_emitter.EmitNewline()
+      self.asm_emitter.EmitComment('Reduce count by leftovers.')
+      self.asm_emitter.EmitSubs(count, count,
+                                self.asm_emitter.ImmediateConstant(leftovers))
+      self.asm_emitter.EmitBeqFront(2)
+
+    self.asm_emitter.EmitNewline()
+    self.asm_emitter.EmitNumericalLabel(1)
+    self.asm_emitter.EmitSubs(count, count,
+                              self.asm_emitter.ImmediateConstant(kernel_size))
+
+    self.transformation.Transform(self.asm_emitter, registers, input_address,
+                                  kernel_size, output_address)
+
+    self.asm_emitter.EmitNewline()
+    self.asm_emitter.EmitBneBack(1)
+
+    if leftovers:
+      self.asm_emitter.EmitNumericalLabel(2)
+      self.asm_emitter.EmitNewline()
+      self.asm_emitter.EmitComment('Handle leftovers.')
+      self.transformation.Transform(self.asm_emitter, registers, input_address,
+                                    leftovers, output_address)
+
+    self.asm_emitter.EmitAsmEnd(registers)
+    self.asm_emitter.PopIndent(len(self.emitter.indent))
+
+
+class Requantize(BaseTransform):
+  """."""
+
+  def __init__(self, cc_emitter, asm_emitter):
+    BaseTransform.__init__(self, cc_emitter, 'Requantize', asm_emitter,
+                           RequantizeTransformation())
+
+
+class Quantize(BaseTransform):
+  """."""
+
+  def __init__(self, cc_emitter, asm_emitter):
+    BaseTransform.__init__(self, cc_emitter, 'Quantize', asm_emitter,
+                           QuantizeTransformation())
+
+
+class Dequantize(BaseTransform):
+  """."""
+
+  def __init__(self, cc_emitter, asm_emitter):
+    BaseTransform.__init__(self, cc_emitter, 'Dequantize', asm_emitter,
+                           DequantizeTransformation())
+
+
+class MinMax(BaseTransform):
+  """."""
+
+  def __init__(self, numerical_type, cc_emitter, asm_emitter):
+    BaseTransform.__init__(self, cc_emitter, 'MinMax<%s>' % numerical_type,
+                           asm_emitter, MinMaxTransformation())
+
+
+class BiasAdd(common.Transform1DKernelGenerator):
+  """."""
+
+  def __init__(self, bias_type, cc_emitter, asm_emitter):
+    common.Transform1DKernelGenerator.__init__(self, cc_emitter,
+                                               'BiasAdd<%s>' % bias_type)
+    self.asm_emitter = asm_emitter
+
+  def EmitTransform(self, in_type, out_type, kernel_size, leftovers):
+    """."""
+    assert in_type is 'uint8_t'
+    assert out_type is 'int32_t'
+    assert kernel_size is 16
+    assert leftovers < 16
+
+    registers = self.asm_emitter.CreateRegisters()
+
+    self.emitter.EmitDeclare('int', 'params_rows_copy', 'params.rows')
+
+    self.asm_emitter.PushIndent(self.emitter.indent)
+    self.asm_emitter.EmitAsmBegin()
+
+    self._Prepare(self.asm_emitter, registers)
+
+    rows = registers.MapParameter('rows', 'params_rows_copy')
+
+    self.asm_emitter.EmitNumericalLabel(1)
+
+    self._ProcessRow(self.asm_emitter, registers, kernel_size, leftovers)
+
+    self.asm_emitter.EmitSubs(rows, rows, self.asm_emitter.ImmediateConstant(1))
+    self.asm_emitter.EmitBneBack(1)
+
+    self.asm_emitter.EmitAsmEnd(registers)
+    self.asm_emitter.PopIndent(len(self.emitter.indent))
+
+  def _Prepare(self, emitter, registers):
+    self.input_range_min = _DuplicateGeneralMemoryRegister(
+        32, emitter, registers,
+        registers.MapMemoryParameter('input_range_min',
+                                     'params.input_range_min'), 8)
+    self.input_range_scale = _DuplicateGeneralMemoryRegister(
+        32, emitter, registers,
+        registers.MapMemoryParameter('input_range_scale',
+                                     'params.input_range_scale'), 8)
+    self.bias_range_min = _DuplicateGeneralMemoryRegister(
+        32, emitter, registers,
+        registers.MapMemoryParameter('bias_range_min', 'params.bias_range_min'),
+        8)
+    self.bias_range_scale = _DuplicateGeneralMemoryRegister(
+        32, emitter, registers,
+        registers.MapMemoryParameter('bias_range_scale',
+                                     'params.bias_range_scale'), 8)
+    self.output_range_min = _DuplicateGeneralMemoryRegister(
+        32, emitter, registers,
+        registers.MapMemoryParameter('output_range_min',
+                                     'params.output_range_min'), 8)
+    self.one_over_output_range_scale = _DuplicateGeneralMemoryRegister(
+        32, emitter, registers,
+        registers.MapMemoryParameter('one_over_output_range_scale',
+                                     'params.one_over_output_range_scale'), 8)
+    self.output_range_offset = _DuplicateGeneralMemoryRegister(
+        32, emitter, registers,
+        registers.MapMemoryParameter('output_range_offset',
+                                     'params.output_range_offset'), 8)
+
+  def _ProcessRow(self, emitter, registers, kernel_size, leftovers):
+    const_count = registers.MapParameter('count', 'params.count')
+    const_bias = registers.MapParameter('bias', 'params.bias')
+
+    count = registers.GeneralRegister()
+    bias = registers.GeneralRegister()
+
+    input_address = registers.MapOutputParameter('input')
+    output_address = registers.MapOutputParameter('output')
+
+    emitter.EmitMov(count, const_count)
+    emitter.EmitMov(bias, const_bias)
+
+    if leftovers:
+      emitter.EmitSubs(count, count, emitter.ImmediateConstant(leftovers))
+      emitter.EmitBeqFront(3)
+
+    emitter.EmitNumericalLabel(2)
+    emitter.EmitSubs(count, count, emitter.ImmediateConstant(kernel_size))
+
+    self._BiasAdd(emitter, registers, kernel_size, input_address, bias,
+                  output_address)
+
+    emitter.EmitBneBack(2)
+
+    if leftovers:
+      emitter.EmitNumericalLabel(3)
+      self._BiasAdd(emitter, registers, leftovers, input_address, bias,
+                    output_address)
+
+  def _BiasAdd(self, emitter, registers, elements, input_address, bias,
+               output_address):
+    emitter.EmitNewline()
+    emitter.EmitComment('BiasAdd::Transform')
+    register_count = (elements + 3) // 4
+
+    load_input = [
+        registers.QuadRegister() for unused_i in range(register_count)
+    ]
+    load_bias = [registers.QuadRegister() for unused_i in range(register_count)]
+
+    emitter.EmitVLoadAE(8, elements, load_input, input_address, None)
+    emitter.EmitVLoadAE(8, elements, load_bias, bias, None)
+    emitter.EmitPldOffset(input_address, emitter.ImmediateConstant(32))
+
+    if len(load_input) == 1:
+      emitter.EmitVMovl('u8', load_input[0], load_input[0])
+      emitter.EmitVMovl('u8', load_bias[0], load_bias[0])
+      emitter.EmitVMovl('s16', load_input[0], load_input[0])
+      emitter.EmitVMovl('s16', load_bias[0], load_bias[0])
+    elif len(load_input) == 2:
+      emitter.EmitVMovl('u8', load_input[0], load_input[0])
+      emitter.EmitVMovl('u8', load_bias[0], load_bias[0])
+      emitter.EmitVMovl2('s16', load_input[0], load_input[1], load_input[0])
+      emitter.EmitVMovl2('s16', load_bias[0], load_bias[1], load_bias[0])
+    elif len(load_input) == 3:
+      emitter.EmitVMovl2('u8', load_input[0], load_input[1], load_input[0])
+      emitter.EmitVMovl2('u8', load_bias[0], load_bias[1], load_bias[0])
+      emitter.EmitVMovl('s16', load_input[2], load_input[1])
+      emitter.EmitVMovl('s16', load_bias[2], load_bias[1])
+      emitter.EmitVMovl2('s16', load_input[0], load_input[1], load_input[0])
+      emitter.EmitVMovl2('s16', load_bias[0], load_bias[1], load_bias[0])
+    elif len(load_input) == 4:
+      emitter.EmitVMovl2('u8', load_input[0], load_input[1], load_input[0])
+      emitter.EmitVMovl2('u8', load_bias[0], load_bias[1], load_bias[0])
+      emitter.EmitVMovl2('s16', load_input[2], load_input[3], load_input[1])
+      emitter.EmitVMovl2('s16', load_bias[2], load_bias[3], load_bias[1])
+      emitter.EmitVMovl2('s16', load_input[0], load_input[1], load_input[0])
+      emitter.EmitVMovl2('s16', load_bias[0], load_bias[1], load_bias[0])
+    else:
+      assert False
+
+    for register in load_input + load_bias:
+      emitter.EmitVCvt('f32', 's32', register, register)
+
+    for register in load_input:
+      emitter.EmitVMul('f32', register, register, self.input_range_scale)
+
+    for register in load_bias:
+      emitter.EmitVMul('f32', register, register, self.bias_range_scale)
+
+    for register in load_input:
+      emitter.EmitVAdd('f32', register, register, self.input_range_min)
+
+    for register in load_bias:
+      emitter.EmitVAdd('f32', register, register, self.bias_range_min)
+
+    for (register_1, register_2) in zip(load_input, load_bias):
+      emitter.EmitVAdd('f32', register_1, register_1, register_2)
+
+    for register in load_input:
+      emitter.EmitVSub('f32', register, register, self.output_range_min)
+
+    for register in load_input:
+      emitter.EmitVMul('f32', register, register,
+                       self.one_over_output_range_scale)
+
+    for register in load_input:
+      emitter.EmitVAdd('f32', register, register, self.output_range_offset)
+
+    for register in load_input:
+      emitter.EmitVCvt('s32', 'f32', register, register)
+
+    emitter.EmitNewline()
+    emitter.EmitVStoreAE(32, elements, load_input, output_address, None)
+    emitter.EmitPld(output_address)
+    registers.FreeRegisters(load_input + load_bias)
+
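For reference, the NEON sequence emitted by `_BiasAdd` above amounts, per element, to a dequantize-add-requantize round trip. The following scalar sketch is illustrative only (it is not generated or shipped code, and the function name is hypothetical); the parameters correspond to the `params.*` range fields duplicated into registers by `_Prepare`.

```
// Illustrative scalar model of the emitted BiasAdd kernel (not generated code).
#include <cstdint>

std::int32_t BiasAddScalar(std::uint8_t input, std::uint8_t bias,
                           float input_range_min, float input_range_scale,
                           float bias_range_min, float bias_range_scale,
                           float output_range_min,
                           float one_over_output_range_scale,
                           float output_range_offset) {
  // Dequantize both operands (scale, then shift by the range minimum).
  const float real_input = input * input_range_scale + input_range_min;
  const float real_bias = bias * bias_range_scale + bias_range_min;
  // Add in real space, then requantize into the output range as int32.
  const float sum = real_input + real_bias;
  return static_cast<std::int32_t>(
      (sum - output_range_min) * one_over_output_range_scale +
      output_range_offset);
}
```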
+
+def GenerateKernels(cc_emitter, asm_emitter, shapes):
+  """Generate the quantization/dequantization/requantization kernels."""
+  requantize = Requantize(cc_emitter, asm_emitter)
+  quantize = Quantize(cc_emitter, asm_emitter)
+  dequantize = Dequantize(cc_emitter, asm_emitter)
+  minmax = MinMax('uint8_t', cc_emitter, asm_emitter)
+  biasadd = BiasAdd('uint8_t', cc_emitter, asm_emitter)
+
+  for shape in shapes:
+    requantize.SpecializeTransform1DKernel('int32_t', 'uint8_t', shape[0],
+                                           shape[1])
+
+  for shape in shapes:
+    quantize.SpecializeTransform1DKernel('float', 'uint8_t', shape[0], shape[1])
+
+  for shape in shapes:
+    dequantize.SpecializeTransform1DKernel('uint8_t', 'float', shape[0],
+                                           shape[1])
+
+  for shape in shapes:
+    minmax.SpecializeTransform1DKernel('uint8_t', 'uint8_t', shape[0], shape[1])
+
+  for shape in shapes:
+    biasadd.SpecializeTransform1DKernel('uint8_t', 'int32_t', shape[0],
+                                        shape[1])
diff --git a/meta/generators/zip_Nx8_neon.py b/meta/generators/zip_Nx8_neon.py
index 2a1e8c5..4bf485e 100644
--- a/meta/generators/zip_Nx8_neon.py
+++ b/meta/generators/zip_Nx8_neon.py
@@ -5,7 +5,6 @@
 end.
 """
 
-
 import neon_emitter
 
 
@@ -42,13 +41,11 @@
   last_address_register = input_address
   for i in range(0, zip_lanes):
     if not i:
-      lanes.append(ZipLane(input_address,
-                           registers.DoubleRegister(),
+      lanes.append(ZipLane(input_address, registers.DoubleRegister(),
                            registers.QuadRegister(2)))
     else:
       address_register = registers.GeneralRegister()
-      lanes.append(ZipLane(address_register,
-                           registers.DoubleRegister(),
+      lanes.append(ZipLane(address_register, registers.DoubleRegister(),
                            registers.QuadRegister(2)))
       emitter.EmitAdd(address_register, last_address_register, stride)
       last_address_register = address_register
@@ -88,8 +85,8 @@
                       emitter.DereferenceIncrement(output_address, 64))
 
 
-def GenerateLeftoverLoadAggregateStore(
-    emitter, leftovers, lanes, output_address):
+def GenerateLeftoverLoadAggregateStore(emitter, leftovers, lanes,
+                                       output_address):
   """Handle leftovers when count is not a multiply of 8."""
   emitter.EmitNewline()
   emitter.EmitComment('Leftover Load Aggregate Store.')
@@ -111,9 +108,8 @@
   elif leftovers == 3:
     # Load 16 bits.
     for lane in lanes:
-      emitter.EmitVLoad(
-          '1.16', emitter.Lane(lane.load, 0),
-          emitter.DereferenceIncrement(lane.input_address, None))
+      emitter.EmitVLoad('1.16', emitter.Lane(lane.load, 0),
+                        emitter.DereferenceIncrement(lane.input_address, None))
     # Load 8 bits.
     for lane in lanes:
       emitter.EmitVLoad('1.8', emitter.Lane(lane.load, 2),
@@ -126,9 +122,8 @@
   elif leftovers == 5:
     # Load 32 bits..
     for lane in lanes:
-      emitter.EmitVLoad(
-          '1.32', emitter.Lane(lane.load, 0),
-          emitter.DereferenceIncrement(lane.input_address, None))
+      emitter.EmitVLoad('1.32', emitter.Lane(lane.load, 0),
+                        emitter.DereferenceIncrement(lane.input_address, None))
     # Load 8 bits.
     for lane in lanes:
       emitter.EmitVLoad('1.8', emitter.Lane(lane.load, 4),
@@ -136,9 +131,8 @@
   elif leftovers == 6:
     # Load 32 bits..
     for lane in lanes:
-      emitter.EmitVLoad(
-          '1.32', emitter.Lane(lane.load, 0),
-          emitter.DereferenceIncrement(lane.input_address, None))
+      emitter.EmitVLoad('1.32', emitter.Lane(lane.load, 0),
+                        emitter.DereferenceIncrement(lane.input_address, None))
     # Load 16 bits.
     for lane in lanes:
       emitter.EmitVLoad('1.16', emitter.Lane(lane.load, 2),
@@ -146,14 +140,12 @@
   elif leftovers == 7:
     # Load 32 bits..
     for lane in lanes:
-      emitter.EmitVLoad(
-          '1.32', emitter.Lane(lane.load, 0),
-          emitter.DereferenceIncrement(lane.input_address, None))
+      emitter.EmitVLoad('1.32', emitter.Lane(lane.load, 0),
+                        emitter.DereferenceIncrement(lane.input_address, None))
     # Load 16 bits.
     for lane in lanes:
-      emitter.EmitVLoad(
-          '1.16', emitter.Lane(lane.load, 2),
-          emitter.DereferenceIncrement(lane.input_address, None))
+      emitter.EmitVLoad('1.16', emitter.Lane(lane.load, 2),
+                        emitter.DereferenceIncrement(lane.input_address, None))
     # Load 8 bits.
     for lane in lanes:
       emitter.EmitVLoad('1.8', emitter.Lane(lane.load, 6),
@@ -172,12 +164,8 @@
                       emitter.DereferenceIncrement(output_address, 64))
 
 
-def GenerateAggregatorReduction(emitter,
-                                registers,
-                                lanes,
-                                output_address,
-                                multiplicative_offset,
-                                additive_offset):
+def GenerateAggregatorReduction(emitter, registers, lanes, output_address,
+                                multiplicative_offset, additive_offset):
   """Reduce 4 lane sum aggregators to 1 value and store the sums."""
   emitter.EmitNewline()
   emitter.EmitComment('Aggregator Reduction.')
@@ -194,9 +182,7 @@
   for lane in lanes:
     lane_temp = registers.DoubleRegister()
     lane_temps.append(lane_temp)
-    emitter.EmitVPadd('u32',
-                      lane_temp,
-                      registers.Low(lane.aggregator),
+    emitter.EmitVPadd('u32', lane_temp, registers.Low(lane.aggregator),
                       registers.High(lane.aggregator))
 
   temp = registers.QuadRegister()
@@ -214,46 +200,41 @@
     emitter.EmitVPadd('u32', low, lane_temps[0], lane_temps[1])
     emitter.EmitVPadd('u32', high, lane_temps[2], lane_temps[3])
   else:
-    raise ConfigurationError(
-        'Unexpected number of aggregators to reduce: %d' % len(lanes))
+    raise ConfigurationError('Unexpected number of aggregators to reduce: %d' %
+                             len(lanes))
 
   emitter.EmitVMul('i32', temp, temp, emitter.Lane(multiplier, 0))
   emitter.EmitVAdd('i32', temp, temp, offset)
 
   if len(lanes) == 1:
-    emitter.EmitVStore(
-        '1.32', emitter.Lane(low, 0), emitter.Dereference(output_address, None))
+    emitter.EmitVStore('1.32', emitter.Lane(low, 0),
+                       emitter.Dereference(output_address, None))
   elif len(lanes) == 2:
     emitter.EmitVStore('1.32', low, emitter.Dereference(output_address, 64))
   elif len(lanes) == 3:
-    emitter.EmitVStore(
-        '1.32', low, emitter.DereferenceIncrement(output_address, 64))
-    emitter.EmitVStore(
-        '1.32', emitter.Lane(high, 0),
-        emitter.Dereference(output_address, None))
+    emitter.EmitVStore('1.32', low,
+                       emitter.DereferenceIncrement(output_address, 64))
+    emitter.EmitVStore('1.32', emitter.Lane(high, 0),
+                       emitter.Dereference(output_address, None))
   elif len(lanes) == 4:
-    emitter.EmitVStore(
-        '1.32', low, emitter.DereferenceIncrement(output_address, 64))
-    emitter.EmitVStore('1.32', high, emitter.Dereference(output_address, 64))
+    emitter.EmitVStoreA('1.32', [low, high],
+                        emitter.DereferenceIncrement(output_address, 64))
 
 
 def GenerateZipNx8(emitter, zip_lanes, leftovers, aligned):
   """Emit the zip function for a given number of rows and row size leftovers."""
   if leftovers < 0 or leftovers > 7:
     raise ConfigurationError('Leftovers should be between 0 and 7 inclusive.')
-  if zip_lanes < 1 or zip_lanes > 3:
-    raise ConfigurationError('Zip_lanes should should be 1, 2 or 3.')
+  if zip_lanes < 1 or zip_lanes > 4:
+    raise ConfigurationError('Zip_lanes should be 1, 2, 3 or 4.')
 
   name = BuildName(zip_lanes, leftovers, aligned)
 
-  emitter.EmitFunctionBeginA(name,
-                             [['const std::uint8_t*', 'source'],
-                              ['std::int32_t', 'count'],
-                              ['std::int32_t', 'stride'],
-                              ['std::uint8_t*', 'destination'],
-                              ['std::int32_t', 'multiplicative_offset'],
-                              ['std::int32_t', 'additive_offset']],
-                             'void')
+  emitter.EmitFunctionBeginA(
+      name, [['const std::uint8_t*', 'source'], ['std::int32_t', 'count'],
+             ['std::int32_t', 'stride'], ['std::uint8_t*', 'destination'],
+             ['std::int32_t', 'multiplicative_offset'],
+             ['std::int32_t', 'additive_offset']], 'void')
   emitter.EmitAssert('count %% 8 == %d' % leftovers)
   emitter.EmitAssert('count <= 2048')
   emitter.EmitAssert('count >= 8')
@@ -269,9 +250,7 @@
   count = registers.MapParameter('count')
   output_address = registers.MapParameter('destination')
 
-  lanes = GenerateZipLanes(emitter,
-                           registers,
-                           zip_lanes,
+  lanes = GenerateZipLanes(emitter, registers, zip_lanes,
                            registers.MapParameter('source'),
                            registers.MapParameter('stride'))
 
@@ -284,32 +263,28 @@
   emitter.EmitNumericalLabel(1)
   emitter.EmitSubs(count, count, emitter.ImmediateConstant(8))
 
-  GenerateLoadAggregateStore(
-      emitter, lanes, output_address, 64 if aligned else None)
+  GenerateLoadAggregateStore(emitter, lanes, output_address, 64 if aligned else
+                             None)
 
   emitter.EmitNewline()
   emitter.EmitBneBack(1)
 
   if leftovers:
-    GenerateLeftoverLoadAggregateStore(
-        emitter, leftovers, lanes, output_address)
+    GenerateLeftoverLoadAggregateStore(emitter, leftovers, lanes,
+                                       output_address)
 
-  GenerateAggregatorReduction(emitter,
-                              registers,
-                              lanes,
-                              output_address,
+  GenerateAggregatorReduction(emitter, registers, lanes, output_address,
                               registers.MapParameter('multiplicative_offset'),
                               registers.MapParameter('additive_offset'))
 
-  emitter.EmitAsmEnd(registers.MappedParameters(),
-                     [],
+  emitter.EmitAsmEnd(registers.MappedParameters(), [],
                      registers.Clobbers() + ['cc', 'memory'])
   emitter.EmitFunctionEnd()
 
 
 def GenerateFunctions(emitter):
   for aligned in [True, False]:
-    for lanes in range(1, 4):
+    for lanes in range(1, 5):
       for leftovers in range(0, 8):
         GenerateZipNx8(emitter, lanes, leftovers, aligned)
         emitter.EmitNewline()
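For orientation, the aggregator reduction generated above boils down to a per-row postlude: each zipped row's byte sum is scaled by `multiplicative_offset`, shifted by `additive_offset`, and stored as one int32 per row. The following scalar equivalent is an illustration only (the function name is hypothetical, not part of the patch):

```
// Scalar model of the generated aggregator reduction (hypothetical helper).
#include <cstdint>

void AggregatorReductionScalar(const std::uint8_t* const* rows, int zip_lanes,
                               std::int32_t count,
                               std::int32_t multiplicative_offset,
                               std::int32_t additive_offset,
                               std::int32_t* sums_out) {
  for (int lane = 0; lane < zip_lanes; ++lane) {
    std::int32_t sum = 0;
    for (std::int32_t i = 0; i < count; ++i) {
      sum += rows[lane][i];
    }
    sums_out[lane] = sum * multiplicative_offset + additive_offset;
  }
}
```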
diff --git a/meta/legacy_multi_thread_common.h b/meta/legacy_multi_thread_common.h
new file mode 100644
index 0000000..d4cec6f
--- /dev/null
+++ b/meta/legacy_multi_thread_common.h
@@ -0,0 +1,151 @@
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// legacy_multi_thread_common.h: Multithreading code shared by the legacy meta
+// gemm versions.
+
+#ifndef GEMMLOWP_META_LEGACY_MULTI_THREAD_COMMON_H_
+#define GEMMLOWP_META_LEGACY_MULTI_THREAD_COMMON_H_
+
+#include "../internal/multi_thread_gemm.h"
+
+namespace gemmlowp {
+namespace meta {
+namespace internal {
+
+const std::int32_t kMinTaskSize = 16000;
+const std::int32_t kMinTaskDimension = 4;
+
+struct TaskRect {
+  std::int32_t m_offset;
+  std::int32_t m;
+  std::int32_t n_offset;
+  std::int32_t n;
+
+  TaskRect(std::int32_t m_offset, std::int32_t m, std::int32_t n_offset,
+           std::int32_t n)
+      : m_offset(m_offset), m(m), n_offset(n_offset), n(n) {}
+};
+
+template <typename IN_TYPE, typename OUT_TYPE, typename F>
+struct MetaTask : gemmlowp::Task {
+  std::uint8_t* scratch;
+  const IN_TYPE* lhs;
+  const IN_TYPE* rhs;
+  TaskRect task_rect;
+  std::int32_t k;
+  OUT_TYPE* result;
+  std::int32_t result_stride;
+  const F& operation;
+
+  MetaTask(std::uint8_t* scratch, const IN_TYPE* lhs, const IN_TYPE* rhs,
+           const TaskRect& task_rect, std::int32_t k, OUT_TYPE* result,
+           std::int32_t result_stride, const F& operation)
+      : scratch(scratch),
+        lhs(lhs),
+        rhs(rhs),
+        task_rect(task_rect),
+        k(k),
+        result(result),
+        result_stride(result_stride),
+        operation(operation) {}
+
+  void Run() override {
+    const IN_TYPE* task_lhs = lhs + task_rect.m_offset * k;
+    const IN_TYPE* task_rhs = rhs + task_rect.n_offset * k;
+    OUT_TYPE* task_result =
+        result + task_rect.m_offset * result_stride + task_rect.n_offset;
+    operation.ExecuteMatrixMatrix(scratch, task_lhs, task_rhs, task_rect.m,
+                                  task_rect.n, k, task_result, result_stride);
+  }
+};
+
+std::int32_t ResolveMaxThreads(std::int32_t max_threads) {
+  if (max_threads == 0) {
+    static const int hardware_threads_count =
+        static_cast<int>(sysconf(_SC_NPROCESSORS_CONF));
+    return hardware_threads_count;
+  }
+  return max_threads;
+}
+
+void PrepareTasks(std::int32_t max_tasks, std::int32_t m, std::int32_t n,
+                  std::int32_t k, std::vector<internal::TaskRect>* tasks) {
+  const std::int32_t max_tasks_by_size = (m * n * k) / kMinTaskSize;
+  const std::int32_t max_tasks_m = m / kMinTaskDimension;
+  const std::int32_t max_tasks_n = n / kMinTaskDimension;
+  const std::int32_t max_tasks_dimension = std::max(max_tasks_m, max_tasks_n);
+
+  std::int32_t real_tasks = std::max(
+      1, std::min(max_tasks, std::min(max_tasks_by_size, max_tasks_dimension)));
+
+  if (real_tasks == 1) {
+    tasks->push_back(TaskRect(0, m, 0, n));
+    return;
+  }
+
+  if (max_tasks_m > max_tasks_n) {
+    const std::int32_t m_chunk = m / real_tasks;
+    for (int i = 0; i < real_tasks - 1; ++i) {
+      tasks->push_back(TaskRect(i * m_chunk, m_chunk, 0, n));
+    }
+    const std::int32_t last_m_offset = (real_tasks - 1) * m_chunk;
+    tasks->push_back(TaskRect(last_m_offset, m - last_m_offset, 0, n));
+  } else {
+    const std::int32_t n_chunk = n / real_tasks;
+    for (int i = 0; i < real_tasks - 1; ++i) {
+      tasks->push_back(TaskRect(0, m, i * n_chunk, n_chunk));
+    }
+    const std::int32_t last_n_offset = (real_tasks - 1) * n_chunk;
+    tasks->push_back(TaskRect(0, m, last_n_offset, n - last_n_offset));
+  }
+}
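As a concrete illustration of the partitioning logic above, the following hypothetical snippet (not part of the patch; the include path and function name are assumptions) splits a 64x256 product of depth 64 across at most four tasks, which ends up as four 64-column strips:

```
// Hypothetical usage sketch; assumes the gemmlowp tree is on the include path.
#include <vector>
#include "meta/legacy_multi_thread_common.h"

void ExamplePartition() {
  std::vector<gemmlowp::meta::internal::TaskRect> tasks;
  // max_tasks_by_size = (64 * 256 * 64) / 16000 = 65, max_tasks_m = 16,
  // max_tasks_n = 64, so real_tasks = min(4, 65, 64) = 4. Since n offers more
  // parallelism than m, the n dimension is split; the resulting TaskRects are
  // (0, 64, 0, 64), (0, 64, 64, 64), (0, 64, 128, 64) and (0, 64, 192, 64).
  gemmlowp::meta::internal::PrepareTasks(4, 64, 256, 64, &tasks);
}
```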
+
+template <typename IN_TYPE, typename OUT_TYPE, typename F>
+void MultiThreadedMatrixMatrix(gemmlowp::WorkersPool* pool,
+                               std::int32_t max_threads, std::uint8_t* scratch,
+                               const IN_TYPE* lhs, const IN_TYPE* rhs,
+                               std::int32_t m, std::int32_t n, std::int32_t k,
+                               OUT_TYPE* result, std::int32_t result_stride,
+                               const F& operation) {
+  max_threads = internal::ResolveMaxThreads(max_threads);
+
+  std::vector<internal::TaskRect> task_rects;
+  internal::PrepareTasks(max_threads, m, n, k, &task_rects);
+
+  if (task_rects.size() == 1) {
+    operation.ExecuteMatrixMatrix(scratch, lhs, rhs, m, n, k, result,
+                                  result_stride);
+    return;
+  }
+
+  std::uint8_t* task_scratch = scratch;
+  std::int32_t scratch_per_thread = operation.ScratchPerThread(m, n, k);
+  std::vector<Task*> tasks;
+  std::for_each(
+      task_rects.begin(), task_rects.end(),
+      [&tasks, &task_scratch, lhs, rhs, k, result, result_stride, operation,
+       scratch_per_thread](internal::TaskRect& rect) {
+        tasks.push_back(new internal::MetaTask<IN_TYPE, OUT_TYPE, F>(
+            task_scratch, lhs, rhs, rect, k, result, result_stride, operation));
+        task_scratch += scratch_per_thread;
+      });
+  pool->Execute(tasks);
+}
+
+}  // namespace internal
+}  // namespace meta
+}  // namespace gemmlowp
+
+#endif  // GEMMLOWP_META_LEGACY_MULTI_THREAD_COMMON_H_
diff --git a/meta/legacy_multi_thread_gemm.h b/meta/legacy_multi_thread_gemm.h
new file mode 100644
index 0000000..a1e48c3
--- /dev/null
+++ b/meta/legacy_multi_thread_gemm.h
@@ -0,0 +1,260 @@
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_LEGACY_MULTI_THREAD_GEMM_H_
+#define GEMMLOWP_META_LEGACY_MULTI_THREAD_GEMM_H_
+
+#include "../internal/common.h"
+
+#ifdef GEMMLOWP_NEON
+
+#include "legacy_multi_thread_common.h"
+#include "legacy_multi_thread_gemv.h"
+#include "legacy_operations_common.h"
+#include "legacy_single_thread_gemm.h"
+
+namespace gemmlowp {
+namespace meta {
+namespace internal {
+
+const std::int32_t kMaxCacheFriendlySize = 256 * 1024;
+
+template <typename IN_TYPE, typename OUT_TYPE, typename F>
+void CacheFriendlyMatrixMatrix(std::uint8_t* scratch, const IN_TYPE* lhs,
+                               const IN_TYPE* rhs, std::int32_t m,
+                               std::int32_t n, std::int32_t k, OUT_TYPE* result,
+                               std::int32_t result_stride, const F& operation) {
+  const std::int32_t rhs_size = n * k * sizeof(IN_TYPE);
+  if (rhs_size > kMaxCacheFriendlySize) {
+    const std::int32_t optimal_n =
+        std::max(1, 4 * (kMaxCacheFriendlySize / (k * 4)));
+    const std::int32_t chunks_count_less_one = n / optimal_n - 1;
+    const std::int32_t chunk_size = optimal_n * k;
+    for (int i = 0; i < chunks_count_less_one; ++i) {
+      operation.ExecuteCacheFriendlyMatrixMatrix(
+          scratch, lhs, rhs + i * chunk_size, m, optimal_n, k,
+          result + i * optimal_n, result_stride);
+    }
+    const std::int32_t n_left = n - chunks_count_less_one * optimal_n;
+    operation.ExecuteCacheFriendlyMatrixMatrix(
+        scratch, lhs, rhs + chunks_count_less_one * chunk_size, m, n_left, k,
+        result + chunks_count_less_one * optimal_n, result_stride);
+  } else {
+    operation.ExecuteCacheFriendlyMatrixMatrix(scratch, lhs, rhs, m, n, k,
+                                               result, result_stride);
+  }
+}
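To make the chunking policy above concrete, a small worked example (illustration only; the constants are duplicated locally under hypothetical names rather than taken from the header):

```
// Worked example of the chunk sizing above for a depth of k = 1024.
#include <cstdint>

constexpr std::int32_t kExampleMaxCacheFriendlySize = 256 * 1024;
constexpr std::int32_t kExampleK = 1024;
// optimal_n = max(1, 4 * (262144 / 4096)) = 256 columns per chunk, so each
// ExecuteCacheFriendlyMatrixMatrix call touches 256 * 1024 bytes (256 KiB)
// of RHS, i.e. exactly the cache-friendly budget.
constexpr std::int32_t kExampleOptimalN =
    4 * (kExampleMaxCacheFriendlySize / (kExampleK * 4));
static_assert(kExampleOptimalN == 256, "example arithmetic");
```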
+
+class GemmQuantized8BitOperation : public Quantized8BitOperation {
+ public:
+  GemmQuantized8BitOperation(std::int32_t lhs_offset, std::int32_t rhs_offset,
+                             std::int32_t sum_offset, std::int32_t multiplier,
+                             std::int32_t shift)
+      : Quantized8BitOperation(lhs_offset, rhs_offset, sum_offset, multiplier,
+                               shift) {}
+
+  void ExecuteMatrixMatrix(std::uint8_t* scratch, const std::uint8_t* lhs,
+                           const std::uint8_t* rhs, std::int32_t m,
+                           std::int32_t n, std::int32_t k, std::uint8_t* result,
+                           std::int32_t result_stride) const {
+    CacheFriendlyMatrixMatrix(scratch, lhs, rhs, m, n, k, result, result_stride,
+                              *this);
+  }
+
+  void ExecuteCacheFriendlyMatrixMatrix(std::uint8_t* scratch,
+                                        const std::uint8_t* lhs,
+                                        const std::uint8_t* rhs, std::int32_t m,
+                                        std::int32_t n, std::int32_t k,
+                                        std::uint8_t* result,
+                                        std::int32_t result_stride) const {
+    gemm_q8_strided(scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
+                    sum_offset, multiplier, shift, result, result_stride);
+  }
+
+  static std::int32_t ScratchPerThread(std::int32_t m, std::int32_t n,
+                                       std::int32_t k) {
+    return 4 * kMaxCacheFriendlySize;
+  }
+};
+
+class GemmFloatOperation : public FloatOperation {
+ public:
+  GemmFloatOperation(std::int32_t lhs_offset, std::int32_t rhs_offset,
+                     float result_offset)
+      : FloatOperation(lhs_offset, rhs_offset, result_offset) {}
+
+  void ExecuteMatrixMatrix(std::uint8_t* scratch, const std::uint8_t* lhs,
+                           const std::uint8_t* rhs, std::int32_t m,
+                           std::int32_t n, std::int32_t k, float* result,
+                           std::int32_t result_stride) const {
+    CacheFriendlyMatrixMatrix(scratch, lhs, rhs, m, n, k, result, result_stride,
+                              *this);
+  }
+
+  void ExecuteCacheFriendlyMatrixMatrix(std::uint8_t* scratch,
+                                        const std::uint8_t* lhs,
+                                        const std::uint8_t* rhs, std::int32_t m,
+                                        std::int32_t n, std::int32_t k,
+                                        float* result,
+                                        std::int32_t result_stride) const {
+    gemm_f_strided(scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
+                   result_offset, result, result_stride);
+  }
+
+  static std::int32_t ScratchPerThread(std::int32_t m, std::int32_t n,
+                                       std::int32_t k) {
+    return 4 * kMaxCacheFriendlySize;
+  }
+};
+
+class GemmInt32Operation : public Int32Operation {
+ public:
+  GemmInt32Operation(std::int32_t lhs_offset, std::int32_t rhs_offset)
+      : Int32Operation(lhs_offset, rhs_offset) {}
+
+  void ExecuteMatrixMatrix(std::uint8_t* scratch, const std::uint8_t* lhs,
+                           const std::uint8_t* rhs, std::int32_t m,
+                           std::int32_t n, std::int32_t k, std::int32_t* result,
+                           std::int32_t result_stride) const {
+    CacheFriendlyMatrixMatrix(scratch, lhs, rhs, m, n, k, result, result_stride,
+                              *this);
+  }
+
+  void ExecuteCacheFriendlyMatrixMatrix(std::uint8_t* scratch,
+                                        const std::uint8_t* lhs,
+                                        const std::uint8_t* rhs, std::int32_t m,
+                                        std::int32_t n, std::int32_t k,
+                                        std::int32_t* result,
+                                        std::int32_t result_stride) const {
+    gemm_i32_strided(scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset, result,
+                     result_stride);
+  }
+
+  static std::int32_t ScratchPerThread(std::int32_t m, std::int32_t n,
+                                       std::int32_t k) {
+    return 4 * kMaxCacheFriendlySize;
+  }
+};
+
+}  // namespace internal
+
+std::int32_t gemm_q8_scratch(std::int32_t m, std::int32_t n, std::int32_t k,
+                             std::int32_t max_threads) {
+  return internal::ResolveMaxThreads(max_threads) *
+         internal::GemmQuantized8BitOperation::ScratchPerThread(m, n, k);
+}
+
+void multi_thread_gemm_q8(gemmlowp::WorkersPool* pool, std::int32_t max_threads,
+                          std::uint8_t* scratch, const std::uint8_t* lhs,
+                          const std::uint8_t* rhs, std::int32_t m,
+                          std::int32_t n, std::int32_t k,
+                          std::int32_t lhs_offset, std::int32_t rhs_offset,
+                          std::int32_t sum_offset, std::int32_t multiplier,
+                          std::int32_t shift, std::uint8_t* result) {
+  if (m == 1) {
+    multi_thread_gemv_q8(pool, max_threads, scratch, lhs, rhs, n, k, lhs_offset,
+                         rhs_offset, sum_offset, multiplier, shift, result);
+    return;
+  } else if (n == 1) {
+    multi_thread_gemv_q8(pool, max_threads, scratch, rhs, lhs, m, k, rhs_offset,
+                         lhs_offset, sum_offset, multiplier, shift, result);
+    return;
+  }
+
+  max_threads = internal::ResolveMaxThreads(max_threads);
+  internal::GemmQuantized8BitOperation operation(lhs_offset, rhs_offset,
+                                                 sum_offset, multiplier, shift);
+  if (max_threads == 1) {
+    internal::CacheFriendlyMatrixMatrix(scratch, lhs, rhs, m, n, k, result, n,
+                                        operation);
+  } else {
+    internal::MultiThreadedMatrixMatrix(pool, max_threads, scratch, lhs, rhs, m,
+                                        n, k, result, n, operation);
+  }
+}
+
+std::int32_t gemm_f_scratch(std::int32_t m, std::int32_t n, std::int32_t k,
+                            std::int32_t max_threads) {
+  return internal::ResolveMaxThreads(max_threads) *
+         internal::GemmFloatOperation::ScratchPerThread(m, n, k);
+}
+
+void multi_thread_gemm_f(gemmlowp::WorkersPool* pool, std::int32_t max_threads,
+                         std::uint8_t* scratch, const std::uint8_t* lhs,
+                         const std::uint8_t* rhs, std::int32_t m,
+                         std::int32_t n, std::int32_t k,
+                         std::int32_t lhs_offset, std::int32_t rhs_offset,
+                         float result_offset, float* result) {
+  if (m == 1) {
+    multi_thread_gemv_f(pool, max_threads, scratch, lhs, rhs, n, k, lhs_offset,
+                        rhs_offset, result_offset, result);
+    return;
+  } else if (n == 1) {
+    multi_thread_gemv_f(pool, max_threads, scratch, rhs, lhs, m, k, rhs_offset,
+                        lhs_offset, result_offset, result);
+    return;
+  }
+
+  max_threads = internal::ResolveMaxThreads(max_threads);
+  internal::GemmFloatOperation operation(lhs_offset, rhs_offset, result_offset);
+  if (max_threads == 1) {
+    internal::CacheFriendlyMatrixMatrix(scratch, lhs, rhs, m, n, k, result, n,
+                                        operation);
+  } else {
+    internal::MultiThreadedMatrixMatrix(pool, max_threads, scratch, lhs, rhs, m,
+                                        n, k, result, n, operation);
+  }
+}
+
+std::int32_t gemm_i32_scratch(std::int32_t m, std::int32_t n, std::int32_t k,
+                              std::int32_t max_threads) {
+  return internal::ResolveMaxThreads(max_threads) *
+         internal::GemmInt32Operation::ScratchPerThread(m, n, k);
+}
+
+void multi_thread_gemm_i32(gemmlowp::WorkersPool* pool,
+                           std::int32_t max_threads, std::uint8_t* scratch,
+                           const std::uint8_t* lhs, const std::uint8_t* rhs,
+                           std::int32_t m, std::int32_t n, std::int32_t k,
+                           std::int32_t lhs_offset, std::int32_t rhs_offset,
+                           std::int32_t* result) {
+  if (m == 1) {
+    multi_thread_gemv_i32(pool, max_threads, scratch, lhs, rhs, n, k,
+                          lhs_offset, rhs_offset, result);
+    return;
+  } else if (n == 1) {
+    multi_thread_gemv_i32(pool, max_threads, scratch, rhs, lhs, m, k,
+                          rhs_offset, lhs_offset, result);
+    return;
+  }
+
+  max_threads = internal::ResolveMaxThreads(max_threads);
+  internal::GemmInt32Operation operation(lhs_offset, rhs_offset);
+  if (max_threads == 1) {
+    internal::CacheFriendlyMatrixMatrix(scratch, lhs, rhs, m, n, k, result, n,
+                                        operation);
+  } else {
+    internal::MultiThreadedMatrixMatrix(pool, max_threads, scratch, lhs, rhs, m,
+                                        n, k, result, n, operation);
+  }
+}
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#else
+#warning "Meta gemm fast-path requires GEMMLOWP_NEON_(32|64)!"
+#endif
+
+#endif  // GEMMLOWP_META_LEGACY_MULTI_THREAD_GEMM_H_
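A hypothetical caller of the legacy 8-bit entry point above might look as follows. This is a sketch only: the include path, the function name, the WorkersPool setup and all quantization parameter values are placeholders, and the scratch buffer is sized with gemm_q8_scratch as the code above expects.

```
// Hypothetical usage sketch, not part of the patch; requires a NEON build
// with the gemmlowp tree on the include path.
#include <cstdint>
#include <vector>
#include "meta/legacy_multi_thread_gemm.h"

void ExampleLegacyGemmQ8(gemmlowp::WorkersPool* pool) {
  const std::int32_t m = 64, n = 64, k = 128, max_threads = 4;
  // Both operands are row-major with depth k innermost: lhs is m x k,
  // rhs is n x k; the result is m x n with stride n.
  std::vector<std::uint8_t> lhs(m * k), rhs(n * k), result(m * n);
  std::vector<std::uint8_t> scratch(
      gemmlowp::meta::gemm_q8_scratch(m, n, k, max_threads));
  gemmlowp::meta::multi_thread_gemm_q8(
      pool, max_threads, scratch.data(), lhs.data(), rhs.data(), m, n, k,
      /*lhs_offset=*/-100, /*rhs_offset=*/-100, /*sum_offset=*/0,
      /*multiplier=*/1 << 16, /*shift=*/16, result.data());
}
```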
diff --git a/meta/legacy_multi_thread_gemv.h b/meta/legacy_multi_thread_gemv.h
new file mode 100644
index 0000000..7af5684
--- /dev/null
+++ b/meta/legacy_multi_thread_gemv.h
@@ -0,0 +1,168 @@
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// legacy_multi_thread_gemv.h: Entry point to the legacy multithreaded version
+// of the generated (meta) gemv library.
+
+#ifndef GEMMLOWP_META_LEGACY_MULTI_THREAD_GEMV_H_
+#define GEMMLOWP_META_LEGACY_MULTI_THREAD_GEMV_H_
+
+#ifdef GEMMLOWP_NEON
+
+#include "legacy_multi_thread_common.h"
+#include "legacy_operations_common.h"
+#include "legacy_single_thread_gemm.h"
+
+namespace gemmlowp {
+namespace meta {
+namespace internal {
+
+class GemvQuantized8BitOperation : public Quantized8BitOperation {
+ public:
+  GemvQuantized8BitOperation(std::int32_t lhs_offset, std::int32_t rhs_offset,
+                             std::int32_t sum_offset, std::int32_t multiplier,
+                             std::int32_t shift)
+      : Quantized8BitOperation(lhs_offset, rhs_offset, sum_offset, multiplier,
+                               shift) {}
+
+  void ExecuteMatrixMatrix(std::uint8_t* scratch, const std::uint8_t* lhs,
+                           const std::uint8_t* rhs, std::int32_t m,
+                           std::int32_t n, std::int32_t k, std::uint8_t* result,
+                           std::int32_t result_stride) const {
+    gemv_q8(scratch, lhs, rhs, n, k, lhs_offset, rhs_offset, sum_offset,
+            multiplier, shift, result);
+  }
+
+  static std::int32_t ScratchPerThread(std::int32_t m, std::int32_t n,
+                                       std::int32_t k) {
+    return 128 * 1024;
+  }
+};
+
+class GemvFloatOperation : public FloatOperation {
+ public:
+  GemvFloatOperation(std::int32_t lhs_offset, std::int32_t rhs_offset,
+                     float result_offset)
+      : FloatOperation(lhs_offset, rhs_offset, result_offset) {}
+
+  void ExecuteMatrixMatrix(std::uint8_t* scratch, const std::uint8_t* lhs,
+                           const std::uint8_t* rhs, std::int32_t m,
+                           std::int32_t n, std::int32_t k, float* result,
+                           std::int32_t result_stride) const {
+    gemv_f(scratch, lhs, rhs, n, k, lhs_offset, rhs_offset, result_offset,
+           result);
+  }
+
+  static std::int32_t ScratchPerThread(std::int32_t m, std::int32_t n,
+                                       std::int32_t k) {
+    return 128 * 1024;
+  }
+};
+
+class GemvInt32Operation : public Int32Operation {
+ public:
+  GemvInt32Operation(std::int32_t lhs_offset, std::int32_t rhs_offset)
+      : Int32Operation(lhs_offset, rhs_offset) {}
+
+  void ExecuteMatrixMatrix(std::uint8_t* scratch, const std::uint8_t* lhs,
+                           const std::uint8_t* rhs, std::int32_t m,
+                           std::int32_t n, std::int32_t k, std::int32_t* result,
+                           std::int32_t result_stride) const {
+    gemv_i32(scratch, lhs, rhs, n, k, lhs_offset, rhs_offset, result);
+  }
+
+  static std::int32_t ScratchPerThread(std::int32_t m, std::int32_t n,
+                                       std::int32_t k) {
+    return 128 * 1024;
+  }
+};
+
+}  // namespace internal
+
+std::int32_t gemv_q8_scratch(std::int32_t m, std::int32_t n, std::int32_t k,
+                             std::int32_t max_threads) {
+  return internal::ResolveMaxThreads(max_threads) *
+         internal::GemvQuantized8BitOperation::ScratchPerThread(m, n, k);
+}
+
+void multi_thread_gemv_q8(gemmlowp::WorkersPool* pool, std::int32_t max_threads,
+                          std::uint8_t* scratch, const std::uint8_t* lhs,
+                          const std::uint8_t* rhs, std::int32_t n,
+                          std::int32_t k, std::int32_t lhs_offset,
+                          std::int32_t rhs_offset, std::int32_t sum_offset,
+                          std::int32_t multiplier, std::int32_t shift,
+                          std::uint8_t* result) {
+  max_threads = internal::ResolveMaxThreads(max_threads);
+  internal::GemvQuantized8BitOperation operation(lhs_offset, rhs_offset,
+                                                 sum_offset, multiplier, shift);
+  if (max_threads == 1) {
+    operation.ExecuteMatrixMatrix(scratch, lhs, rhs, 1, n, k, result, n);
+  } else {
+    internal::MultiThreadedMatrixMatrix(pool, max_threads, scratch, lhs, rhs, 1,
+                                        n, k, result, n, operation);
+  }
+}
+
+std::int32_t gemv_f_scratch(std::int32_t m, std::int32_t n, std::int32_t k,
+                            std::int32_t max_threads) {
+  return internal::ResolveMaxThreads(max_threads) *
+         internal::GemvFloatOperation::ScratchPerThread(m, n, k);
+}
+
+void multi_thread_gemv_f(gemmlowp::WorkersPool* pool, std::int32_t max_threads,
+                         std::uint8_t* scratch, const std::uint8_t* lhs,
+                         const std::uint8_t* rhs, std::int32_t n,
+                         std::int32_t k, std::int32_t lhs_offset,
+                         std::int32_t rhs_offset, float result_offset,
+                         float* result) {
+  max_threads = internal::ResolveMaxThreads(max_threads);
+  internal::GemvFloatOperation operation(lhs_offset, rhs_offset, result_offset);
+  if (max_threads == 1) {
+    operation.ExecuteMatrixMatrix(scratch, lhs, rhs, 1, n, k, result, n);
+  } else {
+    internal::MultiThreadedMatrixMatrix(pool, max_threads, scratch, lhs, rhs, 1,
+                                        n, k, result, n, operation);
+  }
+}
+
+std::int32_t gemv_i32_scratch(std::int32_t m, std::int32_t n, std::int32_t k,
+                              std::int32_t max_threads) {
+  return internal::ResolveMaxThreads(max_threads) *
+         internal::GemvInt32Operation::ScratchPerThread(m, n, k);
+}
+
+void multi_thread_gemv_i32(gemmlowp::WorkersPool* pool,
+                           std::int32_t max_threads, std::uint8_t* scratch,
+                           const std::uint8_t* lhs, const std::uint8_t* rhs,
+                           std::int32_t n, std::int32_t k,
+                           std::int32_t lhs_offset, std::int32_t rhs_offset,
+                           std::int32_t* result) {
+  max_threads = internal::ResolveMaxThreads(max_threads);
+  internal::GemvInt32Operation operation(lhs_offset, rhs_offset);
+  if (max_threads == 1) {
+    operation.ExecuteMatrixMatrix(scratch, lhs, rhs, 1, n, k, result, n);
+  } else {
+    internal::MultiThreadedMatrixMatrix(pool, max_threads, scratch, lhs, rhs, 1,
+                                        n, k, result, n, operation);
+  }
+}
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#else
+#warning "Meta gemm fast-path requires GEMMLOWP_NEON_(32|64)!"
+#endif
+
+#endif  // GEMMLOWP_META_LEGACY_MULTI_THREAD_GEMV_H_
diff --git a/meta/legacy_operations_common.h b/meta/legacy_operations_common.h
new file mode 100644
index 0000000..fadfa3b
--- /dev/null
+++ b/meta/legacy_operations_common.h
@@ -0,0 +1,61 @@
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_LEGACY_OPERATIONS_COMMON_H_
+#define GEMMLOWP_META_LEGACY_OPERATIONS_COMMON_H_
+
+class Quantized8BitOperation {
+ public:
+  Quantized8BitOperation(std::int32_t lhs_offset, std::int32_t rhs_offset,
+                         std::int32_t sum_offset, std::int32_t multiplier,
+                         std::int32_t shift)
+      : lhs_offset(lhs_offset),
+        rhs_offset(rhs_offset),
+        sum_offset(sum_offset),
+        multiplier(multiplier),
+        shift(shift) {}
+
+ protected:
+  std::int32_t lhs_offset;
+  std::int32_t rhs_offset;
+  std::int32_t sum_offset;
+  std::int32_t multiplier;
+  std::int32_t shift;
+};
+
+class FloatOperation {
+ public:
+  FloatOperation(std::int32_t lhs_offset, std::int32_t rhs_offset,
+                 float result_offset)
+      : lhs_offset(lhs_offset),
+        rhs_offset(rhs_offset),
+        result_offset(result_offset) {}
+
+ protected:
+  std::int32_t lhs_offset;
+  std::int32_t rhs_offset;
+  float result_offset;
+};
+
+class Int32Operation {
+ public:
+  Int32Operation(std::int32_t lhs_offset, std::int32_t rhs_offset)
+      : lhs_offset(lhs_offset), rhs_offset(rhs_offset) {}
+
+ protected:
+  std::int32_t lhs_offset;
+  std::int32_t rhs_offset;
+};
+
+#endif  // GEMMLOWP_META_LEGACY_OPERATIONS_COMMON_H_
diff --git a/meta/legacy_single_thread_gemm.h b/meta/legacy_single_thread_gemm.h
new file mode 100644
index 0000000..d662e47
--- /dev/null
+++ b/meta/legacy_single_thread_gemm.h
@@ -0,0 +1,299 @@
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_LEGACY_SINGLE_THREAD_GEMM_H_
+#define GEMMLOWP_META_LEGACY_SINGLE_THREAD_GEMM_H_
+
+#include "../internal/common.h"
+
+#ifdef GEMMLOWP_NEON
+
+#include "quantized_mul_kernels.h"
+#include "single_thread_gemm.h"
+#include "streams.h"
+
+namespace gemmlowp {
+namespace meta {
+
+void gemm_q8_strided(std::uint8_t* scratch, const std::uint8_t* lhs,
+                     const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
+                     std::int32_t k, std::int32_t lhs_offset,
+                     std::int32_t rhs_offset, std::int32_t result_offset,
+                     std::int32_t multiplicative_offset, std::int32_t shift,
+                     std::uint8_t* result, std::int32_t result_stride) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_LEGACY_VERBOSE
+  std::cout << "Legacy::GemmQ8." << std::endl;
+#endif
+#endif
+  typedef GemmParams<std::uint8_t, std::uint8_t, RowMajorWithSum,
+                     RowMajorWithSum, QuantizedStaticPreprocessed, RowMajor>
+      Params;
+  Params params;
+
+  params.m = m;
+  params.n = n;
+  params.k = k;
+
+  params.lhs = lhs;
+  params.rhs = rhs;
+  params.result = result;
+  params.scratch = scratch;
+
+  params.left_stream.count = k;
+  params.left_stream.stride = k;
+  params.left_stream.multiplicative_sum_offset = rhs_offset;
+  params.left_stream.additive_sum_offset =
+      result_offset + k * lhs_offset * rhs_offset;
+
+  params.right_stream.count = k;
+  params.right_stream.stride = k;
+  params.right_stream.multiplicative_sum_offset = lhs_offset;
+  params.right_stream.additive_sum_offset = 0;
+
+  params.fused_kernel.kernel.multiplicative_offset = multiplicative_offset;
+  params.fused_kernel.kernel.rounding_offset = (1 << (shift - 1));
+  params.fused_kernel.kernel.shift = -shift;
+  params.fused_kernel.kernel.count = k;
+  params.fused_kernel.output_stream.stride = result_stride;
+
+  Gemm<GemmExecutorPackRHS, Params, 2, 4, 8>(params);
+}
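The stream offsets configured above encode the standard low-precision expansion of the offset-shifted product (see low-precision.md). As a sketch of the intent, with annotations indicating which stream field appears to carry each term:

```
sum_k (lhs[i][k] + lhs_offset) * (rhs[j][k] + rhs_offset)
    = sum_k lhs[i][k] * rhs[j][k]      // raw uint8 dot product (the kernel)
    + rhs_offset * sum_k lhs[i][k]     // left_stream.multiplicative_sum_offset
    + lhs_offset * sum_k rhs[j][k]     // right_stream.multiplicative_sum_offset
    + k * lhs_offset * rhs_offset      // constant term, folded together with
                                       // result_offset into the left stream's
                                       // additive_sum_offset
```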
+
+void gemv_q8(std::uint8_t* scratch, const std::uint8_t* lhs,
+             const std::uint8_t* rhs, std::int32_t n, std::int32_t k,
+             std::int32_t lhs_offset, std::int32_t rhs_offset,
+             std::int32_t result_offset, std::int32_t multiplicative_offset,
+             std::int32_t shift, std::uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_LEGACY_VERBOSE
+  std::cout << "Legacy::GemvQ8." << std::endl;
+#endif
+#endif
+  typedef GemmParams<std::uint8_t, std::uint8_t, RowMajorWithSum,
+                     RowMajorWithSum, QuantizedStaticPreprocessed, RowMajor>
+      Params;
+  Params params;
+
+  params.m = 1;
+  params.n = n;
+  params.k = k;
+
+  params.lhs = lhs;
+  params.rhs = rhs;
+  params.result = result;
+  params.scratch = scratch;
+
+  params.left_stream.count = k;
+  params.left_stream.stride = k;
+  params.left_stream.multiplicative_sum_offset = rhs_offset;
+  params.left_stream.additive_sum_offset =
+      result_offset + k * lhs_offset * rhs_offset;
+
+  params.right_stream.count = k;
+  params.right_stream.stride = k;
+  params.right_stream.multiplicative_sum_offset = lhs_offset;
+  params.right_stream.additive_sum_offset = 0;
+
+  params.fused_kernel.kernel.multiplicative_offset = multiplicative_offset;
+  params.fused_kernel.kernel.rounding_offset = (1 << (shift - 1));
+  params.fused_kernel.kernel.shift = -shift;
+  params.fused_kernel.kernel.count = k;
+  params.fused_kernel.output_stream.stride = n;
+
+  if (k < 1536) {
+    Gemm<GemmExecutorPackLHS, Params, 1, 8, 8>(params);
+  } else {
+    Gemm<GemmExecutorPackLHS, Params, 2, 4, 8>(params);
+  }
+}
+
+void gemm_i32_strided(std::uint8_t* scratch, const std::uint8_t* lhs,
+                      const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
+                      std::int32_t k, std::int32_t lhs_offset,
+                      std::int32_t rhs_offset, std::int32_t* result,
+                      std::int32_t result_stride) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_LEGACY_VERBOSE
+  std::cout << "Legacy::GemmI32." << std::endl;
+#endif
+#endif
+  typedef GemmParams<std::uint8_t, std::int32_t, RowMajorWithSum,
+                     RowMajorWithSum, QuantizedStaticPreprocessedAsInt32,
+                     RowMajor>
+      Params;
+  Params params;
+
+  params.m = m;
+  params.n = n;
+  params.k = k;
+
+  params.lhs = lhs;
+  params.rhs = rhs;
+  params.result = result;
+  params.scratch = scratch;
+
+  params.left_stream.count = k;
+  params.left_stream.stride = k;
+  params.left_stream.multiplicative_sum_offset = rhs_offset;
+  params.left_stream.additive_sum_offset = k * lhs_offset * rhs_offset;
+
+  params.right_stream.count = k;
+  params.right_stream.stride = k;
+  params.right_stream.multiplicative_sum_offset = lhs_offset;
+  params.right_stream.additive_sum_offset = 0;
+
+  params.fused_kernel.kernel.count = k;
+  params.fused_kernel.output_stream.stride = result_stride * 4;
+
+  Gemm<GemmExecutorPackRHS, Params, 2, 4, 8>(params);
+}
+
+void gemv_i32(std::uint8_t* scratch, const std::uint8_t* lhs,
+              const std::uint8_t* rhs, std::int32_t n, std::int32_t k,
+              std::int32_t lhs_offset, std::int32_t rhs_offset,
+              std::int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_LEGACY_VERBOSE
+  std::cout << "Legacy::GemvI32." << std::endl;
+#endif
+#endif
+  typedef GemmParams<std::uint8_t, std::int32_t, RowMajorWithSum,
+                     RowMajorWithSum, QuantizedStaticPreprocessedAsInt32,
+                     RowMajor>
+      Params;
+  Params params;
+
+  params.m = 1;
+  params.n = n;
+  params.k = k;
+
+  params.lhs = lhs;
+  params.rhs = rhs;
+  params.result = result;
+  params.scratch = scratch;
+
+  params.left_stream.count = k;
+  params.left_stream.stride = k;
+  params.left_stream.multiplicative_sum_offset = rhs_offset;
+  params.left_stream.additive_sum_offset = k * lhs_offset * rhs_offset;
+
+  params.right_stream.count = k;
+  params.right_stream.stride = k;
+  params.right_stream.multiplicative_sum_offset = lhs_offset;
+  params.right_stream.additive_sum_offset = 0;
+
+  params.fused_kernel.kernel.count = k;
+  params.fused_kernel.output_stream.stride = 0;
+
+  if (k < 1664) {
+    Gemm<GemmExecutorPackLHS, Params, 1, 8, 8>(params);
+  } else {
+    Gemm<GemmExecutorPackLHS, Params, 1, 6, 8>(params);
+  }
+}
+
+void gemm_f_strided(std::uint8_t* scratch, const std::uint8_t* lhs,
+                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
+                    std::int32_t k, std::int32_t lhs_offset,
+                    std::int32_t rhs_offset, float result_offset, float* result,
+                    std::int32_t result_stride) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_LEGACY_VERBOSE
+  std::cout << "Legacy::GemmF." << std::endl;
+#endif
+#endif
+  typedef GemmParams<std::uint8_t, float, RowMajorWithSum, RowMajorWithSum,
+                     QuantizedStaticPreprocessedAsFloat, RowMajor>
+      Params;
+  Params params;
+
+  params.m = m;
+  params.n = n;
+  params.k = k;
+
+  params.lhs = lhs;
+  params.rhs = rhs;
+  params.result = result;
+  params.scratch = scratch;
+
+  params.left_stream.count = k;
+  params.left_stream.stride = k;
+  params.left_stream.multiplicative_sum_offset = rhs_offset;
+  params.left_stream.additive_sum_offset = k * lhs_offset * rhs_offset;
+
+  params.right_stream.count = k;
+  params.right_stream.stride = k;
+  params.right_stream.multiplicative_sum_offset = lhs_offset;
+  params.right_stream.additive_sum_offset = 0;
+
+  params.fused_kernel.kernel.count = k;
+  params.fused_kernel.kernel.scale = result_offset;
+  params.fused_kernel.output_stream.stride = result_stride * 4;
+
+  Gemm<GemmExecutorPackRHS, Params, 2, 4, 8>(params);
+}
+
+void gemv_f(std::uint8_t* scratch, const std::uint8_t* lhs,
+            const std::uint8_t* rhs, std::int32_t n, std::int32_t k,
+            std::int32_t lhs_offset, std::int32_t rhs_offset,
+            float result_offset, float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_LEGACY_VERBOSE
+  std::cout << "Legacy::GemvF." << std::endl;
+#endif
+#endif
+  typedef GemmParams<std::uint8_t, float, RowMajorWithSum, RowMajorWithSum,
+                     QuantizedStaticPreprocessedAsFloat, RowMajor>
+      Params;
+  Params params;
+
+  params.m = 1;
+  params.n = n;
+  params.k = k;
+
+  params.lhs = lhs;
+  params.rhs = rhs;
+  params.result = result;
+  params.scratch = scratch;
+
+  params.left_stream.count = k;
+  params.left_stream.stride = k;
+  params.left_stream.multiplicative_sum_offset = rhs_offset;
+  params.left_stream.additive_sum_offset = k * lhs_offset * rhs_offset;
+
+  params.right_stream.count = k;
+  params.right_stream.stride = k;
+  params.right_stream.multiplicative_sum_offset = lhs_offset;
+  params.right_stream.additive_sum_offset = 0;
+
+  params.fused_kernel.kernel.count = k;
+  params.fused_kernel.kernel.scale = result_offset;
+  params.fused_kernel.output_stream.stride = 0;
+
+  if (k < 1664) {
+    Gemm<GemmExecutorPackLHS, Params, 1, 8, 8>(params);
+  } else {
+    Gemm<GemmExecutorPackLHS, Params, 1, 6, 8>(params);
+  }
+}
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#else
+#warning "Meta gemm fast-path requires GEMMLOWP_NEON_(32|64)!"
+#endif
+
+#endif  // GEMMLOWP_META_LEGACY_SINGLE_THREAD_GEMM_H_
diff --git a/meta/multi_thread_common.h b/meta/multi_thread_common.h
index 78e99dc..0b35759 100644
--- a/meta/multi_thread_common.h
+++ b/meta/multi_thread_common.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -12,9 +12,6 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 
-// multi_thread_common.h: Multithreading code shared by different meta gemm
-// versions.
-
 #ifndef GEMMLOWP_META_MULTI_THREAD_COMMON_H_
 #define GEMMLOWP_META_MULTI_THREAD_COMMON_H_
 
@@ -22,56 +19,8 @@
 
 namespace gemmlowp {
 namespace meta {
-namespace internal {
 
-const std::int32_t kMinTaskSize = 10000;
-const std::int32_t kMinTaskDimension = 6;
-
-struct TaskRect {
-  std::int32_t m_offset;
-  std::int32_t m;
-  std::int32_t n_offset;
-  std::int32_t n;
-
-  TaskRect(std::int32_t m_offset, std::int32_t m, std::int32_t n_offset,
-           std::int32_t n)
-      : m_offset(m_offset), m(m), n_offset(n_offset), n(n) {}
-};
-
-template <typename IN_TYPE, typename OUT_TYPE, typename F>
-struct MetaTask : gemmlowp::Task {
-  std::uint8_t* scratch;
-  const IN_TYPE* lhs;
-  const IN_TYPE* rhs;
-  TaskRect task_rect;
-  std::int32_t k;
-  OUT_TYPE* result;
-  std::int32_t result_stride;
-  const F& operation;
-
-  MetaTask(std::uint8_t* scratch, const IN_TYPE* lhs, const IN_TYPE* rhs,
-           const TaskRect& task_rect, std::int32_t k, OUT_TYPE* result,
-           std::int32_t result_stride, const F& operation)
-      : scratch(scratch),
-        lhs(lhs),
-        rhs(rhs),
-        task_rect(task_rect),
-        k(k),
-        result(result),
-        result_stride(result_stride),
-        operation(operation) {}
-
-  void Run() const override {
-    const IN_TYPE* task_lhs = lhs + task_rect.m_offset * k;
-    const IN_TYPE* task_rhs = rhs + task_rect.n_offset * k;
-    OUT_TYPE* task_result =
-        result + task_rect.m_offset * result_stride + task_rect.n_offset;
-    operation.ExecuteMatrixMatrix(scratch, task_lhs, task_rhs, task_rect.m,
-                                  task_rect.n, k, task_result, result_stride);
-  }
-};
-
-std::int32_t ResolveMaxThreads(std::int32_t max_threads) {
+inline int ResolveMaxThreads(int max_threads) {
   if (max_threads == 0) {
     static const int hardware_threads_count =
         static_cast<int>(sysconf(_SC_NPROCESSORS_CONF));
@@ -80,83 +29,21 @@
   return max_threads;
 }
 
-void PrepareTasks(std::int32_t max_tasks, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::vector<internal::TaskRect>* tasks) {
-  const std::int32_t max_tasks_by_size = (m * n * k) / kMinTaskSize;
-  const std::int32_t max_tasks_m = m / kMinTaskDimension;
-  const std::int32_t max_tasks_n = n / kMinTaskDimension;
-  const std::int32_t max_tasks_dimension = std::max(max_tasks_m, max_tasks_n);
+template <typename WorkersPool>
+class SimpleContext {
+ public:
+  SimpleContext(int max_num_threads, WorkersPool* pool)
+      : max_num_threads_(max_num_threads), pool_(pool) {}
 
-  std::int32_t real_tasks = std::max(
-      1, std::min(max_tasks, std::min(max_tasks_by_size, max_tasks_dimension)));
+  WorkersPool* workers_pool() { return pool_; }
 
-  if (real_tasks == 1) {
-    tasks->push_back(TaskRect(0, m, 0, n));
-    return;
-  }
+  int max_num_threads() { return max_num_threads_; }
 
-  if (max_tasks_m > max_tasks_n) {
-    const std::int32_t m_chunk = m / real_tasks;
-    for (int i = 0; i < real_tasks - 1; ++i) {
-      tasks->push_back(TaskRect(i * m_chunk, m_chunk, 0, n));
-    }
-    const std::int32_t last_m_offset = (real_tasks - 1) * m_chunk;
-    tasks->push_back(TaskRect(last_m_offset, m - last_m_offset, 0, n));
-  } else {
-    const std::int32_t n_chunk = n / real_tasks;
-    for (int i = 0; i < real_tasks - 1; ++i) {
-      tasks->push_back(TaskRect(0, m, i * n_chunk, n_chunk));
-    }
-    const std::int32_t last_n_offset = (real_tasks - 1) * n_chunk;
-    tasks->push_back(TaskRect(0, m, last_n_offset, n - last_n_offset));
-  }
-}
+ private:
+  int max_num_threads_;
+  WorkersPool* pool_;
+};
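+
+// A minimal usage sketch (hypothetical names; assumes a default-constructed
+// gemmlowp::WorkersPool is available):
+//
+//   gemmlowp::WorkersPool pool;
+//   gemmlowp::meta::SimpleContext<gemmlowp::WorkersPool> context(4, &pool);
+//   // context.max_num_threads() and context.workers_pool() are then queried
+//   // by MultiThreadGemm() and MultiThreadTransform1D().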
 
-template <typename IN_TYPE, typename OUT_TYPE, typename F>
-void MultiThreadedMatrixMatrix(gemmlowp::WorkersPool* pool,
-                               std::int32_t max_threads, std::uint8_t* scratch,
-                               const IN_TYPE* lhs, const IN_TYPE* rhs,
-                               std::int32_t m, std::int32_t n, std::int32_t k,
-                               OUT_TYPE* result, std::int32_t result_stride,
-                               const F& operation) {
-  max_threads = internal::ResolveMaxThreads(max_threads);
-  if (max_threads > 1) {
-    pool->CreateWorkers(max_threads - 1);
-  }
-
-  std::vector<internal::TaskRect> task_rects;
-  internal::PrepareTasks(max_threads, m, n, k, &task_rects);
-
-  if (task_rects.size() == 1) {
-    operation.ExecuteMatrixMatrix(scratch, lhs, rhs, m, n, k, result,
-                                  result_stride);
-    return;
-  }
-
-  std::uint8_t* task_scratch = scratch;
-  std::int32_t scratch_per_thread = operation.ScratchPerThread(m, n, k);
-  std::int32_t worker_tasks = task_rects.size() - 1;
-  pool->counter_to_decrement_when_ready().Reset(worker_tasks);
-
-  for (std::int32_t i = 0; i < worker_tasks; ++i) {
-    auto task = new internal::MetaTask<IN_TYPE, OUT_TYPE, F>(
-        task_scratch, lhs, rhs, task_rects[i], k, result, result_stride,
-        operation);
-    pool->StartWorker(i, task);
-    task_scratch += scratch_per_thread;
-  }
-
-  {
-    internal::MetaTask<IN_TYPE, OUT_TYPE, F> master_task(
-        task_scratch, lhs, rhs, task_rects.back(), k, result, result_stride,
-        operation);
-    master_task.Run();
-  }
-
-  pool->counter_to_decrement_when_ready().Wait();
-}
-
-}  // namespace internal
 }  // namespace meta
 }  // namespace gemmlowp
 
diff --git a/meta/multi_thread_gemm.h b/meta/multi_thread_gemm.h
index 6162d51..bc569e8 100644
--- a/meta/multi_thread_gemm.h
+++ b/meta/multi_thread_gemm.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -12,14 +12,9 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 
-// multi_thread_gemm.h: Entry point to the multithreaded version of the
-// generated (meta) gemm library.
-
 #ifndef GEMMLOWP_META_MULTI_THREAD_GEMM_H_
 #define GEMMLOWP_META_MULTI_THREAD_GEMM_H_
 
-#ifdef GEMMLOWP_NEON_32
-
 #include "multi_thread_common.h"
 #include "single_thread_gemm.h"
 
@@ -27,206 +22,123 @@
 namespace meta {
 namespace internal {
 
-const std::int32_t kMaxCacheFriendlySize = 24 * 1024;
+const std::int32_t kMinGemmTaskSize = 16000;
+const std::int32_t kMinGemmTaskDimension = 4;
 
-template <typename IN_TYPE, typename OUT_TYPE, typename F>
-void CacheFriendlyMatrixMatrix(std::uint8_t* scratch, const IN_TYPE* lhs,
-                               const IN_TYPE* rhs, std::int32_t m,
-                               std::int32_t n, std::int32_t k, OUT_TYPE* result,
-                               std::int32_t result_stride, const F& operation) {
-  const std::int32_t rhs_size = n * k * sizeof(IN_TYPE);
-  if (rhs_size > kMaxCacheFriendlySize) {
-    const std::int32_t optimal_n =
-        std::max(1, 3 * (kMaxCacheFriendlySize / (k * 3)));
-    const std::int32_t chunks_count_less_one = n / optimal_n - 1;
-    const std::int32_t chunk_size = optimal_n * k;
-    for (int i = 0; i < chunks_count_less_one; ++i) {
-      operation.ExecuteCacheFriendlyMatrixMatrix(
-          scratch, lhs, rhs + i * chunk_size, m, optimal_n, k,
-          result + i * optimal_n, result_stride);
-    }
-    const std::int32_t n_left = n - chunks_count_less_one * optimal_n;
-    operation.ExecuteCacheFriendlyMatrixMatrix(
-        scratch, lhs, rhs + chunks_count_less_one * chunk_size, m, n_left, k,
-        result + chunks_count_less_one * optimal_n, result_stride);
-  } else {
-    operation.ExecuteCacheFriendlyMatrixMatrix(scratch, lhs, rhs, m, n, k,
-                                               result, result_stride);
-  }
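+// Appends to |tasks| a copy of |params| restricted to the output block at
+// rows [m_start, m_start + m) and columns [n_start, n_start + n), gives it
+// the supplied scratch pointer, and returns that pointer advanced by the
+// task's estimated scratch requirement.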
+template <typename Executor, typename Params>
+std::uint8_t* PrepareGemmTask(const Params& params, int kernel_m, int kernel_n,
+                              int kernel_k, std::uint8_t* scratch, int m_start,
+                              int m, int n_start, int n,
+                              std::vector<Params>* tasks) {
+  tasks->push_back(params);
+  Params& task = tasks->back();
+  task.scratch = scratch;
+
+  task.m = m;
+  task.lhs =
+      StreamUtil<typename Params::InType, typename Params::LeftStream>::Offset(
+          params.left_stream, params.lhs, m_start, 0);
+
+  task.n = n;
+  task.rhs =
+      StreamUtil<typename Params::InType, typename Params::RightStream>::Offset(
+          params.right_stream, params.rhs, n_start, 0);
+
+  task.result =
+      StreamUtil<typename Params::OutType, typename Params::OutputStream>::
+          Offset(params.fused_kernel.output_stream, params.result, m_start,
+                 n_start);
+
+  return scratch + Executor::template EstimateScratchSize<Params>(
+                       task, kernel_m, kernel_n, kernel_k);
 }
 
-class GemmQuantized8BitOperation {
- public:
-  GemmQuantized8BitOperation(std::int32_t lhs_offset, std::int32_t rhs_offset,
-                             std::int32_t sum_offset, std::int32_t multiplier,
-                             std::int32_t shift)
-      : lhs_offset(lhs_offset),
-        rhs_offset(rhs_offset),
-        sum_offset(sum_offset),
-        multiplier(multiplier),
-        shift(shift) {}
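+// Decides how to split the GEMM described by |params| into parallel tasks.
+// The task count is bounded by the resolved thread count, by the total
+// multiply-accumulate count (kMinGemmTaskSize per task) and by the output
+// dimensions (kMinGemmTaskDimension); the split runs along the larger of the
+// m/n dimensions. Returns false when a single task is preferable, in which
+// case the caller should fall back to a single-threaded Gemm.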
+template <typename MultiThreadingContext, typename Executor, typename Params>
+bool PrepareGemmTasks(MultiThreadingContext* context, const Params& params,
+                      int kernel_m, int kernel_n, int kernel_k,
+                      std::vector<Params>* task_params) {
+  const int max_threads = ResolveMaxThreads(context->max_num_threads());
+  const int max_tasks_by_size =
+      (params.m * params.n * params.k) / kMinGemmTaskSize;
+  const int max_tasks_m = params.m / kMinGemmTaskDimension;
+  const int max_tasks_n = params.n / kMinGemmTaskDimension;
+  const int max_tasks_dimension = std::max(max_tasks_m, max_tasks_n);
 
-  void ExecuteMatrixMatrix(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k, std::uint8_t* result,
-                           std::int32_t result_stride) const {
-    CacheFriendlyMatrixMatrix(scratch, lhs, rhs, m, n, k, result, result_stride,
-                              *this);
+  const int real_tasks = std::max(
+      1,
+      std::min(max_threads, std::min(max_tasks_by_size, max_tasks_dimension)));
+
+  if (real_tasks == 1) {
+    return false;
   }
 
-  void ExecuteCacheFriendlyMatrixMatrix(std::uint8_t* scratch,
-                                        const std::uint8_t* lhs,
-                                        const std::uint8_t* rhs, std::int32_t m,
-                                        std::int32_t n, std::int32_t k,
-                                        std::uint8_t* result,
-                                        std::int32_t result_stride) const {
-    gemm_q8_strided(scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    sum_offset, multiplier, shift, result, result_stride);
+  std::uint8_t* scratch = params.scratch;
+
+  if (max_tasks_m > max_tasks_n) {
+    const int m_chunk = params.m / real_tasks;
+    for (int i = 0; i < real_tasks - 1; ++i) {
+      scratch = PrepareGemmTask<Executor, Params>(
+          params, kernel_m, kernel_n, kernel_k, scratch, i * m_chunk, m_chunk,
+          0, params.n, task_params);
+    }
+    const int sum_m = (real_tasks - 1) * m_chunk;
+    PrepareGemmTask<Executor, Params>(params, kernel_m, kernel_n, kernel_k,
+                                      scratch, sum_m, params.m - sum_m, 0,
+                                      params.n, task_params);
+  } else {
+    const int n_chunk = params.n / real_tasks;
+    for (int i = 0; i < real_tasks - 1; ++i) {
+      scratch = PrepareGemmTask<Executor, Params>(
+          params, kernel_m, kernel_n, kernel_k, scratch, 0, params.m,
+          i * n_chunk, n_chunk, task_params);
+    }
+    const int sum_n = (real_tasks - 1) * n_chunk;
+    PrepareGemmTask<Executor, Params>(params, kernel_m, kernel_n, kernel_k,
+                                      scratch, 0, params.m, sum_n,
+                                      params.n - sum_n, task_params);
   }
 
-  static std::int32_t ScratchPerThread(std::int32_t m, std::int32_t n,
-                                       std::int32_t k) {
-    return 128 * 1024;
+  return true;
+}
+
+template <typename Executor, typename Params, int kernel_m, int kernel_n,
+          int kernel_k>
+struct GemmTaskRunner : gemmlowp::Task {
+  GemmTaskRunner(const Params& params) : params(params) {}
+
+  void Run() override {
+    Gemm<Executor, Params, kernel_m, kernel_n, kernel_k>(params);
   }
 
- private:
-  std::int32_t lhs_offset;
-  std::int32_t rhs_offset;
-  std::int32_t sum_offset;
-  std::int32_t multiplier;
-  std::int32_t shift;
-};
-
-class GemmFloatOperation {
- public:
-  GemmFloatOperation(std::int32_t lhs_offset, std::int32_t rhs_offset,
-                     float result_offset)
-      : lhs_offset(lhs_offset),
-        rhs_offset(rhs_offset),
-        result_offset(result_offset) {}
-
-  void ExecuteMatrixMatrix(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k, float* result,
-                           std::int32_t result_stride) const {
-    CacheFriendlyMatrixMatrix(scratch, lhs, rhs, m, n, k, result, result_stride,
-                              *this);
-  }
-
-  void ExecuteCacheFriendlyMatrixMatrix(std::uint8_t* scratch,
-                                        const std::uint8_t* lhs,
-                                        const std::uint8_t* rhs, std::int32_t m,
-                                        std::int32_t n, std::int32_t k,
-                                        float* result,
-                                        std::int32_t result_stride) const {
-    gemm_f_strided(scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                   result_offset, result, result_stride);
-  }
-
-  static std::int32_t ScratchPerThread(std::int32_t m, std::int32_t n,
-                                       std::int32_t k) {
-    return 128 * 1024;
-  }
-
- private:
-  std::int32_t lhs_offset;
-  std::int32_t rhs_offset;
-  float result_offset;
-};
-
-class GemmInt32Operation {
- public:
-  GemmInt32Operation(std::int32_t lhs_offset, std::int32_t rhs_offset)
-      : lhs_offset(lhs_offset), rhs_offset(rhs_offset) {}
-
-  void ExecuteMatrixMatrix(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k, std::int32_t* result,
-                           std::int32_t result_stride) const {
-    CacheFriendlyMatrixMatrix(scratch, lhs, rhs, m, n, k, result, result_stride,
-                              *this);
-  }
-
-  void ExecuteCacheFriendlyMatrixMatrix(std::uint8_t* scratch,
-                                        const std::uint8_t* lhs,
-                                        const std::uint8_t* rhs, std::int32_t m,
-                                        std::int32_t n, std::int32_t k,
-                                        std::int32_t* result,
-                                        std::int32_t result_stride) const {
-    gemm_i32_strided(scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset, result,
-                     result_stride);
-  }
-
-  static std::int32_t ScratchPerThread(std::int32_t m, std::int32_t n,
-                                       std::int32_t k) {
-    return 128 * 1024;
-  }
-
- private:
-  std::int32_t lhs_offset;
-  std::int32_t rhs_offset;
+  Params params;
 };
 
 }  // namespace internal
 
-std::int32_t gemm_q8_scratch(std::int32_t m, std::int32_t n, std::int32_t k,
-                             std::int32_t max_threads) {
-  return internal::ResolveMaxThreads(max_threads) *
-         internal::GemmQuantized8BitOperation::ScratchPerThread(m, n, k);
-}
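+// Multi-threaded GEMM entry point: splits the work via PrepareGemmTasks() and
+// dispatches the resulting tasks to the context's workers pool, falling back
+// to a single-threaded Gemm() call when splitting is not worthwhile.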
+template <typename MultiThreadingContext, typename Executor, typename Params,
+          int kernel_m, int kernel_n, int kernel_k>
+inline void MultiThreadGemm(MultiThreadingContext* context,
+                            const Params& params) {
+  typedef internal::GemmTaskRunner<Executor, Params, kernel_m, kernel_n,
+                                   kernel_k>
+      TaskRunnerType;
 
-void multi_thread_gemm_q8(gemmlowp::WorkersPool* pool, std::int32_t max_threads,
-                          std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          std::int32_t sum_offset, std::int32_t multiplier,
-                          std::int32_t shift, std::uint8_t* result) {
-  internal::GemmQuantized8BitOperation operation(lhs_offset, rhs_offset,
-                                                 sum_offset, multiplier, shift);
-  internal::MultiThreadedMatrixMatrix(pool, max_threads, scratch, lhs, rhs, m,
-                                      n, k, result, n, operation);
-}
+  std::vector<Params> task_params;
+  if (!internal::PrepareGemmTasks<MultiThreadingContext, Executor, Params>(
+          context, params, kernel_m, kernel_n, kernel_k, &task_params)) {
+    Gemm<Executor, Params, kernel_m, kernel_n, kernel_k>(params);
+    return;
+  }
 
-std::int32_t gemm_f_scratch(std::int32_t m, std::int32_t n, std::int32_t k,
-                            std::int32_t max_threads) {
-  return internal::ResolveMaxThreads(max_threads) *
-         internal::GemmFloatOperation::ScratchPerThread(m, n, k);
-}
-
-void multi_thread_gemm_f(gemmlowp::WorkersPool* pool, std::int32_t max_threads,
-                         std::uint8_t* scratch, const std::uint8_t* lhs,
-                         const std::uint8_t* rhs, std::int32_t m,
-                         std::int32_t n, std::int32_t k,
-                         std::int32_t lhs_offset, std::int32_t rhs_offset,
-                         float result_offset, float* result) {
-  internal::GemmFloatOperation operation(lhs_offset, rhs_offset, result_offset);
-  internal::MultiThreadedMatrixMatrix(pool, max_threads, scratch, lhs, rhs, m,
-                                      n, k, result, n, operation);
-}
-
-std::int32_t gemm_i32_scratch(std::int32_t m, std::int32_t n, std::int32_t k,
-                              std::int32_t max_threads) {
-  return internal::ResolveMaxThreads(max_threads) *
-         internal::GemmInt32Operation::ScratchPerThread(m, n, k);
-}
-
-void multi_thread_gemm_i32(gemmlowp::WorkersPool* pool,
-                           std::int32_t max_threads, std::uint8_t* scratch,
-                           const std::uint8_t* lhs, const std::uint8_t* rhs,
-                           std::int32_t m, std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t* result) {
-  internal::GemmInt32Operation operation(lhs_offset, rhs_offset);
-  internal::MultiThreadedMatrixMatrix(pool, max_threads, scratch, lhs, rhs, m,
-                                      n, k, result, n, operation);
+  auto workers_pool = context->workers_pool();
+  std::vector<Task*> tasks;
+  std::for_each(task_params.begin(), task_params.end(), [&tasks](Params& param) {
+    tasks.push_back(new TaskRunnerType(param));
+  });
+  workers_pool->Execute(tasks);
 }
 
 }  // namespace meta
 }  // namespace gemmlowp
 
-#else
-#warning "Meta gemm fast-path requires GEMMLOWP_NEON_32!"
-#endif
-
 #endif  // GEMMLOWP_META_MULTI_THREAD_GEMM_H_
diff --git a/meta/multi_thread_gemv.h b/meta/multi_thread_gemv.h
new file mode 100644
index 0000000..2e25ea8
--- /dev/null
+++ b/meta/multi_thread_gemv.h
@@ -0,0 +1,168 @@
+// Copyright 2015 Google Inc. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// multi_thread_gemv.h: Entry point to the multithreaded version of the
+// generated (meta) gemv library.
+
+#ifndef GEMMLOWP_META_MULTI_THREAD_GEMV_H_
+#define GEMMLOWP_META_MULTI_THREAD_GEMV_H_
+
+#ifdef GEMMLOWP_NEON_32
+
+#include "multi_thread_common.h"
+#include "operations_common.h"
+#include "single_thread_gemm.h"
+
+namespace gemmlowp {
+namespace meta {
+namespace internal {
+
+class GemvQuantized8BitOperation : public Quantized8BitOperation {
+ public:
+  GemvQuantized8BitOperation(std::int32_t lhs_offset, std::int32_t rhs_offset,
+                             std::int32_t sum_offset, std::int32_t multiplier,
+                             std::int32_t shift)
+      : Quantized8BitOperation(lhs_offset, rhs_offset, sum_offset, multiplier,
+                               shift) {}
+
+  void ExecuteMatrixMatrix(std::uint8_t* scratch, const std::uint8_t* lhs,
+                           const std::uint8_t* rhs, std::int32_t m,
+                           std::int32_t n, std::int32_t k, std::uint8_t* result,
+                           std::int32_t result_stride) const {
+    gemv_q8(scratch, lhs, rhs, n, k, lhs_offset, rhs_offset, sum_offset,
+            multiplier, shift, result);
+  }
+
+  static std::int32_t ScratchPerThread(std::int32_t m, std::int32_t n,
+                                       std::int32_t k) {
+    return 128 * 1024;
+  }
+};
+
+class GemvFloatOperation : public FloatOperation {
+ public:
+  GemvFloatOperation(std::int32_t lhs_offset, std::int32_t rhs_offset,
+                     float result_offset)
+      : FloatOperation(lhs_offset, rhs_offset, result_offset) {}
+
+  void ExecuteMatrixMatrix(std::uint8_t* scratch, const std::uint8_t* lhs,
+                           const std::uint8_t* rhs, std::int32_t m,
+                           std::int32_t n, std::int32_t k, float* result,
+                           std::int32_t result_stride) const {
+    gemv_f(scratch, lhs, rhs, n, k, lhs_offset, rhs_offset, result_offset,
+           result);
+  }
+
+  static std::int32_t ScratchPerThread(std::int32_t m, std::int32_t n,
+                                       std::int32_t k) {
+    return 128 * 1024;
+  }
+};
+
+class GemvInt32Operation : public Int32Operation {
+ public:
+  GemvInt32Operation(std::int32_t lhs_offset, std::int32_t rhs_offset)
+      : Int32Operation(lhs_offset, rhs_offset) {}
+
+  void ExecuteMatrixMatrix(std::uint8_t* scratch, const std::uint8_t* lhs,
+                           const std::uint8_t* rhs, std::int32_t m,
+                           std::int32_t n, std::int32_t k, std::int32_t* result,
+                           std::int32_t result_stride) const {
+    gemv_i32(scratch, lhs, rhs, n, k, lhs_offset, rhs_offset, result);
+  }
+
+  static std::int32_t ScratchPerThread(std::int32_t m, std::int32_t n,
+                                       std::int32_t k) {
+    return 128 * 1024;
+  }
+};
+
+}  // namespace internal
+
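+// Public entry points. The *_scratch() helpers return the scratch buffer size
+// required for the requested thread count; the multi_thread_gemv_*() routines
+// treat the GEMV as a 1xN GEMM and either execute it directly or hand it to
+// the shared multi-threaded driver.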
+std::int32_t gemv_q8_scratch(std::int32_t m, std::int32_t n, std::int32_t k,
+                             std::int32_t max_threads) {
+  return internal::ResolveMaxThreads(max_threads) *
+         internal::GemvQuantized8BitOperation::ScratchPerThread(m, n, k);
+}
+
+void multi_thread_gemv_q8(gemmlowp::WorkersPool* pool, std::int32_t max_threads,
+                          std::uint8_t* scratch, const std::uint8_t* lhs,
+                          const std::uint8_t* rhs, std::int32_t n,
+                          std::int32_t k, std::int32_t lhs_offset,
+                          std::int32_t rhs_offset, std::int32_t sum_offset,
+                          std::int32_t multiplier, std::int32_t shift,
+                          std::uint8_t* result) {
+  max_threads = internal::ResolveMaxThreads(max_threads);
+  internal::GemvQuantized8BitOperation operation(lhs_offset, rhs_offset,
+                                                 sum_offset, multiplier, shift);
+  if (max_threads == 1) {
+    operation.ExecuteMatrixMatrix(scratch, lhs, rhs, 1, n, k, result, n);
+  } else {
+    internal::MultiThreadedMatrixMatrix(pool, max_threads, scratch, lhs, rhs, 1,
+                                        n, k, result, n, operation);
+  }
+}
+
+std::int32_t gemv_f_scratch(std::int32_t m, std::int32_t n, std::int32_t k,
+                            std::int32_t max_threads) {
+  return internal::ResolveMaxThreads(max_threads) *
+         internal::GemvFloatOperation::ScratchPerThread(m, n, k);
+}
+
+void multi_thread_gemv_f(gemmlowp::WorkersPool* pool, std::int32_t max_threads,
+                         std::uint8_t* scratch, const std::uint8_t* lhs,
+                         const std::uint8_t* rhs, std::int32_t n,
+                         std::int32_t k, std::int32_t lhs_offset,
+                         std::int32_t rhs_offset, float result_offset,
+                         float* result) {
+  max_threads = internal::ResolveMaxThreads(max_threads);
+  internal::GemvFloatOperation operation(lhs_offset, rhs_offset, result_offset);
+  if (max_threads == 1) {
+    operation.ExecuteMatrixMatrix(scratch, lhs, rhs, 1, n, k, result, n);
+  } else {
+    internal::MultiThreadedMatrixMatrix(pool, max_threads, scratch, lhs, rhs, 1,
+                                        n, k, result, n, operation);
+  }
+}
+
+std::int32_t gemv_i32_scratch(std::int32_t m, std::int32_t n, std::int32_t k,
+                              std::int32_t max_threads) {
+  return internal::ResolveMaxThreads(max_threads) *
+         internal::GemvInt32Operation::ScratchPerThread(m, n, k);
+}
+
+void multi_thread_gemv_i32(gemmlowp::WorkersPool* pool,
+                           std::int32_t max_threads, std::uint8_t* scratch,
+                           const std::uint8_t* lhs, const std::uint8_t* rhs,
+                           std::int32_t n, std::int32_t k,
+                           std::int32_t lhs_offset, std::int32_t rhs_offset,
+                           std::int32_t* result) {
+  max_threads = internal::ResolveMaxThreads(max_threads);
+  internal::GemvInt32Operation operation(lhs_offset, rhs_offset);
+  if (max_threads == 1) {
+    operation.ExecuteMatrixMatrix(scratch, lhs, rhs, 1, n, k, result, n);
+  } else {
+    internal::MultiThreadedMatrixMatrix(pool, max_threads, scratch, lhs, rhs, 1,
+                                        n, k, result, n, operation);
+  }
+}
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#else
+#warning "Meta gemm fast-path requires GEMMLOWP_NEON_32!"
+#endif
+
+#endif  // GEMMLOWP_META_MULTI_THREAD_GEMV_H_
diff --git a/meta/multi_thread_transform.h b/meta/multi_thread_transform.h
new file mode 100644
index 0000000..d21aec1
--- /dev/null
+++ b/meta/multi_thread_transform.h
@@ -0,0 +1,98 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_MULTI_THREAD_TRANSFORM_H_
+#define GEMMLOWP_META_MULTI_THREAD_TRANSFORM_H_
+
+#include "multi_thread_common.h"
+#include "single_thread_transform.h"
+
+namespace gemmlowp {
+namespace meta {
+namespace internal {
+
+const int kTransformTaskOverhead = 128000;
+const int kMinTransformTaskSize = 32000;
+
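+// Decides how to split a 1D transform into parallel tasks. The task count is
+// bounded by the resolved thread count and by the kernel's estimated compute
+// cost minus a fixed per-task overhead. Returns false when running the
+// transform as a single task is preferable.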
+template <typename MultiThreadingContext, typename Params>
+inline bool PrepareTransform1DTasks(MultiThreadingContext* context,
+                                    const Params& params, int kernel_size,
+                                    std::vector<Params>* task_params) {
+  typedef Transform1DUtil<typename Params::InType, typename Params::OutType,
+                          typename Params::Kernel>
+      Util;
+
+  const int max_threads = ResolveMaxThreads(context->max_num_threads());
+  const int task_size = Util::EstimateComputeCost(params.kernel);
+  const int max_tasks_by_size =
+      (task_size - kTransformTaskOverhead) / kMinTransformTaskSize;
+
+  const int real_tasks = std::max(1, std::min(max_threads, max_tasks_by_size));
+
+  if (real_tasks == 1) {
+    return false;
+  }
+
+  const int chunk = params.kernel.count / real_tasks;
+  for (int i = 0; i < real_tasks - 1; ++i) {
+    task_params->push_back(params);
+    Params& task = task_params->back();
+    task.kernel.count = chunk;
+    task.input = Util::OffsetInput(params.kernel, params.input, i * chunk);
+    task.output = Util::OffsetOutput(params.kernel, params.output, i * chunk);
+  }
+  task_params->push_back(params);
+  Params& task = task_params->back();
+  const int sum_chunk = (real_tasks - 1) * chunk;
+  task.kernel.count = params.kernel.count - sum_chunk;
+  task.input = Util::OffsetInput(params.kernel, params.input, sum_chunk);
+  task.output = Util::OffsetOutput(params.kernel, params.output, sum_chunk);
+  return true;
+}
+
+template <typename Params, int kernel_size>
+struct Transform1DTaskRunner : gemmlowp::Task {
+  Transform1DTaskRunner(const Params& params) : params(params) {}
+
+  void Run() override { Transform1D<Params, kernel_size>(params); }
+
+  Params params;
+};
+
+}  // namespace internal
+
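+// Multi-threaded 1D transform entry point: mirrors MultiThreadGemm(), falling
+// back to a single-threaded Transform1D() call when splitting is not
+// worthwhile.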
+template <typename MultiThreadingContext, typename Params, int kernel_size>
+inline void MultiThreadTransform1D(MultiThreadingContext* context,
+                                   const Params& params) {
+  typedef internal::Transform1DTaskRunner<Params, kernel_size> TaskRunnerType;
+
+  std::vector<Params> task_params;
+  if (!internal::PrepareTransform1DTasks<MultiThreadingContext, Params>(
+          context, params, kernel_size, &task_params)) {
+    Transform1D<Params, kernel_size>(params);
+    return;
+  }
+
+  auto workers_pool = context->workers_pool();
+  std::vector<Task*> tasks;
+  std::for_each(task_params.begin(), task_params.end(), [&tasks](Params& param) {
+    tasks.push_back(new TaskRunnerType(param));
+  });
+  workers_pool->Execute(tasks);
+}
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#endif  // GEMMLOWP_META_MULTI_THREAD_TRANSFORM_H_
diff --git a/meta/operations_common.h b/meta/operations_common.h
new file mode 100644
index 0000000..d80375a
--- /dev/null
+++ b/meta/operations_common.h
@@ -0,0 +1,61 @@
+// Copyright 2015 Google Inc. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_OPERATIONS_COMMON_H_
+#define GEMMLOWP_META_OPERATIONS_COMMON_H_
+
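+// Base classes that only hold the quantization parameters (offsets, plus
+// multiplier/shift or the float result offset) that the derived Execute*
+// operations forward to the generated kernels.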
+class Quantized8BitOperation {
+ public:
+  Quantized8BitOperation(std::int32_t lhs_offset, std::int32_t rhs_offset,
+                         std::int32_t sum_offset, std::int32_t multiplier,
+                         std::int32_t shift)
+      : lhs_offset(lhs_offset),
+        rhs_offset(rhs_offset),
+        sum_offset(sum_offset),
+        multiplier(multiplier),
+        shift(shift) {}
+
+ protected:
+  std::int32_t lhs_offset;
+  std::int32_t rhs_offset;
+  std::int32_t sum_offset;
+  std::int32_t multiplier;
+  std::int32_t shift;
+};
+
+class FloatOperation {
+ public:
+  FloatOperation(std::int32_t lhs_offset, std::int32_t rhs_offset,
+                 float result_offset)
+      : lhs_offset(lhs_offset),
+        rhs_offset(rhs_offset),
+        result_offset(result_offset) {}
+
+ protected:
+  std::int32_t lhs_offset;
+  std::int32_t rhs_offset;
+  float result_offset;
+};
+
+class Int32Operation {
+ public:
+  Int32Operation(std::int32_t lhs_offset, std::int32_t rhs_offset)
+      : lhs_offset(lhs_offset), rhs_offset(rhs_offset) {}
+
+ protected:
+  std::int32_t lhs_offset;
+  std::int32_t rhs_offset;
+};
+
+#endif  // GEMMLOWP_META_OPERATIONS_COMMON_H_
diff --git a/meta/quantized_mul_kernels.h b/meta/quantized_mul_kernels.h
new file mode 100644
index 0000000..8b4bebe
--- /dev/null
+++ b/meta/quantized_mul_kernels.h
@@ -0,0 +1,177 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_QUANTIZED_MUL_KERNELS_H_
+#define GEMMLOWP_META_QUANTIZED_MUL_KERNELS_H_
+
+#include <iostream>
+#include <typeinfo>
+
+#include "base.h"
+#include "streams.h"
+
+namespace gemmlowp {
+namespace meta {
+
+struct QuantizedStaticPreprocessed {
+ public:
+  int multiplicative_offset;
+  int rounding_offset;
+  int shift;
+  int count;
+};
+
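+// Fallback for m/n/k tile shapes without a hand-written specialization: in
+// non-DEBUG builds Multiply() reports a fatal error and exits, since the real
+// NEON kernels are the explicit specializations pulled in from
+// quantized_mul_kernels_arm_32.h / quantized_mul_kernels_arm_64.h below.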
+template <typename InType, typename OutType, int m, int n, int k>
+class MulKernel<InType, OutType, QuantizedStaticPreprocessed, RowMajor, m, n,
+                k> {
+ public:
+  typedef FusedKernelParams<QuantizedStaticPreprocessed, RowMajor> FusedKernel;
+
+  static void Multiply(const InType* lhs, const InType*,
+                       const FusedKernel& params, OutType* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "MulQSPR(" << typeid(InType).name() << ", "
+              << typeid(OutType).name() << ")::Multiply() -- " << m << "x" << n
+              << "x" << k << std::endl;
+#endif
+#else
+    if (m != 0 && n != 0) {
+      std::cerr << "FATAL: QuantizedStaticPreprocessed_RowMajor::Multiply not "
+                << "implemented." << std::endl;
+      std::exit(1);
+    }
+#endif
+  }
+
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  static void Debug(const FusedKernel& params) {
+    std::cout << "MulQSPR(" << typeid(InType).name() << ", "
+              << typeid(OutType).name() << ") -- " << m << "x" << n << "x" << k
+              << std::endl;
+    std::cout << "  params:" << std::endl;
+    std::cout << "    kernel.multiplicative_offset: "
+              << params.kernel.multiplicative_offset << std::endl;
+    std::cout << "    kernel.rounding_offset: " << params.kernel.rounding_offset
+              << std::endl;
+    std::cout << "    kernel.shift: " << params.kernel.shift << std::endl;
+    std::cout << "    kernel.count: " << params.kernel.count << std::endl;
+    std::cout << "    output_stream.stride: " << params.output_stream.stride
+              << std::endl;
+  }
+#endif
+#endif
+};
+
+struct QuantizedStaticPreprocessedAsInt32 {
+ public:
+  int count;
+};
+
+template <typename InType, typename OutType, int m, int n, int k>
+class MulKernel<InType, OutType, QuantizedStaticPreprocessedAsInt32, RowMajor,
+                m, n, k> {
+ public:
+  typedef FusedKernelParams<QuantizedStaticPreprocessedAsInt32, RowMajor>
+      FusedKernel;
+
+  static void Multiply(const InType* lhs, const InType*,
+                       const FusedKernel& params, OutType* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "MulQSPI32R(" << typeid(InType).name() << ", "
+              << typeid(OutType).name() << ")::Multiply() -- " << m << "x" << n
+              << "x" << k << std::endl;
+#endif
+#else
+    if (m != 0 && n != 0) {
+      std::cerr << "FATAL: QuantizedStaticPreprocessedAsInt32_RowMajor::"
+                << "Multiply not implemented." << std::endl;
+      std::exit(1);
+    }
+#endif
+  }
+
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  static void Debug(const FusedKernel& params) {
+    std::cout << "MulQSPI32R(" << typeid(InType).name() << ", "
+              << typeid(OutType).name() << ") -- " << m << "x" << n << "x" << k
+              << std::endl;
+    std::cout << "  params:" << std::endl;
+    std::cout << "    kernel.count: " << params.kernel.count << std::endl;
+    std::cout << "    output_stream.stride: " << params.output_stream.stride
+              << std::endl;
+  }
+#endif
+#endif
+};
+
+struct QuantizedStaticPreprocessedAsFloat {
+ public:
+  int count;
+  float scale;
+};
+
+template <typename InType, typename OutType, int m, int n, int k>
+class MulKernel<InType, OutType, QuantizedStaticPreprocessedAsFloat, RowMajor,
+                m, n, k> {
+ public:
+  typedef FusedKernelParams<QuantizedStaticPreprocessedAsFloat, RowMajor>
+      FusedKernel;
+
+  static void Multiply(const InType* lhs, const InType*,
+                       const FusedKernel& params, OutType* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "MulQSPFR(" << typeid(InType).name() << ", "
+              << typeid(OutType).name() << ")::Multiply() -- " << m << "x" << n
+              << "x" << k << std::endl;
+#endif
+#else
+    if (m != 0 && n != 0) {
+      std::cerr << "FATAL: QuantizedStaticPreprocessedAsFloat_RowMajor::"
+                << "Multiply not implemented." << std::endl;
+      std::exit(1);
+    }
+#endif
+  }
+
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  static void Debug(const FusedKernel& params) {
+    std::cout << "MulQSPFR(" << typeid(InType).name() << ", "
+              << typeid(OutType).name() << ") -- " << m << "x" << n << "x" << k
+              << std::endl;
+    std::cout << "  params:" << std::endl;
+    std::cout << "    kernel.count: " << params.kernel.count << std::endl;
+    std::cout << "    kernel.scale: " << params.kernel.scale << std::endl;
+    std::cout << "    output_stream.stride: " << params.output_stream.stride
+              << std::endl;
+  }
+#endif
+#endif
+};
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#ifdef GEMMLOWP_NEON_32
+#include "quantized_mul_kernels_arm_32.h"
+#elif defined(GEMMLOWP_NEON_64)
+#include "quantized_mul_kernels_arm_64.h"
+#endif
+
+#endif  // GEMMLOWP_META_QUANTIZED_MUL_KERNELS_H_
diff --git a/meta/quantized_mul_kernels_arm_32.h b/meta/quantized_mul_kernels_arm_32.h
new file mode 100644
index 0000000..1619c9d
--- /dev/null
+++ b/meta/quantized_mul_kernels_arm_32.h
@@ -0,0 +1,4288 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_QUANTIZED_MUL_KERNELS_ARM_32_H_
+#define GEMMLOWP_META_QUANTIZED_MUL_KERNELS_ARM_32_H_
+
+#ifdef GEMMLOWP_NEON_32
+
+#include <cassert>
+#include <cstdint>
+
+namespace gemmlowp {
+namespace meta {
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 1,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 1, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d2}, [%[lhs]:64]!\n"
+      "vld1.32 {d3}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q2, d3, d2\n"
+      "vpadal.u16 q0, q2\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[multiplicative_offset]\n"
+      "vdup.32 q7, %[rounding_offset]\n"
+      "vdup.32 q8, %[shift]\n"
+      "vdup.32 q4, d8[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d0, d0, d0\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vmul.i32 q0, q0, q6\n"
+      "vadd.i32 q0, q0, q7\n"
+      "vshl.s32 q0, q0, q8\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovun.s16 d0, q0\n"
+
+      // RowMajorOutput::Output
+      "vst1.8 {d0[0]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d8", "d9", "d10", "d11", "d12",
+        "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 2,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 2, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d4}, [%[lhs]:64]!\n"
+      "vld1.32 {d5, d6}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q4, d5, d4\n"
+      "vmull.u8 q5, d6, d4\n"
+      "vpadal.u16 q0, q4\n"
+      "vpadal.u16 q1, q5\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[multiplicative_offset]\n"
+      "vdup.32 q7, %[rounding_offset]\n"
+      "vdup.32 q8, %[shift]\n"
+      "vdup.32 q4, d8[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d0, d0, d2\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vmul.i32 q0, q0, q6\n"
+      "vadd.i32 q0, q0, q7\n"
+      "vshl.s32 q0, q0, q8\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovun.s16 d0, q0\n"
+
+      // RowMajorOutput::Output
+      "vst1.16 {d0[0]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "d10", "d11",
+        "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 3,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 3, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d6}, [%[lhs]:64]!\n"
+      "vld1.32 {d7, d8, d9}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q5, d7, d6\n"
+      "vmull.u8 q6, d8, d6\n"
+      "vmull.u8 q7, d9, d6\n"
+      "vpadal.u16 q0, q5\n"
+      "vpadal.u16 q1, q6\n"
+      "vpadal.u16 q2, q7\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[multiplicative_offset]\n"
+      "vdup.32 q7, %[rounding_offset]\n"
+      "vdup.32 q8, %[shift]\n"
+      "vdup.32 q4, d8[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d4\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vmul.i32 q0, q0, q6\n"
+      "vadd.i32 q0, q0, q7\n"
+      "vshl.s32 q0, q0, q8\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovun.s16 d0, q0\n"
+
+      // RowMajorOutput::Output
+      "vst1.16 {d0[0]}, [%[result]]!\n"
+      "vst1.8 {d0[2]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 4,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 4, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d8}, [%[lhs]:64]!\n"
+      "vld1.32 {d9, d10, d11, d12}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q7, d9, d8\n"
+      "vmull.u8 q8, d10, d8\n"
+      "vmull.u8 q9, d11, d8\n"
+      "vmull.u8 q10, d12, d8\n"
+      "vpadal.u16 q0, q7\n"
+      "vpadal.u16 q1, q8\n"
+      "vpadal.u16 q2, q9\n"
+      "vpadal.u16 q3, q10\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[multiplicative_offset]\n"
+      "vdup.32 q7, %[rounding_offset]\n"
+      "vdup.32 q8, %[shift]\n"
+      "vdup.32 q4, d8[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vmul.i32 q0, q0, q6\n"
+      "vadd.i32 q0, q0, q7\n"
+      "vshl.s32 q0, q0, q8\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovun.s16 d0, q0\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0[0]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 5,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 5, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d10, d11, d12, d13}, [%[rhs]:64]!\n"
+      "vld1.32 {d14}, [%[lhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q8, d10, d14\n"
+      "vmull.u8 q9, d11, d14\n"
+      "vmull.u8 q10, d12, d14\n"
+      "vmull.u8 q11, d13, d14\n"
+      "vld1.32 {d10}, [%[rhs]:64]!\n"
+      "pld [%[rhs], #128]\n"
+      "vpadal.u16 q0, q8\n"
+      "vpadal.u16 q1, q9\n"
+      "vpadal.u16 q2, q10\n"
+      "vpadal.u16 q3, q11\n"
+      "vmull.u8 q8, d10, d14\n"
+      "vpadal.u16 q4, q8\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d10, d11}, [%[lhs]:64]!\n"
+      "vld1.32 {d12, d13, d14, d15}, [%[rhs]:64]!\n"
+      "vdup.32 q8, %[multiplicative_offset]\n"
+      "vdup.32 q9, %[rounding_offset]\n"
+      "vdup.32 q10, %[shift]\n"
+      "vdup.32 q5, d10[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d2, d8, d8\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q5\n"
+      "vadd.s32 q1, q1, q5\n"
+      "vadd.s32 q0, q0, q6\n"
+      "vadd.s32 q1, q1, q7\n"
+      "vmul.i32 q0, q0, q8\n"
+      "vmul.i32 q1, q1, q8\n"
+      "vadd.i32 q0, q0, q9\n"
+      "vadd.i32 q1, q1, q9\n"
+      "vshl.s32 q0, q0, q10\n"
+      "vshl.s32 q1, q1, q10\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovun.s16 d0, q0\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0[0]}, [%[result]]!\n"
+      "vst1.8 {d0[4]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 6,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 6, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d12, d13, d14, d15}, [%[rhs]:64]!\n"
+      "vld1.32 {d16}, [%[lhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q9, d12, d16\n"
+      "vmull.u8 q10, d13, d16\n"
+      "vmull.u8 q11, d14, d16\n"
+      "vmull.u8 q12, d15, d16\n"
+      "vld1.32 {d12, d13}, [%[rhs]:64]!\n"
+      "pld [%[rhs], #128]\n"
+      "vpadal.u16 q0, q9\n"
+      "vpadal.u16 q1, q10\n"
+      "vpadal.u16 q2, q11\n"
+      "vpadal.u16 q3, q12\n"
+      "vmull.u8 q9, d12, d16\n"
+      "vmull.u8 q10, d13, d16\n"
+      "vpadal.u16 q4, q9\n"
+      "vpadal.u16 q5, q10\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d12, d13}, [%[lhs]:64]!\n"
+      "vld1.32 {d14, d15, d16, d17}, [%[rhs]:64]!\n"
+      "vdup.32 q9, %[multiplicative_offset]\n"
+      "vdup.32 q10, %[rounding_offset]\n"
+      "vdup.32 q11, %[shift]\n"
+      "vdup.32 q6, d12[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d2, d8, d10\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q6\n"
+      "vadd.s32 q1, q1, q6\n"
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q1, q1, q8\n"
+      "vmul.i32 q0, q0, q9\n"
+      "vmul.i32 q1, q1, q9\n"
+      "vadd.i32 q0, q0, q10\n"
+      "vadd.i32 q1, q1, q10\n"
+      "vshl.s32 q0, q0, q11\n"
+      "vshl.s32 q1, q1, q11\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovun.s16 d0, q0\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0[0]}, [%[result]]!\n"
+      "vst1.16 {d0[2]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 7,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 7, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+      "vmov.i32 q6, q3\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d14, d15, d16, d17}, [%[rhs]:64]!\n"
+      "vld1.32 {d18}, [%[lhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q10, d14, d18\n"
+      "vmull.u8 q11, d15, d18\n"
+      "vmull.u8 q12, d16, d18\n"
+      "vmull.u8 q13, d17, d18\n"
+      "vld1.32 {d14, d15, d16}, [%[rhs]:64]!\n"
+      "pld [%[rhs], #128]\n"
+      "vpadal.u16 q0, q10\n"
+      "vpadal.u16 q1, q11\n"
+      "vpadal.u16 q2, q12\n"
+      "vpadal.u16 q3, q13\n"
+      "vmull.u8 q10, d14, d18\n"
+      "vmull.u8 q11, d15, d18\n"
+      "vmull.u8 q12, d16, d18\n"
+      "vpadal.u16 q4, q10\n"
+      "vpadal.u16 q5, q11\n"
+      "vpadal.u16 q6, q12\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d14, d15}, [%[lhs]:64]!\n"
+      "vld1.32 {d16, d17, d18, d19}, [%[rhs]:64]!\n"
+      "vdup.32 q10, %[multiplicative_offset]\n"
+      "vdup.32 q11, %[rounding_offset]\n"
+      "vdup.32 q12, %[shift]\n"
+      "vdup.32 q7, d14[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d12, d12, d13\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d2, d8, d10\n"
+      "vpadd.u32 d3, d12, d12\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q1, q1, q7\n"
+      "vadd.s32 q0, q0, q8\n"
+      "vadd.s32 q1, q1, q9\n"
+      "vmul.i32 q0, q0, q10\n"
+      "vmul.i32 q1, q1, q10\n"
+      "vadd.i32 q0, q0, q11\n"
+      "vadd.i32 q1, q1, q11\n"
+      "vshl.s32 q0, q0, q12\n"
+      "vshl.s32 q1, q1, q12\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovun.s16 d0, q0\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0[0]}, [%[result]]!\n"
+      "vst1.16 {d0[2]}, [%[result]]!\n"
+      "vst1.8 {d0[6]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 8,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 8, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+      "vmov.i32 q6, q3\n"
+      "vmov.i32 q7, q4\n"
+
+      // 1x8 lanes loop.
+      "1:"
+
+      "vld1.32 {d17, d18, d19, d20}, [%[rhs]:256]!\n"
+      "vld1.32 {d16}, [%[lhs]:64]!\n"
+      "vmull.u8 q11, d16, d17\n"
+      "vmull.u8 q12, d16, d18\n"
+      "vmull.u8 q13, d16, d19\n"
+      "vmull.u8 q14, d16, d20\n"
+      "vld1.32 {d17, d18, d19, d20}, [%[rhs]:256]!\n"
+      "vpadal.u16 q0, q11\n"
+      "vpadal.u16 q1, q12\n"
+      "vpadal.u16 q2, q13\n"
+      "vpadal.u16 q3, q14\n"
+      "pld [%[rhs], #256]\n"
+      "vmull.u8 q15, d16, d17\n"
+      "vmull.u8 q11, d16, d18\n"
+      "vmull.u8 q12, d16, d19\n"
+      "vmull.u8 q13, d16, d20\n"
+      "pld [%[lhs], #32]\n"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vpadal.u16 q4, q15\n"
+      "vpadal.u16 q5, q11\n"
+      "vpadal.u16 q6, q12\n"
+      "vpadal.u16 q7, q13\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d16, d17}, [%[lhs]:64]!\n"
+      "vld1.32 {d18, d19, d20, d21}, [%[rhs]:64]!\n"
+      "vdup.32 q11, %[multiplicative_offset]\n"
+      "vdup.32 q12, %[rounding_offset]\n"
+      "vdup.32 q13, %[shift]\n"
+      "vdup.32 q8, d16[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d12, d12, d13\n"
+      "vpadd.u32 d14, d14, d15\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d2, d8, d10\n"
+      "vpadd.u32 d3, d12, d14\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q8\n"
+      "vadd.s32 q1, q1, q8\n"
+      "vadd.s32 q0, q0, q9\n"
+      "vadd.s32 q1, q1, q10\n"
+      "vmul.i32 q0, q0, q11\n"
+      "vmul.i32 q1, q1, q11\n"
+      "vadd.i32 q0, q0, q12\n"
+      "vadd.i32 q1, q1, q12\n"
+      "vshl.s32 q0, q0, q13\n"
+      "vshl.s32 q1, q1, q13\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovun.s16 d0, q0\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30",
+        "d31", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 2, 1,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 2, 1, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d4, d5}, [%[lhs]:64]!\n"
+      "vld1.32 {d6}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q4, d6, d4\n"
+      "vmull.u8 q5, d6, d5\n"
+      "vpadal.u16 q0, q4\n"
+      "vpadal.u16 q1, q5\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[multiplicative_offset]\n"
+      "vdup.32 q7, %[rounding_offset]\n"
+      "vdup.32 q8, %[shift]\n"
+      "vdup.32 q2, d8[0]\n"
+      "vdup.32 q4, d8[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d0, d0, d0\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d2, d2, d2\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q2\n"
+      "vadd.s32 q1, q1, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vadd.s32 q1, q1, q5\n"
+      "vmul.i32 q0, q0, q6\n"
+      "vmul.i32 q1, q1, q6\n"
+      "vadd.i32 q0, q0, q7\n"
+      "vadd.i32 q1, q1, q7\n"
+      "vshl.s32 q0, q0, q8\n"
+      "vshl.s32 q1, q1, q8\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d2, q1\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d2, q1\n"
+
+      // RowMajorOutput::Output
+      "vst1.8 {d0[0]}, [%[result]]!\n"
+      "vst1.8 {d2[0]}, [r0]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 2, 2,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 2, 2, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q6, d10, d8\n"
+      "vmull.u8 q7, d11, d8\n"
+      "vmull.u8 q8, d10, d9\n"
+      "vmull.u8 q9, d11, d9\n"
+      "vpadal.u16 q0, q6\n"
+      "vpadal.u16 q1, q7\n"
+      "vpadal.u16 q2, q8\n"
+      "vpadal.u16 q3, q9\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[multiplicative_offset]\n"
+      "vdup.32 q7, %[rounding_offset]\n"
+      "vdup.32 q8, %[shift]\n"
+      "vdup.32 q9, d8[0]\n"
+      "vdup.32 q4, d8[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d4, d4, d6\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q9\n"
+      "vadd.s32 q2, q2, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vadd.s32 q2, q2, q5\n"
+      "vmul.i32 q0, q0, q6\n"
+      "vmul.i32 q2, q2, q6\n"
+      "vadd.i32 q0, q0, q7\n"
+      "vadd.i32 q2, q2, q7\n"
+      "vshl.s32 q0, q0, q8\n"
+      "vshl.s32 q2, q2, q8\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d4, q2\n"
+
+      // RowMajorOutput::Output
+      "vst1.16 {d0[0]}, [%[result]]!\n"
+      "vst1.16 {d4[0]}, [r0]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "cc",
+        "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 2, 3,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 2, 3, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d12, d13}, [%[lhs]:64]!\n"
+      "vld1.32 {d14, d15, d16}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q9, d14, d12\n"
+      "vmull.u8 q10, d15, d12\n"
+      "vmull.u8 q11, d16, d12\n"
+      "vmull.u8 q12, d14, d13\n"
+      "vmull.u8 q13, d15, d13\n"
+      "vmull.u8 q14, d16, d13\n"
+      "vpadal.u16 q0, q9\n"
+      "vpadal.u16 q1, q10\n"
+      "vpadal.u16 q2, q11\n"
+      "vpadal.u16 q3, q12\n"
+      "vpadal.u16 q4, q13\n"
+      "vpadal.u16 q5, q14\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d12, d13}, [%[lhs]:64]!\n"
+      "vld1.32 {d14, d15}, [%[rhs]:64]!\n"
+      "vdup.32 q8, %[multiplicative_offset]\n"
+      "vdup.32 q9, %[rounding_offset]\n"
+      "vdup.32 q10, %[shift]\n"
+      "vdup.32 q11, d12[0]\n"
+      "vdup.32 q6, d12[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d4\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d6, d6, d8\n"
+      "vpadd.u32 d7, d10, d10\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q11\n"
+      "vadd.s32 q3, q3, q6\n"
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q3, q3, q7\n"
+      "vmul.i32 q0, q0, q8\n"
+      "vmul.i32 q3, q3, q8\n"
+      "vadd.i32 q0, q0, q9\n"
+      "vadd.i32 q3, q3, q9\n"
+      "vshl.s32 q0, q0, q10\n"
+      "vshl.s32 q3, q3, q10\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d6, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d6, q3\n"
+
+      // RowMajorOutput::Output
+      "vst1.16 {d0[0]}, [%[result]]!\n"
+      "vst1.8 {d0[2]}, [%[result]]!\n"
+      "vst1.16 {d6[0]}, [r0]!\n"
+      "vst1.8 {d6[2]}, [r0]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "cc",
+        "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 2, 4,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 2, 4, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+      "vmov.i32 q6, q3\n"
+      "vmov.i32 q7, q4\n"
+
+      // 2x4 lanes loop.
+      "1:"
+
+      "vld1.8 {d18, d19, d20, d21}, [%[rhs]:256]!\n"
+      "vld1.8 {d16}, [%[lhs]:64]!\n"
+      "vmull.u8 q11, d16, d18\n"
+      "vld1.8 {d17}, [%[lhs]:64]!\n"
+      "vmull.u8 q12, d16, d19\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q13, d16, d20\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q14, d16, d21\n"
+      "vmull.u8 q15, d17, d18\n"
+      "vpadal.u16 q0, q11\n"
+      "vpadal.u16 q1, q12\n"
+      "vpadal.u16 q2, q13\n"
+      "vmull.u8 q11, d17, d19\n"
+      "vmull.u8 q12, d17, d20\n"
+      "vmull.u8 q13, d17, d21\n"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vpadal.u16 q3, q14\n"
+      "vpadal.u16 q4, q15\n"
+      "vpadal.u16 q5, q11\n"
+      "vpadal.u16 q6, q12\n"
+      "vpadal.u16 q7, q13\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d16, d17}, [%[lhs]:64]!\n"
+      "vld1.32 {d18, d19}, [%[rhs]:64]!\n"
+      "vdup.32 q10, %[multiplicative_offset]\n"
+      "vdup.32 q11, %[rounding_offset]\n"
+      "vdup.32 q12, %[shift]\n"
+      "vdup.32 q13, d16[0]\n"
+      "vdup.32 q8, d16[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d12, d12, d13\n"
+      "vpadd.u32 d14, d14, d15\n"
+      "vpadd.u32 d8, d8, d10\n"
+      "vpadd.u32 d9, d12, d14\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q13\n"
+      "vadd.s32 q4, q4, q8\n"
+      "vadd.s32 q0, q0, q9\n"
+      "vadd.s32 q4, q4, q9\n"
+      "vmul.i32 q0, q0, q10\n"
+      "vmul.i32 q4, q4, q10\n"
+      "vadd.i32 q0, q0, q11\n"
+      "vadd.i32 q4, q4, q11\n"
+      "vshl.s32 q0, q0, q12\n"
+      "vshl.s32 q4, q4, q12\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d8, q4\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d8, q4\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0[0]}, [%[result]]!\n"
+      "vst1.32 {d8[0]}, [r0]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30",
+        "d31", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 3, 1,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 3, 1, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d6, d7, d8}, [%[lhs]:64]!\n"
+      "vld1.32 {d9}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q5, d9, d6\n"
+      "vmull.u8 q6, d9, d7\n"
+      "vmull.u8 q7, d9, d8\n"
+      "vpadal.u16 q0, q5\n"
+      "vpadal.u16 q1, q6\n"
+      "vpadal.u16 q2, q7\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[multiplicative_offset]\n"
+      "vdup.32 q7, %[rounding_offset]\n"
+      "vdup.32 q8, %[shift]\n"
+      "vdup.32 q3, d8[0]\n"
+      "vdup.32 q9, d8[1]\n"
+      "vdup.32 q4, d9[0]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d0, d0, d0\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d2, d2, d2\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d4, d4, d4\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q3\n"
+      "vadd.s32 q1, q1, q9\n"
+      "vadd.s32 q2, q2, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vadd.s32 q1, q1, q5\n"
+      "vadd.s32 q2, q2, q5\n"
+      "vmul.i32 q0, q0, q6\n"
+      "vmul.i32 q1, q1, q6\n"
+      "vmul.i32 q2, q2, q6\n"
+      "vadd.i32 q0, q0, q7\n"
+      "vadd.i32 q1, q1, q7\n"
+      "vadd.i32 q2, q2, q7\n"
+      "vshl.s32 q0, q0, q8\n"
+      "vshl.s32 q1, q1, q8\n"
+      "vshl.s32 q2, q2, q8\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d2, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d2, q1\n"
+      "vqmovun.s16 d4, q2\n"
+
+      // RowMajorOutput::Output
+      "vst1.8 {d0[0]}, [%[result]]!\n"
+      "vst1.8 {d2[0]}, [r0]!\n"
+      "vst1.8 {d4[0]}, [r1]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 3, 2,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 3, 2, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d12, d13, d14}, [%[lhs]:64]!\n"
+      "vld1.32 {d15, d16}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q9, d15, d12\n"
+      "vmull.u8 q10, d16, d12\n"
+      "vmull.u8 q11, d15, d13\n"
+      "vmull.u8 q12, d16, d13\n"
+      "vmull.u8 q13, d15, d14\n"
+      "vmull.u8 q14, d16, d14\n"
+      "vpadal.u16 q0, q9\n"
+      "vpadal.u16 q1, q10\n"
+      "vpadal.u16 q2, q11\n"
+      "vpadal.u16 q3, q12\n"
+      "vpadal.u16 q4, q13\n"
+      "vpadal.u16 q5, q14\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d12, d13}, [%[lhs]:64]!\n"
+      "vld1.32 {d14, d15}, [%[rhs]:64]!\n"
+      "vdup.32 q8, %[multiplicative_offset]\n"
+      "vdup.32 q9, %[rounding_offset]\n"
+      "vdup.32 q10, %[shift]\n"
+      "vdup.32 q11, d12[0]\n"
+      "vdup.32 q12, d12[1]\n"
+      "vdup.32 q6, d13[0]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d4, d4, d6\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d8, d8, d10\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q11\n"
+      "vadd.s32 q2, q2, q12\n"
+      "vadd.s32 q4, q4, q6\n"
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q2, q2, q7\n"
+      "vadd.s32 q4, q4, q7\n"
+      "vmul.i32 q0, q0, q8\n"
+      "vmul.i32 q2, q2, q8\n"
+      "vmul.i32 q4, q4, q8\n"
+      "vadd.i32 q0, q0, q9\n"
+      "vadd.i32 q2, q2, q9\n"
+      "vadd.i32 q4, q4, q9\n"
+      "vshl.s32 q0, q0, q10\n"
+      "vshl.s32 q2, q2, q10\n"
+      "vshl.s32 q4, q4, q10\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d8, q4\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d4, q2\n"
+      "vqmovun.s16 d8, q4\n"
+
+      // RowMajorOutput::Output
+      "vst1.16 {d0[0]}, [%[result]]!\n"
+      "vst1.16 {d4[0]}, [r0]!\n"
+      "vst1.16 {d8[0]}, [r1]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 3, 3,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 3, 3, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+      "vmov.i32 q6, q3\n"
+      "vmov.i32 q7, q4\n"
+      "vmov.i32 q8, q5\n"
+
+      // 3x3 lanes loop.
+      "1:"
+
+      "vld1.8 {d21, d22, d23}, [%[rhs]:64]!\n"
+      "vld1.8 {d18}, [%[lhs]:64]!\n"
+      "vmull.u8 q12, d18, d21\n"
+      "vld1.8 {d19}, [%[lhs]:64]!\n"
+      "vmull.u8 q13, d18, d22\n"
+      "vld1.8 {d20}, [%[lhs]:64]!\n"
+      "vmull.u8 q14, d18, d23\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q15, d19, d21\n"
+      "pld [%[rhs], #64]\n"
+      "vpadal.u16 q0, q12\n"
+      "vpadal.u16 q1, q13\n"
+      "vpadal.u16 q2, q14\n"
+      "vpadal.u16 q3, q15\n"
+      "vmull.u8 q12, d19, d22\n"
+      "vmull.u8 q13, d19, d23\n"
+      "vmull.u8 q14, d20, d21\n"
+      "vmull.u8 q15, d20, d22\n"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vmull.u8 q9, d20, d23\n"
+      "vpadal.u16 q4, q12\n"
+      "vpadal.u16 q5, q13\n"
+      "vpadal.u16 q6, q14\n"
+      "vpadal.u16 q7, q15\n"
+      "vpadal.u16 q8, q9\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "vld1.32 {d18, d19}, [%[lhs]:64]!\n"
+      "vld1.32 {d20, d21}, [%[rhs]:64]!\n"
+      "vdup.32 q11, %[multiplicative_offset]\n"
+      "vdup.32 q12, %[rounding_offset]\n"
+      "vdup.32 q13, %[shift]\n"
+      "vdup.32 q14, d18[0]\n"
+      "vdup.32 q15, d18[1]\n"
+      "vdup.32 q9, d19[0]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d4\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d6, d6, d8\n"
+      "vpadd.u32 d7, d10, d10\n"
+      "vpadd.u32 d12, d12, d13\n"
+      "vpadd.u32 d14, d14, d15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d12, d12, d14\n"
+      "vpadd.u32 d13, d16, d16\n"
+
+      // StaticQuantization::Transform
+      "vadd.s32 q0, q0, q14\n"
+      "vadd.s32 q3, q3, q15\n"
+      "vadd.s32 q6, q6, q9\n"
+      "vadd.s32 q0, q0, q10\n"
+      "vadd.s32 q3, q3, q10\n"
+      "vadd.s32 q6, q6, q10\n"
+      "vmul.i32 q0, q0, q11\n"
+      "vmul.i32 q3, q3, q11\n"
+      "vmul.i32 q6, q6, q11\n"
+      "vadd.i32 q0, q0, q12\n"
+      "vadd.i32 q3, q3, q12\n"
+      "vadd.i32 q6, q6, q12\n"
+      "vshl.s32 q0, q0, q13\n"
+      "vshl.s32 q3, q3, q13\n"
+      "vshl.s32 q6, q6, q13\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d6, q3\n"
+      "vqmovn.s32 d12, q6\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d6, q3\n"
+      "vqmovun.s16 d12, q6\n"
+
+      // RowMajorOutput::Output
+      "vst1.16 {d0[0]}, [%[result]]!\n"
+      "vst1.8 {d0[2]}, [%[result]]!\n"
+      "vst1.16 {d6[0]}, [r0]!\n"
+      "vst1.8 {d6[2]}, [r0]!\n"
+      "vst1.16 {d12[0]}, [r1]!\n"
+      "vst1.8 {d12[2]}, [r1]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "d30", "d31", "cc", "memory");
+}
+
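+// The QuantizedStaticPreprocessedAsInt32 kernels below reuse the same
+// multiply-accumulate (vmull.u8 + vpadal.u16) and horizontal-reduction
+// (vpadd.u32) structure as the uint8 kernels above, but their
+// StaticQuantizationInt32::Transform stage only adds the offset terms loaded
+// from the packed lhs/rhs blocks and stores the raw 32-bit accumulators;
+// there is no multiplicative scaling, rounding, shift, or saturating-narrow
+// stage.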
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 1,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 1, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d2}, [%[lhs]:64]!\n"
+      "vld1.32 {d3}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q2, d3, d2\n"
+      "vpadal.u16 q0, q2\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q4, d8[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d0, d0, d0\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0[0]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d8", "d9", "d10", "d11", "cc",
+        "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 2,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 2, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d4}, [%[lhs]:64]!\n"
+      "vld1.32 {d5, d6}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q4, d5, d4\n"
+      "vmull.u8 q5, d6, d4\n"
+      "vpadal.u16 q0, q4\n"
+      "vpadal.u16 q1, q5\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q4, d8[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d0, d0, d2\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "d10", "d11",
+        "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 3,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 3, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d6}, [%[lhs]:64]!\n"
+      "vld1.32 {d7, d8, d9}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q5, d7, d6\n"
+      "vmull.u8 q6, d8, d6\n"
+      "vmull.u8 q7, d9, d6\n"
+      "vpadal.u16 q0, q5\n"
+      "vpadal.u16 q1, q6\n"
+      "vpadal.u16 q2, q7\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q4, d8[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d4\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      "vst1.32 {d1[0]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 4,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 4, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d8}, [%[lhs]:64]!\n"
+      "vld1.32 {d9, d10, d11, d12}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q7, d9, d8\n"
+      "vmull.u8 q8, d10, d8\n"
+      "vmull.u8 q9, d11, d8\n"
+      "vmull.u8 q10, d12, d8\n"
+      "vpadal.u16 q0, q7\n"
+      "vpadal.u16 q1, q8\n"
+      "vpadal.u16 q2, q9\n"
+      "vpadal.u16 q3, q10\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q4, d8[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0, d1}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d14", "d15", "d16", "d17", "d18", "d19", "d20", "d21",
+        "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 5,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 5, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d10, d11, d12, d13}, [%[rhs]:64]!\n"
+      "vld1.32 {d14}, [%[lhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q8, d10, d14\n"
+      "vmull.u8 q9, d11, d14\n"
+      "vmull.u8 q10, d12, d14\n"
+      "vmull.u8 q11, d13, d14\n"
+      "vld1.32 {d10}, [%[rhs]:64]!\n"
+      "pld [%[rhs], #128]\n"
+      "vpadal.u16 q0, q8\n"
+      "vpadal.u16 q1, q9\n"
+      "vpadal.u16 q2, q10\n"
+      "vpadal.u16 q3, q11\n"
+      "vmull.u8 q8, d10, d14\n"
+      "vpadal.u16 q4, q8\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d10, d11}, [%[lhs]:64]!\n"
+      "vld1.32 {d12, d13, d14, d15}, [%[rhs]:64]!\n"
+      "vdup.32 q5, d10[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d2, d8, d8\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q5\n"
+      "vadd.s32 q1, q1, q5\n"
+      "vadd.s32 q0, q0, q6\n"
+      "vadd.s32 q1, q1, q7\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0, d1}, [%[result]]!\n"
+      "vst1.32 {d2[0]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 6,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 6, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d12, d13, d14, d15}, [%[rhs]:64]!\n"
+      "vld1.32 {d16}, [%[lhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q9, d12, d16\n"
+      "vmull.u8 q10, d13, d16\n"
+      "vmull.u8 q11, d14, d16\n"
+      "vmull.u8 q12, d15, d16\n"
+      "vld1.32 {d12, d13}, [%[rhs]:64]!\n"
+      "pld [%[rhs], #128]\n"
+      "vpadal.u16 q0, q9\n"
+      "vpadal.u16 q1, q10\n"
+      "vpadal.u16 q2, q11\n"
+      "vpadal.u16 q3, q12\n"
+      "vmull.u8 q9, d12, d16\n"
+      "vmull.u8 q10, d13, d16\n"
+      "vpadal.u16 q4, q9\n"
+      "vpadal.u16 q5, q10\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d12, d13}, [%[lhs]:64]!\n"
+      "vld1.32 {d14, d15, d16, d17}, [%[rhs]:64]!\n"
+      "vdup.32 q6, d12[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d2, d8, d10\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q6\n"
+      "vadd.s32 q1, q1, q6\n"
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q1, q1, q8\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0, d1, d2}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 7,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 7, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+      "vmov.i32 q6, q3\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d14, d15, d16, d17}, [%[rhs]:64]!\n"
+      "vld1.32 {d18}, [%[lhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q10, d14, d18\n"
+      "vmull.u8 q11, d15, d18\n"
+      "vmull.u8 q12, d16, d18\n"
+      "vmull.u8 q13, d17, d18\n"
+      "vld1.32 {d14, d15, d16}, [%[rhs]:64]!\n"
+      "pld [%[rhs], #128]\n"
+      "vpadal.u16 q0, q10\n"
+      "vpadal.u16 q1, q11\n"
+      "vpadal.u16 q2, q12\n"
+      "vpadal.u16 q3, q13\n"
+      "vmull.u8 q10, d14, d18\n"
+      "vmull.u8 q11, d15, d18\n"
+      "vmull.u8 q12, d16, d18\n"
+      "vpadal.u16 q4, q10\n"
+      "vpadal.u16 q5, q11\n"
+      "vpadal.u16 q6, q12\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d14, d15}, [%[lhs]:64]!\n"
+      "vld1.32 {d16, d17, d18, d19}, [%[rhs]:64]!\n"
+      "vdup.32 q7, d14[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d12, d12, d13\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d2, d8, d10\n"
+      "vpadd.u32 d3, d12, d12\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q1, q1, q7\n"
+      "vadd.s32 q0, q0, q8\n"
+      "vadd.s32 q1, q1, q9\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0, d1, d2}, [%[result]]!\n"
+      "vst1.32 {d3[0]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 8,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 8, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+      "vmov.i32 q6, q3\n"
+      "vmov.i32 q7, q4\n"
+
+      // 1x8 lanes loop.
+      "1:"
+
+      "vld1.32 {d17, d18, d19, d20}, [%[rhs]:256]!\n"
+      "vld1.32 {d16}, [%[lhs]:64]!\n"
+      "vmull.u8 q11, d16, d17\n"
+      "vmull.u8 q12, d16, d18\n"
+      "vmull.u8 q13, d16, d19\n"
+      "vmull.u8 q14, d16, d20\n"
+      "vld1.32 {d17, d18, d19, d20}, [%[rhs]:256]!\n"
+      "vpadal.u16 q0, q11\n"
+      "vpadal.u16 q1, q12\n"
+      "vpadal.u16 q2, q13\n"
+      "vpadal.u16 q3, q14\n"
+      "pld [%[rhs], #256]\n"
+      "vmull.u8 q15, d16, d17\n"
+      "vmull.u8 q11, d16, d18\n"
+      "vmull.u8 q12, d16, d19\n"
+      "vmull.u8 q13, d16, d20\n"
+      "pld [%[lhs], #32]\n"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vpadal.u16 q4, q15\n"
+      "vpadal.u16 q5, q11\n"
+      "vpadal.u16 q6, q12\n"
+      "vpadal.u16 q7, q13\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d16, d17}, [%[lhs]:64]!\n"
+      "vld1.32 {d18, d19, d20, d21}, [%[rhs]:64]!\n"
+      "vdup.32 q8, d16[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d12, d12, d13\n"
+      "vpadd.u32 d14, d14, d15\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d2, d8, d10\n"
+      "vpadd.u32 d3, d12, d14\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q8\n"
+      "vadd.s32 q1, q1, q8\n"
+      "vadd.s32 q0, q0, q9\n"
+      "vadd.s32 q1, q1, q10\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0, d1, d2, d3}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30",
+        "d31", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 1,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 1, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d4, d5}, [%[lhs]:64]!\n"
+      "vld1.32 {d6}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q4, d6, d4\n"
+      "vmull.u8 q5, d6, d5\n"
+      "vpadal.u16 q0, q4\n"
+      "vpadal.u16 q1, q5\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q2, d8[0]\n"
+      "vdup.32 q4, d8[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d0, d0, d0\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d2, d2, d2\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q2\n"
+      "vadd.s32 q1, q1, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vadd.s32 q1, q1, q5\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0[0]}, [%[result]]!\n"
+      "vst1.32 {d2[0]}, [r0]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "d10",
+        "d11", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 2,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 2, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q6, d10, d8\n"
+      "vmull.u8 q7, d11, d8\n"
+      "vmull.u8 q8, d10, d9\n"
+      "vmull.u8 q9, d11, d9\n"
+      "vpadal.u16 q0, q6\n"
+      "vpadal.u16 q1, q7\n"
+      "vpadal.u16 q2, q8\n"
+      "vpadal.u16 q3, q9\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, d8[0]\n"
+      "vdup.32 q4, d8[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d4, d4, d6\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q6\n"
+      "vadd.s32 q2, q2, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vadd.s32 q2, q2, q5\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      "vst1.32 {d4}, [r0]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "cc",
+        "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 3,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 3, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d12, d13}, [%[lhs]:64]!\n"
+      "vld1.32 {d14, d15, d16}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q9, d14, d12\n"
+      "vmull.u8 q10, d15, d12\n"
+      "vmull.u8 q11, d16, d12\n"
+      "vmull.u8 q12, d14, d13\n"
+      "vmull.u8 q13, d15, d13\n"
+      "vmull.u8 q14, d16, d13\n"
+      "vpadal.u16 q0, q9\n"
+      "vpadal.u16 q1, q10\n"
+      "vpadal.u16 q2, q11\n"
+      "vpadal.u16 q3, q12\n"
+      "vpadal.u16 q4, q13\n"
+      "vpadal.u16 q5, q14\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d12, d13}, [%[lhs]:64]!\n"
+      "vld1.32 {d14, d15}, [%[rhs]:64]!\n"
+      "vdup.32 q8, d12[0]\n"
+      "vdup.32 q6, d12[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d4\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d6, d6, d8\n"
+      "vpadd.u32 d7, d10, d10\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q8\n"
+      "vadd.s32 q3, q3, q6\n"
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q3, q3, q7\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      "vst1.32 {d1[0]}, [%[result]]!\n"
+      "vst1.32 {d6}, [r0]!\n"
+      "vst1.32 {d7[0]}, [r0]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "cc",
+        "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 4,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 4, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+      "vmov.i32 q6, q3\n"
+      "vmov.i32 q7, q4\n"
+
+      // 2x4 lanes loop.
+      "1:"
+
+      "vld1.8 {d18, d19, d20, d21}, [%[rhs]:256]!\n"
+      "vld1.8 {d16}, [%[lhs]:64]!\n"
+      "vmull.u8 q11, d16, d18\n"
+      "vld1.8 {d17}, [%[lhs]:64]!\n"
+      "vmull.u8 q12, d16, d19\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q13, d16, d20\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q14, d16, d21\n"
+      "vmull.u8 q15, d17, d18\n"
+      "vpadal.u16 q0, q11\n"
+      "vpadal.u16 q1, q12\n"
+      "vpadal.u16 q2, q13\n"
+      "vmull.u8 q11, d17, d19\n"
+      "vmull.u8 q12, d17, d20\n"
+      "vmull.u8 q13, d17, d21\n"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vpadal.u16 q3, q14\n"
+      "vpadal.u16 q4, q15\n"
+      "vpadal.u16 q5, q11\n"
+      "vpadal.u16 q6, q12\n"
+      "vpadal.u16 q7, q13\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d16, d17}, [%[lhs]:64]!\n"
+      "vld1.32 {d18, d19}, [%[rhs]:64]!\n"
+      "vdup.32 q10, d16[0]\n"
+      "vdup.32 q8, d16[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d12, d12, d13\n"
+      "vpadd.u32 d14, d14, d15\n"
+      "vpadd.u32 d8, d8, d10\n"
+      "vpadd.u32 d9, d12, d14\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q10\n"
+      "vadd.s32 q4, q4, q8\n"
+      "vadd.s32 q0, q0, q9\n"
+      "vadd.s32 q4, q4, q9\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0, d1}, [%[result]]!\n"
+      "vst1.32 {d8, d9}, [r0]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30",
+        "d31", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 3, 1,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 3, 1, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d6, d7, d8}, [%[lhs]:64]!\n"
+      "vld1.32 {d9}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q5, d9, d6\n"
+      "vmull.u8 q6, d9, d7\n"
+      "vmull.u8 q7, d9, d8\n"
+      "vpadal.u16 q0, q5\n"
+      "vpadal.u16 q1, q6\n"
+      "vpadal.u16 q2, q7\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q3, d8[0]\n"
+      "vdup.32 q6, d8[1]\n"
+      "vdup.32 q4, d9[0]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d0, d0, d0\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d2, d2, d2\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d4, d4, d4\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q3\n"
+      "vadd.s32 q1, q1, q6\n"
+      "vadd.s32 q2, q2, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vadd.s32 q1, q1, q5\n"
+      "vadd.s32 q2, q2, q5\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0[0]}, [%[result]]!\n"
+      "vst1.32 {d2[0]}, [r0]!\n"
+      "vst1.32 {d4[0]}, [r1]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 3, 2,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 3, 2, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d12, d13, d14}, [%[lhs]:64]!\n"
+      "vld1.32 {d15, d16}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q9, d15, d12\n"
+      "vmull.u8 q10, d16, d12\n"
+      "vmull.u8 q11, d15, d13\n"
+      "vmull.u8 q12, d16, d13\n"
+      "vmull.u8 q13, d15, d14\n"
+      "vmull.u8 q14, d16, d14\n"
+      "vpadal.u16 q0, q9\n"
+      "vpadal.u16 q1, q10\n"
+      "vpadal.u16 q2, q11\n"
+      "vpadal.u16 q3, q12\n"
+      "vpadal.u16 q4, q13\n"
+      "vpadal.u16 q5, q14\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d12, d13}, [%[lhs]:64]!\n"
+      "vld1.32 {d14, d15}, [%[rhs]:64]!\n"
+      "vdup.32 q8, d12[0]\n"
+      "vdup.32 q9, d12[1]\n"
+      "vdup.32 q6, d13[0]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d4, d4, d6\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d8, d8, d10\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q8\n"
+      "vadd.s32 q2, q2, q9\n"
+      "vadd.s32 q4, q4, q6\n"
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q2, q2, q7\n"
+      "vadd.s32 q4, q4, q7\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      "vst1.32 {d4}, [r0]!\n"
+      "vst1.32 {d8}, [r1]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 3, 3,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 3, 3, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+      "vmov.i32 q6, q3\n"
+      "vmov.i32 q7, q4\n"
+      "vmov.i32 q8, q5\n"
+
+      // 3x3 lanes loop.
+      "1:"
+
+      "vld1.8 {d21, d22, d23}, [%[rhs]:64]!\n"
+      "vld1.8 {d18}, [%[lhs]:64]!\n"
+      "vmull.u8 q12, d18, d21\n"
+      "vld1.8 {d19}, [%[lhs]:64]!\n"
+      "vmull.u8 q13, d18, d22\n"
+      "vld1.8 {d20}, [%[lhs]:64]!\n"
+      "vmull.u8 q14, d18, d23\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q15, d19, d21\n"
+      "pld [%[rhs], #64]\n"
+      "vpadal.u16 q0, q12\n"
+      "vpadal.u16 q1, q13\n"
+      "vpadal.u16 q2, q14\n"
+      "vpadal.u16 q3, q15\n"
+      "vmull.u8 q12, d19, d22\n"
+      "vmull.u8 q13, d19, d23\n"
+      "vmull.u8 q14, d20, d21\n"
+      "vmull.u8 q15, d20, d22\n"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vmull.u8 q9, d20, d23\n"
+      "vpadal.u16 q4, q12\n"
+      "vpadal.u16 q5, q13\n"
+      "vpadal.u16 q6, q14\n"
+      "vpadal.u16 q7, q15\n"
+      "vpadal.u16 q8, q9\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "vld1.32 {d18, d19}, [%[lhs]:64]!\n"
+      "vld1.32 {d20, d21}, [%[rhs]:64]!\n"
+      "vdup.32 q11, d18[0]\n"
+      "vdup.32 q12, d18[1]\n"
+      "vdup.32 q9, d19[0]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d4\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d6, d6, d8\n"
+      "vpadd.u32 d7, d10, d10\n"
+      "vpadd.u32 d12, d12, d13\n"
+      "vpadd.u32 d14, d14, d15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d12, d12, d14\n"
+      "vpadd.u32 d13, d16, d16\n"
+
+      // StaticQuantizationInt32::Transform
+      "vadd.s32 q0, q0, q11\n"
+      "vadd.s32 q3, q3, q12\n"
+      "vadd.s32 q6, q6, q9\n"
+      "vadd.s32 q0, q0, q10\n"
+      "vadd.s32 q3, q3, q10\n"
+      "vadd.s32 q6, q6, q10\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      "vst1.32 {d1[0]}, [%[result]]!\n"
+      "vst1.32 {d6}, [r0]!\n"
+      "vst1.32 {d7[0]}, [r0]!\n"
+      "vst1.32 {d12}, [r1]!\n"
+      "vst1.32 {d13[0]}, [r1]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "d30", "d31", "cc", "memory");
+}
+
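+// The QuantizedStaticPreprocessedAsFloat kernels below share the uint8
+// accumulation and reduction steps of the int32 kernels above; they differ
+// only in the tail, where the int32 sums are converted with vcvt.f32.s32 and
+// multiplied by the run-time scale passed in params.kernel.scale before being
+// stored.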
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 1,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 1, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d2}, [%[lhs]:64]!\n"
+      "vld1.32 {d3}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q2, d3, d2\n"
+      "vpadal.u16 q0, q2\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[scale]\n"
+      "vdup.32 q4, d8[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d0, d0, d0\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vmul.f32 q0, q0, q6\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0[0]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d8", "d9", "d10", "d11", "d12",
+        "d13", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 2,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 2, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d4}, [%[lhs]:64]!\n"
+      "vld1.32 {d5, d6}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q4, d5, d4\n"
+      "vmull.u8 q5, d6, d4\n"
+      "vpadal.u16 q0, q4\n"
+      "vpadal.u16 q1, q5\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[scale]\n"
+      "vdup.32 q4, d8[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d0, d0, d2\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vmul.f32 q0, q0, q6\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "d10", "d11",
+        "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 3,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 3, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d6}, [%[lhs]:64]!\n"
+      "vld1.32 {d7, d8, d9}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q5, d7, d6\n"
+      "vmull.u8 q6, d8, d6\n"
+      "vmull.u8 q7, d9, d6\n"
+      "vpadal.u16 q0, q5\n"
+      "vpadal.u16 q1, q6\n"
+      "vpadal.u16 q2, q7\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[scale]\n"
+      "vdup.32 q4, d8[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d4\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vmul.f32 q0, q0, q6\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      "vst1.32 {d1[0]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 4,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 4, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d8}, [%[lhs]:64]!\n"
+      "vld1.32 {d9, d10, d11, d12}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q7, d9, d8\n"
+      "vmull.u8 q8, d10, d8\n"
+      "vmull.u8 q9, d11, d8\n"
+      "vmull.u8 q10, d12, d8\n"
+      "vpadal.u16 q0, q7\n"
+      "vpadal.u16 q1, q8\n"
+      "vpadal.u16 q2, q9\n"
+      "vpadal.u16 q3, q10\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[scale]\n"
+      "vdup.32 q4, d8[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vmul.f32 q0, q0, q6\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0, d1}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 5,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 5, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d10, d11, d12, d13}, [%[rhs]:64]!\n"
+      "vld1.32 {d14}, [%[lhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q8, d10, d14\n"
+      "vmull.u8 q9, d11, d14\n"
+      "vmull.u8 q10, d12, d14\n"
+      "vmull.u8 q11, d13, d14\n"
+      "vld1.32 {d10}, [%[rhs]:64]!\n"
+      "pld [%[rhs], #128]\n"
+      "vpadal.u16 q0, q8\n"
+      "vpadal.u16 q1, q9\n"
+      "vpadal.u16 q2, q10\n"
+      "vpadal.u16 q3, q11\n"
+      "vmull.u8 q8, d10, d14\n"
+      "vpadal.u16 q4, q8\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d10, d11}, [%[lhs]:64]!\n"
+      "vld1.32 {d12, d13, d14, d15}, [%[rhs]:64]!\n"
+      "vdup.32 q8, %[scale]\n"
+      "vdup.32 q5, d10[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d2, d8, d8\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q5\n"
+      "vadd.s32 q1, q1, q5\n"
+      "vadd.s32 q0, q0, q6\n"
+      "vadd.s32 q1, q1, q7\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0, d1}, [%[result]]!\n"
+      "vst1.32 {d2[0]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 6,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 6, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d12, d13, d14, d15}, [%[rhs]:64]!\n"
+      "vld1.32 {d16}, [%[lhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q9, d12, d16\n"
+      "vmull.u8 q10, d13, d16\n"
+      "vmull.u8 q11, d14, d16\n"
+      "vmull.u8 q12, d15, d16\n"
+      "vld1.32 {d12, d13}, [%[rhs]:64]!\n"
+      "pld [%[rhs], #128]\n"
+      "vpadal.u16 q0, q9\n"
+      "vpadal.u16 q1, q10\n"
+      "vpadal.u16 q2, q11\n"
+      "vpadal.u16 q3, q12\n"
+      "vmull.u8 q9, d12, d16\n"
+      "vmull.u8 q10, d13, d16\n"
+      "vpadal.u16 q4, q9\n"
+      "vpadal.u16 q5, q10\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d12, d13}, [%[lhs]:64]!\n"
+      "vld1.32 {d14, d15, d16, d17}, [%[rhs]:64]!\n"
+      "vdup.32 q9, %[scale]\n"
+      "vdup.32 q6, d12[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d2, d8, d10\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q6\n"
+      "vadd.s32 q1, q1, q6\n"
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q1, q1, q8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0, d1, d2}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 7,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 7, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+      "vmov.i32 q6, q3\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d14, d15, d16, d17}, [%[rhs]:64]!\n"
+      "vld1.32 {d18}, [%[lhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q10, d14, d18\n"
+      "vmull.u8 q11, d15, d18\n"
+      "vmull.u8 q12, d16, d18\n"
+      "vmull.u8 q13, d17, d18\n"
+      "vld1.32 {d14, d15, d16}, [%[rhs]:64]!\n"
+      "pld [%[rhs], #128]\n"
+      "vpadal.u16 q0, q10\n"
+      "vpadal.u16 q1, q11\n"
+      "vpadal.u16 q2, q12\n"
+      "vpadal.u16 q3, q13\n"
+      "vmull.u8 q10, d14, d18\n"
+      "vmull.u8 q11, d15, d18\n"
+      "vmull.u8 q12, d16, d18\n"
+      "vpadal.u16 q4, q10\n"
+      "vpadal.u16 q5, q11\n"
+      "vpadal.u16 q6, q12\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d14, d15}, [%[lhs]:64]!\n"
+      "vld1.32 {d16, d17, d18, d19}, [%[rhs]:64]!\n"
+      "vdup.32 q10, %[scale]\n"
+      "vdup.32 q7, d14[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d12, d12, d13\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d2, d8, d10\n"
+      "vpadd.u32 d3, d12, d12\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q1, q1, q7\n"
+      "vadd.s32 q0, q0, q8\n"
+      "vadd.s32 q1, q1, q9\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vmul.f32 q0, q0, q10\n"
+      "vmul.f32 q1, q1, q10\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0, d1, d2}, [%[result]]!\n"
+      "vst1.32 {d3[0]}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 8,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 8, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+      "vmov.i32 q6, q3\n"
+      "vmov.i32 q7, q4\n"
+
+      // 1x8 lanes loop.
+      "1:"
+
+      "vld1.32 {d17, d18, d19, d20}, [%[rhs]:256]!\n"
+      "vld1.32 {d16}, [%[lhs]:64]!\n"
+      "vmull.u8 q11, d16, d17\n"
+      "vmull.u8 q12, d16, d18\n"
+      "vmull.u8 q13, d16, d19\n"
+      "vmull.u8 q14, d16, d20\n"
+      "vld1.32 {d17, d18, d19, d20}, [%[rhs]:256]!\n"
+      "vpadal.u16 q0, q11\n"
+      "vpadal.u16 q1, q12\n"
+      "vpadal.u16 q2, q13\n"
+      "vpadal.u16 q3, q14\n"
+      "pld [%[rhs], #256]\n"
+      "vmull.u8 q15, d16, d17\n"
+      "vmull.u8 q11, d16, d18\n"
+      "vmull.u8 q12, d16, d19\n"
+      "vmull.u8 q13, d16, d20\n"
+      "pld [%[lhs], #32]\n"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vpadal.u16 q4, q15\n"
+      "vpadal.u16 q5, q11\n"
+      "vpadal.u16 q6, q12\n"
+      "vpadal.u16 q7, q13\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d16, d17}, [%[lhs]:64]!\n"
+      "vld1.32 {d18, d19, d20, d21}, [%[rhs]:64]!\n"
+      "vdup.32 q11, %[scale]\n"
+      "vdup.32 q8, d16[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d12, d12, d13\n"
+      "vpadd.u32 d14, d14, d15\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d2, d8, d10\n"
+      "vpadd.u32 d3, d12, d14\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q8\n"
+      "vadd.s32 q1, q1, q8\n"
+      "vadd.s32 q0, q0, q9\n"
+      "vadd.s32 q1, q1, q10\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vmul.f32 q0, q0, q11\n"
+      "vmul.f32 q1, q1, q11\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0, d1, d2, d3}, [%[result]]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30",
+        "d31", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 1,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 1, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d4, d5}, [%[lhs]:64]!\n"
+      "vld1.32 {d6}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q4, d6, d4\n"
+      "vmull.u8 q5, d6, d5\n"
+      "vpadal.u16 q0, q4\n"
+      "vpadal.u16 q1, q5\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[scale]\n"
+      "vdup.32 q2, d8[0]\n"
+      "vdup.32 q4, d8[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d0, d0, d0\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d2, d2, d2\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q2\n"
+      "vadd.s32 q1, q1, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vadd.s32 q1, q1, q5\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0[0]}, [%[result]]!\n"
+      "vst1.32 {d2[0]}, [r0]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 2,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 2, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q6, d10, d8\n"
+      "vmull.u8 q7, d11, d8\n"
+      "vmull.u8 q8, d10, d9\n"
+      "vmull.u8 q9, d11, d9\n"
+      "vpadal.u16 q0, q6\n"
+      "vpadal.u16 q1, q7\n"
+      "vpadal.u16 q2, q8\n"
+      "vpadal.u16 q3, q9\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[scale]\n"
+      "vdup.32 q7, d8[0]\n"
+      "vdup.32 q4, d8[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d4, d4, d6\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q2, q2, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vadd.s32 q2, q2, q5\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      "vst1.32 {d4}, [r0]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "cc",
+        "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 3,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 3, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d12, d13}, [%[lhs]:64]!\n"
+      "vld1.32 {d14, d15, d16}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q9, d14, d12\n"
+      "vmull.u8 q10, d15, d12\n"
+      "vmull.u8 q11, d16, d12\n"
+      "vmull.u8 q12, d14, d13\n"
+      "vmull.u8 q13, d15, d13\n"
+      "vmull.u8 q14, d16, d13\n"
+      "vpadal.u16 q0, q9\n"
+      "vpadal.u16 q1, q10\n"
+      "vpadal.u16 q2, q11\n"
+      "vpadal.u16 q3, q12\n"
+      "vpadal.u16 q4, q13\n"
+      "vpadal.u16 q5, q14\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d12, d13}, [%[lhs]:64]!\n"
+      "vld1.32 {d14, d15}, [%[rhs]:64]!\n"
+      "vdup.32 q8, %[scale]\n"
+      "vdup.32 q9, d12[0]\n"
+      "vdup.32 q6, d12[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d4\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d6, d6, d8\n"
+      "vpadd.u32 d7, d10, d10\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q9\n"
+      "vadd.s32 q3, q3, q6\n"
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q3, q3, q7\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      "vst1.32 {d1[0]}, [%[result]]!\n"
+      "vst1.32 {d6}, [r0]!\n"
+      "vst1.32 {d7[0]}, [r0]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "cc",
+        "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 4,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 4, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+      "vmov.i32 q6, q3\n"
+      "vmov.i32 q7, q4\n"
+
+      // 2x4 lanes loop.
+      "1:"
+
+      "vld1.8 {d18, d19, d20, d21}, [%[rhs]:256]!\n"
+      "vld1.8 {d16}, [%[lhs]:64]!\n"
+      "vmull.u8 q11, d16, d18\n"
+      "vld1.8 {d17}, [%[lhs]:64]!\n"
+      "vmull.u8 q12, d16, d19\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q13, d16, d20\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q14, d16, d21\n"
+      "vmull.u8 q15, d17, d18\n"
+      "vpadal.u16 q0, q11\n"
+      "vpadal.u16 q1, q12\n"
+      "vpadal.u16 q2, q13\n"
+      "vmull.u8 q11, d17, d19\n"
+      "vmull.u8 q12, d17, d20\n"
+      "vmull.u8 q13, d17, d21\n"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vpadal.u16 q3, q14\n"
+      "vpadal.u16 q4, q15\n"
+      "vpadal.u16 q5, q11\n"
+      "vpadal.u16 q6, q12\n"
+      "vpadal.u16 q7, q13\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d16, d17}, [%[lhs]:64]!\n"
+      "vld1.32 {d18, d19}, [%[rhs]:64]!\n"
+      "vdup.32 q10, %[scale]\n"
+      "vdup.32 q11, d16[0]\n"
+      "vdup.32 q8, d16[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d6\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d12, d12, d13\n"
+      "vpadd.u32 d14, d14, d15\n"
+      "vpadd.u32 d8, d8, d10\n"
+      "vpadd.u32 d9, d12, d14\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q11\n"
+      "vadd.s32 q4, q4, q8\n"
+      "vadd.s32 q0, q0, q9\n"
+      "vadd.s32 q4, q4, q9\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vmul.f32 q0, q0, q10\n"
+      "vmul.f32 q4, q4, q10\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0, d1}, [%[result]]!\n"
+      "vst1.32 {d8, d9}, [r0]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30",
+        "d31", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 3, 1,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 3, 1, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d6, d7, d8}, [%[lhs]:64]!\n"
+      "vld1.32 {d9}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q5, d9, d6\n"
+      "vmull.u8 q6, d9, d7\n"
+      "vmull.u8 q7, d9, d8\n"
+      "vpadal.u16 q0, q5\n"
+      "vpadal.u16 q1, q6\n"
+      "vpadal.u16 q2, q7\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d8, d9}, [%[lhs]:64]!\n"
+      "vld1.32 {d10, d11}, [%[rhs]:64]!\n"
+      "vdup.32 q6, %[scale]\n"
+      "vdup.32 q3, d8[0]\n"
+      "vdup.32 q7, d8[1]\n"
+      "vdup.32 q4, d9[0]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d0, d0, d0\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d2, d2, d2\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d4, d4, d4\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q3\n"
+      "vadd.s32 q1, q1, q7\n"
+      "vadd.s32 q2, q2, q4\n"
+      "vadd.s32 q0, q0, q5\n"
+      "vadd.s32 q1, q1, q5\n"
+      "vadd.s32 q2, q2, q5\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0[0]}, [%[result]]!\n"
+      "vst1.32 {d2[0]}, [r0]!\n"
+      "vst1.32 {d4[0]}, [r1]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 3, 2,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 3, 2, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vld1.32 {d12, d13, d14}, [%[lhs]:64]!\n"
+      "vld1.32 {d15, d16}, [%[rhs]:64]!\n"
+      "pld [%[lhs], #64]\n"
+      "pld [%[rhs], #64]\n"
+      "vmull.u8 q9, d15, d12\n"
+      "vmull.u8 q10, d16, d12\n"
+      "vmull.u8 q11, d15, d13\n"
+      "vmull.u8 q12, d16, d13\n"
+      "vmull.u8 q13, d15, d14\n"
+      "vmull.u8 q14, d16, d14\n"
+      "vpadal.u16 q0, q9\n"
+      "vpadal.u16 q1, q10\n"
+      "vpadal.u16 q2, q11\n"
+      "vpadal.u16 q3, q12\n"
+      "vpadal.u16 q4, q13\n"
+      "vpadal.u16 q5, q14\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d12, d13}, [%[lhs]:64]!\n"
+      "vld1.32 {d14, d15}, [%[rhs]:64]!\n"
+      "vdup.32 q8, %[scale]\n"
+      "vdup.32 q9, d12[0]\n"
+      "vdup.32 q10, d12[1]\n"
+      "vdup.32 q6, d13[0]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d4, d4, d6\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d8, d8, d10\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q9\n"
+      "vadd.s32 q2, q2, q10\n"
+      "vadd.s32 q4, q4, q6\n"
+      "vadd.s32 q0, q0, q7\n"
+      "vadd.s32 q2, q2, q7\n"
+      "vadd.s32 q4, q4, q7\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q4, q4, q8\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      "vst1.32 {d4}, [r0]!\n"
+      "vst1.32 {d8}, [r1]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 3, 3,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 3, 3, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "pld [%[lhs]]\n"
+      "pld [%[rhs]]\n"
+
+      // Clear aggregators.
+      "vmov.i32 q0, #0\n"
+      "vmov.i32 q1, #0\n"
+      "vmov.i32 q2, #0\n"
+      "vmov.i32 q3, q0\n"
+      "vmov.i32 q4, q1\n"
+      "vmov.i32 q5, q2\n"
+      "vmov.i32 q6, q3\n"
+      "vmov.i32 q7, q4\n"
+      "vmov.i32 q8, q5\n"
+
+      // 3x3 lanes loop.
+      "1:"
+
+      "vld1.8 {d21, d22, d23}, [%[rhs]:64]!\n"
+      "vld1.8 {d18}, [%[lhs]:64]!\n"
+      "vmull.u8 q12, d18, d21\n"
+      "vld1.8 {d19}, [%[lhs]:64]!\n"
+      "vmull.u8 q13, d18, d22\n"
+      "vld1.8 {d20}, [%[lhs]:64]!\n"
+      "vmull.u8 q14, d18, d23\n"
+      "pld [%[lhs], #64]\n"
+      "vmull.u8 q15, d19, d21\n"
+      "pld [%[rhs], #64]\n"
+      "vpadal.u16 q0, q12\n"
+      "vpadal.u16 q1, q13\n"
+      "vpadal.u16 q2, q14\n"
+      "vpadal.u16 q3, q15\n"
+      "vmull.u8 q12, d19, d22\n"
+      "vmull.u8 q13, d19, d23\n"
+      "vmull.u8 q14, d20, d21\n"
+      "vmull.u8 q15, d20, d22\n"
+
+      // Subtract counter.
+      "subs %[count], %[count], #8\n"
+
+      "vmull.u8 q9, d20, d23\n"
+      "vpadal.u16 q4, q12\n"
+      "vpadal.u16 q5, q13\n"
+      "vpadal.u16 q6, q14\n"
+      "vpadal.u16 q7, q15\n"
+      "vpadal.u16 q8, q9\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "vld1.32 {d18, d19}, [%[lhs]:64]!\n"
+      "vld1.32 {d20, d21}, [%[rhs]:64]!\n"
+      "vdup.32 q11, %[scale]\n"
+      "vdup.32 q12, d18[0]\n"
+      "vdup.32 q13, d18[1]\n"
+      "vdup.32 q9, d19[0]\n"
+
+      // RowMajorOutput::Prepare
+      "add r0, %[result], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+
+      // Reduce aggregators.
+      "vpadd.u32 d0, d0, d1\n"
+      "vpadd.u32 d2, d2, d3\n"
+      "vpadd.u32 d4, d4, d5\n"
+      "vpadd.u32 d0, d0, d2\n"
+      "vpadd.u32 d1, d4, d4\n"
+      "vpadd.u32 d6, d6, d7\n"
+      "vpadd.u32 d8, d8, d9\n"
+      "vpadd.u32 d10, d10, d11\n"
+      "vpadd.u32 d6, d6, d8\n"
+      "vpadd.u32 d7, d10, d10\n"
+      "vpadd.u32 d12, d12, d13\n"
+      "vpadd.u32 d14, d14, d15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d12, d12, d14\n"
+      "vpadd.u32 d13, d16, d16\n"
+
+      // StaticQuantizationFloat::Transform
+      "vadd.s32 q0, q0, q12\n"
+      "vadd.s32 q3, q3, q13\n"
+      "vadd.s32 q6, q6, q9\n"
+      "vadd.s32 q0, q0, q10\n"
+      "vadd.s32 q3, q3, q10\n"
+      "vadd.s32 q6, q6, q10\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vmul.f32 q0, q0, q11\n"
+      "vmul.f32 q3, q3, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+
+      // RowMajorOutput::Output
+      "vst1.32 {d0}, [%[result]]!\n"
+      "vst1.32 {d1[0]}, [%[result]]!\n"
+      "vst1.32 {d6}, [r0]!\n"
+      "vst1.32 {d7[0]}, [r0]!\n"
+      "vst1.32 {d12}, [r1]!\n"
+      "vst1.32 {d13[0]}, [r1]!\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "d30", "d31", "cc", "memory");
+}
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#else
+#warning "Meta gemm for arm32 requires: GEMMLOWP_NEON_32!"
+#endif
+
+#endif  // GEMMLOWP_META_QUANTIZED_MUL_KERNELS_ARM_32_H_
diff --git a/meta/quantized_mul_kernels_arm_64.h b/meta/quantized_mul_kernels_arm_64.h
new file mode 100644
index 0000000..abda496
--- /dev/null
+++ b/meta/quantized_mul_kernels_arm_64.h
@@ -0,0 +1,4106 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_QUANTIZED_MUL_KERNELS_ARM_64_H_
+#define GEMMLOWP_META_QUANTIZED_MUL_KERNELS_ARM_64_H_
+
+#ifdef GEMMLOWP_NEON_64
+
+#include <cassert>
+#include <cstdint>
+
+namespace gemmlowp {
+namespace meta {
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 1,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 1, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v1.2s}, [%x[lhs]], #8\n"
+      "ld1 {v2.2s}, [%x[rhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v3.8h, v2.8b, v1.8b\n"
+      "uadalp v0.4s, v3.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[multiplicative_offset]\n"
+      "dup v7.4s, %w[rounding_offset]\n"
+      "dup v8.4s, %w[shift]\n"
+      "dup v4.4s, v4.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "mul v0.4s, v0.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "sshl v0.4s, v0.4s, v8.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.b}[0], [%x[result]], #1\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 2,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 2, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v2.2s}, [%x[lhs]], #8\n"
+      "ld1 {v3.2s, v4.2s}, [%x[rhs]], #16\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v5.8h, v3.8b, v2.8b\n"
+      "umull v6.8h, v4.8b, v2.8b\n"
+      "uadalp v0.4s, v5.8h\n"
+      "uadalp v1.4s, v6.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[multiplicative_offset]\n"
+      "dup v7.4s, %w[rounding_offset]\n"
+      "dup v8.4s, %w[shift]\n"
+      "dup v4.4s, v4.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "mul v0.4s, v0.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "sshl v0.4s, v0.4s, v8.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.h}[0], [%x[result]], #2\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 3,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 3, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v3.2s}, [%x[lhs]], #8\n"
+      "ld1 {v4.2s, v5.2s, v6.2s}, [%x[rhs]], #24\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v7.8h, v4.8b, v3.8b\n"
+      "umull v8.8h, v5.8b, v3.8b\n"
+      "umull v9.8h, v6.8b, v3.8b\n"
+      "uadalp v0.4s, v7.8h\n"
+      "uadalp v1.4s, v8.8h\n"
+      "uadalp v2.4s, v9.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[multiplicative_offset]\n"
+      "dup v7.4s, %w[rounding_offset]\n"
+      "dup v8.4s, %w[shift]\n"
+      "dup v4.4s, v4.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "mul v0.4s, v0.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "sshl v0.4s, v0.4s, v8.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.h}[0], [%x[result]], #2\n"
+      "st1 {v0.b}[2], [%x[result]], #1\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "cc",
+        "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 4,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 4, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v4.2s}, [%x[lhs]], #8\n"
+      "ld1 {v5.2s, v6.2s, v7.2s, v8.2s}, [%x[rhs]], #32\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v9.8h, v5.8b, v4.8b\n"
+      "umull v10.8h, v6.8b, v4.8b\n"
+      "umull v11.8h, v7.8b, v4.8b\n"
+      "umull v12.8h, v8.8b, v4.8b\n"
+      "uadalp v0.4s, v9.8h\n"
+      "uadalp v1.4s, v10.8h\n"
+      "uadalp v2.4s, v11.8h\n"
+      "uadalp v3.4s, v12.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[multiplicative_offset]\n"
+      "dup v7.4s, %w[rounding_offset]\n"
+      "dup v8.4s, %w[shift]\n"
+      "dup v4.4s, v4.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "mul v0.4s, v0.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "sshl v0.4s, v0.4s, v8.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.s}[0], [%x[result]], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 5,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 5, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v5.2s, v6.2s, v7.2s, v8.2s}, [%x[rhs]], #32\n"
+      "ld1 {v9.2s}, [%x[lhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v10.8h, v5.8b, v9.8b\n"
+      "umull v11.8h, v6.8b, v9.8b\n"
+      "umull v12.8h, v7.8b, v9.8b\n"
+      "umull v13.8h, v8.8b, v9.8b\n"
+      "ld1 {v5.2s}, [%x[rhs]], #8\n"
+      "prfm pldl1keep, [%x[rhs], #128]\n"
+      "uadalp v0.4s, v10.8h\n"
+      "uadalp v1.4s, v11.8h\n"
+      "uadalp v2.4s, v12.8h\n"
+      "uadalp v3.4s, v13.8h\n"
+      "umull v10.8h, v5.8b, v9.8b\n"
+      "uadalp v4.4s, v10.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v5.4s}, [%x[lhs]], #16\n"
+      "ld1 {v6.4s, v7.4s}, [%x[rhs]], #32\n"
+      "dup v8.4s, %w[multiplicative_offset]\n"
+      "dup v9.4s, %w[rounding_offset]\n"
+      "dup v10.4s, %w[shift]\n"
+      "dup v5.4s, v5.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v4.4s, v4.4s, v4.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v1.4s, v4.4s, v4.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v5.4s\n"
+      "add v1.4s, v1.4s, v5.4s\n"
+      "add v0.4s, v0.4s, v6.4s\n"
+      "add v1.4s, v1.4s, v7.4s\n"
+      "mul v0.4s, v0.4s, v8.4s\n"
+      "mul v1.4s, v1.4s, v8.4s\n"
+      "add v0.4s, v0.4s, v9.4s\n"
+      "add v1.4s, v1.4s, v9.4s\n"
+      "sshl v0.4s, v0.4s, v10.4s\n"
+      "sshl v1.4s, v1.4s, v10.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.s}[0], [%x[result]], #4\n"
+      "st1 {v0.b}[4], [%x[result]], #1\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 6,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 6, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v6.2s, v7.2s, v8.2s, v9.2s}, [%x[rhs]], #32\n"
+      "ld1 {v10.2s}, [%x[lhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v11.8h, v6.8b, v10.8b\n"
+      "umull v12.8h, v7.8b, v10.8b\n"
+      "umull v13.8h, v8.8b, v10.8b\n"
+      "umull v14.8h, v9.8b, v10.8b\n"
+      "ld1 {v6.2s, v7.2s}, [%x[rhs]], #16\n"
+      "prfm pldl1keep, [%x[rhs], #128]\n"
+      "uadalp v0.4s, v11.8h\n"
+      "uadalp v1.4s, v12.8h\n"
+      "uadalp v2.4s, v13.8h\n"
+      "uadalp v3.4s, v14.8h\n"
+      "umull v11.8h, v6.8b, v10.8b\n"
+      "umull v12.8h, v7.8b, v10.8b\n"
+      "uadalp v4.4s, v11.8h\n"
+      "uadalp v5.4s, v12.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v6.4s}, [%x[lhs]], #16\n"
+      "ld1 {v7.4s, v8.4s}, [%x[rhs]], #32\n"
+      "dup v9.4s, %w[multiplicative_offset]\n"
+      "dup v10.4s, %w[rounding_offset]\n"
+      "dup v11.4s, %w[shift]\n"
+      "dup v6.4s, v6.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v1.4s, v4.4s, v4.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v6.4s\n"
+      "add v1.4s, v1.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v1.4s, v1.4s, v8.4s\n"
+      "mul v0.4s, v0.4s, v9.4s\n"
+      "mul v1.4s, v1.4s, v9.4s\n"
+      "add v0.4s, v0.4s, v10.4s\n"
+      "add v1.4s, v1.4s, v10.4s\n"
+      "sshl v0.4s, v0.4s, v11.4s\n"
+      "sshl v1.4s, v1.4s, v11.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.s}[0], [%x[result]], #4\n"
+      "st1 {v0.h}[2], [%x[result]], #2\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 7,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 7, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+      "mov v6.16b, v3.16b\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v7.2s, v8.2s, v9.2s, v10.2s}, [%x[rhs]], #32\n"
+      "ld1 {v11.2s}, [%x[lhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v12.8h, v7.8b, v11.8b\n"
+      "umull v13.8h, v8.8b, v11.8b\n"
+      "umull v14.8h, v9.8b, v11.8b\n"
+      "umull v15.8h, v10.8b, v11.8b\n"
+      "ld1 {v7.2s, v8.2s, v9.2s}, [%x[rhs]], #24\n"
+      "prfm pldl1keep, [%x[rhs], #128]\n"
+      "uadalp v0.4s, v12.8h\n"
+      "uadalp v1.4s, v13.8h\n"
+      "uadalp v2.4s, v14.8h\n"
+      "uadalp v3.4s, v15.8h\n"
+      "umull v12.8h, v7.8b, v11.8b\n"
+      "umull v13.8h, v8.8b, v11.8b\n"
+      "umull v14.8h, v9.8b, v11.8b\n"
+      "uadalp v4.4s, v12.8h\n"
+      "uadalp v5.4s, v13.8h\n"
+      "uadalp v6.4s, v14.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v7.4s}, [%x[lhs]], #16\n"
+      "ld1 {v8.4s, v9.4s}, [%x[rhs]], #32\n"
+      "dup v10.4s, %w[multiplicative_offset]\n"
+      "dup v11.4s, %w[rounding_offset]\n"
+      "dup v12.4s, %w[shift]\n"
+      "dup v7.4s, v7.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v6.4s, v6.4s, v6.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v1.4s, v4.4s, v6.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v1.4s, v1.4s, v7.4s\n"
+      "add v0.4s, v0.4s, v8.4s\n"
+      "add v1.4s, v1.4s, v9.4s\n"
+      "mul v0.4s, v0.4s, v10.4s\n"
+      "mul v1.4s, v1.4s, v10.4s\n"
+      "add v0.4s, v0.4s, v11.4s\n"
+      "add v1.4s, v1.4s, v11.4s\n"
+      "sshl v0.4s, v0.4s, v12.4s\n"
+      "sshl v1.4s, v1.4s, v12.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.s}[0], [%x[result]], #4\n"
+      "st1 {v0.h}[2], [%x[result]], #2\n"
+      "st1 {v0.b}[6], [%x[result]], #1\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 1, 8,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 1, 8, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+      "mov v6.16b, v3.16b\n"
+      "mov v7.16b, v4.16b\n"
+
+      // 1x8 lanes loop.
+      "1:"
+
+      "ld1 {v9.2s, v10.2s, v11.2s, v12.2s}, [%x[rhs]], #32\n"
+      "ld1 {v8.2s}, [%x[lhs]], #8\n"
+      "umull v13.8h, v8.8b, v9.8b\n"
+      "umull v14.8h, v8.8b, v10.8b\n"
+      "umull v15.8h, v8.8b, v11.8b\n"
+      "umull v16.8h, v8.8b, v12.8b\n"
+      "ld1 {v9.2s, v10.2s, v11.2s, v12.2s}, [%x[rhs]], #32\n"
+      "uadalp v0.4s, v13.8h\n"
+      "uadalp v1.4s, v14.8h\n"
+      "uadalp v2.4s, v15.8h\n"
+      "uadalp v3.4s, v16.8h\n"
+      "prfm pldl1keep, [%x[rhs], #256]\n"
+      "umull v17.8h, v8.8b, v9.8b\n"
+      "umull v13.8h, v8.8b, v10.8b\n"
+      "umull v14.8h, v8.8b, v11.8b\n"
+      "umull v15.8h, v8.8b, v12.8b\n"
+      "prfm pldl1keep, [%x[lhs], #32]\n"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "uadalp v4.4s, v17.8h\n"
+      "uadalp v5.4s, v13.8h\n"
+      "uadalp v6.4s, v14.8h\n"
+      "uadalp v7.4s, v15.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v8.4s}, [%x[lhs]], #16\n"
+      "ld1 {v9.4s, v10.4s}, [%x[rhs]], #32\n"
+      "dup v11.4s, %w[multiplicative_offset]\n"
+      "dup v12.4s, %w[rounding_offset]\n"
+      "dup v13.4s, %w[shift]\n"
+      "dup v8.4s, v8.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v6.4s, v6.4s, v7.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v1.4s, v4.4s, v6.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v8.4s\n"
+      "add v1.4s, v1.4s, v8.4s\n"
+      "add v0.4s, v0.4s, v9.4s\n"
+      "add v1.4s, v1.4s, v10.4s\n"
+      "mul v0.4s, v0.4s, v11.4s\n"
+      "mul v1.4s, v1.4s, v11.4s\n"
+      "add v0.4s, v0.4s, v12.4s\n"
+      "add v1.4s, v1.4s, v12.4s\n"
+      "sshl v0.4s, v0.4s, v13.4s\n"
+      "sshl v1.4s, v1.4s, v13.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 2, 1,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 2, 1, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v2.2s, v3.2s}, [%x[lhs]], #16\n"
+      "ld1 {v4.2s}, [%x[rhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v5.8h, v4.8b, v2.8b\n"
+      "umull v6.8h, v4.8b, v3.8b\n"
+      "uadalp v0.4s, v5.8h\n"
+      "uadalp v1.4s, v6.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[multiplicative_offset]\n"
+      "dup v7.4s, %w[rounding_offset]\n"
+      "dup v8.4s, %w[shift]\n"
+      "dup v2.4s, v4.s[0]\n"
+      "dup v4.4s, v4.s[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v1.4s, v1.4s, v1.4s\n"
+      "addp v1.4s, v1.4s, v1.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v2.4s\n"
+      "add v1.4s, v1.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "add v1.4s, v1.4s, v5.4s\n"
+      "mul v0.4s, v0.4s, v6.4s\n"
+      "mul v1.4s, v1.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v1.4s, v1.4s, v7.4s\n"
+      "sshl v0.4s, v0.4s, v8.4s\n"
+      "sshl v1.4s, v1.4s, v8.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn v1.4h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun v1.8b, v1.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.b}[0], [%x[result]], #1\n"
+      "st1 {v1.b}[0], [x0], #1\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc",
+        "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 2, 2,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 2, 2, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v4.2s, v5.2s}, [%x[lhs]], #16\n"
+      "ld1 {v6.2s, v7.2s}, [%x[rhs]], #16\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v8.8h, v6.8b, v4.8b\n"
+      "umull v9.8h, v7.8b, v4.8b\n"
+      "umull v10.8h, v6.8b, v5.8b\n"
+      "umull v11.8h, v7.8b, v5.8b\n"
+      "uadalp v0.4s, v8.8h\n"
+      "uadalp v1.4s, v9.8h\n"
+      "uadalp v2.4s, v10.8h\n"
+      "uadalp v3.4s, v11.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[multiplicative_offset]\n"
+      "dup v7.4s, %w[rounding_offset]\n"
+      "dup v8.4s, %w[shift]\n"
+      "dup v9.4s, v4.s[0]\n"
+      "dup v4.4s, v4.s[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v9.4s\n"
+      "add v2.4s, v2.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "add v2.4s, v2.4s, v5.4s\n"
+      "mul v0.4s, v0.4s, v6.4s\n"
+      "mul v2.4s, v2.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v2.4s, v2.4s, v7.4s\n"
+      "sshl v0.4s, v0.4s, v8.4s\n"
+      "sshl v2.4s, v2.4s, v8.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun v2.8b, v2.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.h}[0], [%x[result]], #2\n"
+      "st1 {v2.h}[0], [x0], #2\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 2, 3,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 2, 3, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v6.2s, v7.2s}, [%x[lhs]], #16\n"
+      "ld1 {v8.2s, v9.2s, v10.2s}, [%x[rhs]], #24\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v11.8h, v8.8b, v6.8b\n"
+      "umull v12.8h, v9.8b, v6.8b\n"
+      "umull v13.8h, v10.8b, v6.8b\n"
+      "umull v14.8h, v8.8b, v7.8b\n"
+      "umull v15.8h, v9.8b, v7.8b\n"
+      "umull v16.8h, v10.8b, v7.8b\n"
+      "uadalp v0.4s, v11.8h\n"
+      "uadalp v1.4s, v12.8h\n"
+      "uadalp v2.4s, v13.8h\n"
+      "uadalp v3.4s, v14.8h\n"
+      "uadalp v4.4s, v15.8h\n"
+      "uadalp v5.4s, v16.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v6.4s}, [%x[lhs]], #16\n"
+      "ld1 {v7.4s}, [%x[rhs]], #16\n"
+      "dup v8.4s, %w[multiplicative_offset]\n"
+      "dup v9.4s, %w[rounding_offset]\n"
+      "dup v10.4s, %w[shift]\n"
+      "dup v11.4s, v6.s[0]\n"
+      "dup v6.4s, v6.s[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v3.4s, v3.4s, v4.4s\n"
+      "addp v5.4s, v5.4s, v5.4s\n"
+      "addp v3.4s, v3.4s, v5.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v11.4s\n"
+      "add v3.4s, v3.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v3.4s, v3.4s, v7.4s\n"
+      "mul v0.4s, v0.4s, v8.4s\n"
+      "mul v3.4s, v3.4s, v8.4s\n"
+      "add v0.4s, v0.4s, v9.4s\n"
+      "add v3.4s, v3.4s, v9.4s\n"
+      "sshl v0.4s, v0.4s, v10.4s\n"
+      "sshl v3.4s, v3.4s, v10.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn v3.4h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun v3.8b, v3.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.h}[0], [%x[result]], #2\n"
+      "st1 {v0.b}[2], [%x[result]], #1\n"
+      "st1 {v3.h}[0], [x0], #2\n"
+      "st1 {v3.b}[2], [x0], #1\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 2, 4,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 2, 4, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+      "mov v6.16b, v3.16b\n"
+      "mov v7.16b, v4.16b\n"
+
+      // 2x4 lanes loop.
+      "1:"
+
+      "ld1 {v10.8b, v11.8b, v12.8b, v13.8b}, [%x[rhs]], #32\n"
+      "ld1 {v8.8b}, [%x[lhs]], #8\n"
+      "umull v14.8h, v8.8b, v10.8b\n"
+      "ld1 {v9.8b}, [%x[lhs]], #8\n"
+      "umull v15.8h, v8.8b, v11.8b\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v16.8h, v8.8b, v12.8b\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v17.8h, v8.8b, v13.8b\n"
+      "umull v18.8h, v9.8b, v10.8b\n"
+      "uadalp v0.4s, v14.8h\n"
+      "uadalp v1.4s, v15.8h\n"
+      "uadalp v2.4s, v16.8h\n"
+      "umull v14.8h, v9.8b, v11.8b\n"
+      "umull v15.8h, v9.8b, v12.8b\n"
+      "umull v16.8h, v9.8b, v13.8b\n"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "uadalp v3.4s, v17.8h\n"
+      "uadalp v4.4s, v18.8h\n"
+      "uadalp v5.4s, v14.8h\n"
+      "uadalp v6.4s, v15.8h\n"
+      "uadalp v7.4s, v16.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v8.4s}, [%x[lhs]], #16\n"
+      "ld1 {v9.4s}, [%x[rhs]], #16\n"
+      "dup v10.4s, %w[multiplicative_offset]\n"
+      "dup v11.4s, %w[rounding_offset]\n"
+      "dup v12.4s, %w[shift]\n"
+      "dup v13.4s, v8.s[0]\n"
+      "dup v8.4s, v8.s[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v6.4s, v6.4s, v7.4s\n"
+      "addp v4.4s, v4.4s, v6.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v13.4s\n"
+      "add v4.4s, v4.4s, v8.4s\n"
+      "add v0.4s, v0.4s, v9.4s\n"
+      "add v4.4s, v4.4s, v9.4s\n"
+      "mul v0.4s, v0.4s, v10.4s\n"
+      "mul v4.4s, v4.4s, v10.4s\n"
+      "add v0.4s, v0.4s, v11.4s\n"
+      "add v4.4s, v4.4s, v11.4s\n"
+      "sshl v0.4s, v0.4s, v12.4s\n"
+      "sshl v4.4s, v4.4s, v12.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn v4.4h, v4.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun v4.8b, v4.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.s}[0], [%x[result]], #4\n"
+      "st1 {v4.s}[0], [x0], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 3, 1,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 3, 1, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v3.2s, v4.2s, v5.2s}, [%x[lhs]], #24\n"
+      "ld1 {v6.2s}, [%x[rhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v7.8h, v6.8b, v3.8b\n"
+      "umull v8.8h, v6.8b, v4.8b\n"
+      "umull v9.8h, v6.8b, v5.8b\n"
+      "uadalp v0.4s, v7.8h\n"
+      "uadalp v1.4s, v8.8h\n"
+      "uadalp v2.4s, v9.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[multiplicative_offset]\n"
+      "dup v7.4s, %w[rounding_offset]\n"
+      "dup v8.4s, %w[shift]\n"
+      "dup v3.4s, v4.s[0]\n"
+      "dup v9.4s, v4.s[1]\n"
+      "dup v4.4s, v4.s[2]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v1.4s, v1.4s, v1.4s\n"
+      "addp v1.4s, v1.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v3.4s\n"
+      "add v1.4s, v1.4s, v9.4s\n"
+      "add v2.4s, v2.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "add v1.4s, v1.4s, v5.4s\n"
+      "add v2.4s, v2.4s, v5.4s\n"
+      "mul v0.4s, v0.4s, v6.4s\n"
+      "mul v1.4s, v1.4s, v6.4s\n"
+      "mul v2.4s, v2.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v1.4s, v1.4s, v7.4s\n"
+      "add v2.4s, v2.4s, v7.4s\n"
+      "sshl v0.4s, v0.4s, v8.4s\n"
+      "sshl v1.4s, v1.4s, v8.4s\n"
+      "sshl v2.4s, v2.4s, v8.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn v1.4h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun v1.8b, v1.8h\n"
+      "sqxtun v2.8b, v2.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.b}[0], [%x[result]], #1\n"
+      "st1 {v1.b}[0], [x0], #1\n"
+      "st1 {v2.b}[0], [x1], #1\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 3, 2,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 3, 2, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v6.2s, v7.2s, v8.2s}, [%x[lhs]], #24\n"
+      "ld1 {v9.2s, v10.2s}, [%x[rhs]], #16\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v11.8h, v9.8b, v6.8b\n"
+      "umull v12.8h, v10.8b, v6.8b\n"
+      "umull v13.8h, v9.8b, v7.8b\n"
+      "umull v14.8h, v10.8b, v7.8b\n"
+      "umull v15.8h, v9.8b, v8.8b\n"
+      "umull v16.8h, v10.8b, v8.8b\n"
+      "uadalp v0.4s, v11.8h\n"
+      "uadalp v1.4s, v12.8h\n"
+      "uadalp v2.4s, v13.8h\n"
+      "uadalp v3.4s, v14.8h\n"
+      "uadalp v4.4s, v15.8h\n"
+      "uadalp v5.4s, v16.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v6.4s}, [%x[lhs]], #16\n"
+      "ld1 {v7.4s}, [%x[rhs]], #16\n"
+      "dup v8.4s, %w[multiplicative_offset]\n"
+      "dup v9.4s, %w[rounding_offset]\n"
+      "dup v10.4s, %w[shift]\n"
+      "dup v11.4s, v6.s[0]\n"
+      "dup v12.4s, v6.s[1]\n"
+      "dup v6.4s, v6.s[2]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v4.4s, v4.4s, v4.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v11.4s\n"
+      "add v2.4s, v2.4s, v12.4s\n"
+      "add v4.4s, v4.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v2.4s, v2.4s, v7.4s\n"
+      "add v4.4s, v4.4s, v7.4s\n"
+      "mul v0.4s, v0.4s, v8.4s\n"
+      "mul v2.4s, v2.4s, v8.4s\n"
+      "mul v4.4s, v4.4s, v8.4s\n"
+      "add v0.4s, v0.4s, v9.4s\n"
+      "add v2.4s, v2.4s, v9.4s\n"
+      "add v4.4s, v4.4s, v9.4s\n"
+      "sshl v0.4s, v0.4s, v10.4s\n"
+      "sshl v2.4s, v2.4s, v10.4s\n"
+      "sshl v4.4s, v4.4s, v10.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn v4.4h, v4.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun v2.8b, v2.8h\n"
+      "sqxtun v4.8b, v4.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.h}[0], [%x[result]], #2\n"
+      "st1 {v2.h}[0], [x0], #2\n"
+      "st1 {v4.h}[0], [x1], #2\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "v15", "v16", "cc", "memory");
+}
+
+template <>
+inline void
+MulKernel<uint8_t, uint8_t, QuantizedStaticPreprocessed, RowMajor, 3, 3,
+          8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                       const FusedKernelParams<QuantizedStaticPreprocessed,
+                                               RowMajor>& params,
+                       uint8_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedRowMajor<uint8_t, uint8_t, "
+               "QuantizedStaticPreprocessed, RowMajor, 3, 3, 8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+      "mov v6.16b, v3.16b\n"
+      "mov v7.16b, v4.16b\n"
+      "mov v8.16b, v5.16b\n"
+
+      // 3x3 lanes loop.
+      "1:"
+
+      "ld1 {v12.8b, v13.8b, v14.8b}, [%x[rhs]], #24\n"
+      "ld1 {v9.8b}, [%x[lhs]], #8\n"
+      "umull v15.8h, v9.8b, v12.8b\n"
+      "ld1 {v10.8b}, [%x[lhs]], #8\n"
+      "umull v16.8h, v9.8b, v13.8b\n"
+      "ld1 {v11.8b}, [%x[lhs]], #8\n"
+      "umull v17.8h, v9.8b, v14.8b\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v18.8h, v10.8b, v12.8b\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "uadalp v0.4s, v15.8h\n"
+      "uadalp v1.4s, v16.8h\n"
+      "uadalp v2.4s, v17.8h\n"
+      "uadalp v3.4s, v18.8h\n"
+      "umull v15.8h, v10.8b, v13.8b\n"
+      "umull v16.8h, v10.8b, v14.8b\n"
+      "umull v17.8h, v11.8b, v12.8b\n"
+      "umull v18.8h, v11.8b, v13.8b\n"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "umull v9.8h, v11.8b, v14.8b\n"
+      "uadalp v4.4s, v15.8h\n"
+      "uadalp v5.4s, v16.8h\n"
+      "uadalp v6.4s, v17.8h\n"
+      "uadalp v7.4s, v18.8h\n"
+      "uadalp v8.4s, v9.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantization::Prepare
+      "ld1 {v9.4s}, [%x[lhs]], #16\n"
+      "ld1 {v10.4s}, [%x[rhs]], #16\n"
+      "dup v11.4s, %w[multiplicative_offset]\n"
+      "dup v12.4s, %w[rounding_offset]\n"
+      "dup v13.4s, %w[shift]\n"
+      "dup v14.4s, v9.s[0]\n"
+      "dup v15.4s, v9.s[1]\n"
+      "dup v9.4s, v9.s[2]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v3.4s, v3.4s, v4.4s\n"
+      "addp v5.4s, v5.4s, v5.4s\n"
+      "addp v3.4s, v3.4s, v5.4s\n"
+      "addp v6.4s, v6.4s, v7.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v6.4s, v6.4s, v8.4s\n"
+
+      // StaticQuantization::Transform
+      "add v0.4s, v0.4s, v14.4s\n"
+      "add v3.4s, v3.4s, v15.4s\n"
+      "add v6.4s, v6.4s, v9.4s\n"
+      "add v0.4s, v0.4s, v10.4s\n"
+      "add v3.4s, v3.4s, v10.4s\n"
+      "add v6.4s, v6.4s, v10.4s\n"
+      "mul v0.4s, v0.4s, v11.4s\n"
+      "mul v3.4s, v3.4s, v11.4s\n"
+      "mul v6.4s, v6.4s, v11.4s\n"
+      "add v0.4s, v0.4s, v12.4s\n"
+      "add v3.4s, v3.4s, v12.4s\n"
+      "add v6.4s, v6.4s, v12.4s\n"
+      "sshl v0.4s, v0.4s, v13.4s\n"
+      "sshl v3.4s, v3.4s, v13.4s\n"
+      "sshl v6.4s, v6.4s, v13.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn v3.4h, v3.4s\n"
+      "sqxtn v6.4h, v6.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun v3.8b, v3.8h\n"
+      "sqxtun v6.8b, v6.8h\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.h}[0], [%x[result]], #2\n"
+      "st1 {v0.b}[2], [%x[result]], #1\n"
+      "st1 {v3.h}[0], [x0], #2\n"
+      "st1 {v3.b}[2], [x0], #1\n"
+      "st1 {v6.h}[0], [x1], #2\n"
+      "st1 {v6.b}[2], [x1], #1\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [multiplicative_offset] "r"(params.kernel.multiplicative_offset),
+        [shift] "r"(params.kernel.shift),
+        [stride] "r"(params.output_stream.stride),
+        [rounding_offset] "r"(params.kernel.rounding_offset)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "cc",
+        "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 1,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 1, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v1.2s}, [%x[lhs]], #8\n"
+      "ld1 {v2.2s}, [%x[rhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v3.8h, v2.8b, v1.8b\n"
+      "uadalp v0.4s, v3.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v4.4s, v4.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.s}[0], [%x[result]], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 2,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 2, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v2.2s}, [%x[lhs]], #8\n"
+      "ld1 {v3.2s, v4.2s}, [%x[rhs]], #16\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v5.8h, v3.8b, v2.8b\n"
+      "umull v6.8h, v4.8b, v2.8b\n"
+      "uadalp v0.4s, v5.8h\n"
+      "uadalp v1.4s, v6.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v4.4s, v4.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 3,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 3, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v3.2s}, [%x[lhs]], #8\n"
+      "ld1 {v4.2s, v5.2s, v6.2s}, [%x[rhs]], #24\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v7.8h, v4.8b, v3.8b\n"
+      "umull v8.8h, v5.8b, v3.8b\n"
+      "umull v9.8h, v6.8b, v3.8b\n"
+      "uadalp v0.4s, v7.8h\n"
+      "uadalp v1.4s, v8.8h\n"
+      "uadalp v2.4s, v9.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v4.4s, v4.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      "st1 {v0.s}[2], [%x[result]], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "cc",
+        "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 4,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 4, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v4.2s}, [%x[lhs]], #8\n"
+      "ld1 {v5.2s, v6.2s, v7.2s, v8.2s}, [%x[rhs]], #32\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v9.8h, v5.8b, v4.8b\n"
+      "umull v10.8h, v6.8b, v4.8b\n"
+      "umull v11.8h, v7.8b, v4.8b\n"
+      "umull v12.8h, v8.8b, v4.8b\n"
+      "uadalp v0.4s, v9.8h\n"
+      "uadalp v1.4s, v10.8h\n"
+      "uadalp v2.4s, v11.8h\n"
+      "uadalp v3.4s, v12.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v4.4s, v4.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.4s}, [%x[result]], #16\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 5,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 5, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v5.2s, v6.2s, v7.2s, v8.2s}, [%x[rhs]], #32\n"
+      "ld1 {v9.2s}, [%x[lhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v10.8h, v5.8b, v9.8b\n"
+      "umull v11.8h, v6.8b, v9.8b\n"
+      "umull v12.8h, v7.8b, v9.8b\n"
+      "umull v13.8h, v8.8b, v9.8b\n"
+      "ld1 {v5.2s}, [%x[rhs]], #8\n"
+      "prfm pldl1keep, [%x[rhs], #128]\n"
+      "uadalp v0.4s, v10.8h\n"
+      "uadalp v1.4s, v11.8h\n"
+      "uadalp v2.4s, v12.8h\n"
+      "uadalp v3.4s, v13.8h\n"
+      "umull v10.8h, v5.8b, v9.8b\n"
+      "uadalp v4.4s, v10.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v5.4s}, [%x[lhs]], #16\n"
+      "ld1 {v6.4s, v7.4s}, [%x[rhs]], #32\n"
+      "dup v5.4s, v5.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v4.4s, v4.4s, v4.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v1.4s, v4.4s, v4.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v5.4s\n"
+      "add v1.4s, v1.4s, v5.4s\n"
+      "add v0.4s, v0.4s, v6.4s\n"
+      "add v1.4s, v1.4s, v7.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.4s}, [%x[result]], #16\n"
+      "st1 {v1.s}[0], [%x[result]], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 6,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 6, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v6.2s, v7.2s, v8.2s, v9.2s}, [%x[rhs]], #32\n"
+      "ld1 {v10.2s}, [%x[lhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v11.8h, v6.8b, v10.8b\n"
+      "umull v12.8h, v7.8b, v10.8b\n"
+      "umull v13.8h, v8.8b, v10.8b\n"
+      "umull v14.8h, v9.8b, v10.8b\n"
+      "ld1 {v6.2s, v7.2s}, [%x[rhs]], #16\n"
+      "prfm pldl1keep, [%x[rhs], #128]\n"
+      "uadalp v0.4s, v11.8h\n"
+      "uadalp v1.4s, v12.8h\n"
+      "uadalp v2.4s, v13.8h\n"
+      "uadalp v3.4s, v14.8h\n"
+      "umull v11.8h, v6.8b, v10.8b\n"
+      "umull v12.8h, v7.8b, v10.8b\n"
+      "uadalp v4.4s, v11.8h\n"
+      "uadalp v5.4s, v12.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v6.4s}, [%x[lhs]], #16\n"
+      "ld1 {v7.4s, v8.4s}, [%x[rhs]], #32\n"
+      "dup v6.4s, v6.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v1.4s, v4.4s, v4.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v6.4s\n"
+      "add v1.4s, v1.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v1.4s, v1.4s, v8.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.4s}, [%x[result]], #16\n"
+      "st1 {v1.2s}, [%x[result]], #8\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 7,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 7, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+      "mov v6.16b, v3.16b\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v7.2s, v8.2s, v9.2s, v10.2s}, [%x[rhs]], #32\n"
+      "ld1 {v11.2s}, [%x[lhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v12.8h, v7.8b, v11.8b\n"
+      "umull v13.8h, v8.8b, v11.8b\n"
+      "umull v14.8h, v9.8b, v11.8b\n"
+      "umull v15.8h, v10.8b, v11.8b\n"
+      "ld1 {v7.2s, v8.2s, v9.2s}, [%x[rhs]], #24\n"
+      "prfm pldl1keep, [%x[rhs], #128]\n"
+      "uadalp v0.4s, v12.8h\n"
+      "uadalp v1.4s, v13.8h\n"
+      "uadalp v2.4s, v14.8h\n"
+      "uadalp v3.4s, v15.8h\n"
+      "umull v12.8h, v7.8b, v11.8b\n"
+      "umull v13.8h, v8.8b, v11.8b\n"
+      "umull v14.8h, v9.8b, v11.8b\n"
+      "uadalp v4.4s, v12.8h\n"
+      "uadalp v5.4s, v13.8h\n"
+      "uadalp v6.4s, v14.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v7.4s}, [%x[lhs]], #16\n"
+      "ld1 {v8.4s, v9.4s}, [%x[rhs]], #32\n"
+      "dup v7.4s, v7.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v6.4s, v6.4s, v6.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v1.4s, v4.4s, v6.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v1.4s, v1.4s, v7.4s\n"
+      "add v0.4s, v0.4s, v8.4s\n"
+      "add v1.4s, v1.4s, v9.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.4s}, [%x[result]], #16\n"
+      "st1 {v1.2s}, [%x[result]], #8\n"
+      "st1 {v1.s}[2], [%x[result]], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 8,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 1, 8, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+      "mov v6.16b, v3.16b\n"
+      "mov v7.16b, v4.16b\n"
+
+      // 1x8 lanes loop.
+      "1:"
+
+      "ld1 {v9.2s, v10.2s, v11.2s, v12.2s}, [%x[rhs]], #32\n"
+      "ld1 {v8.2s}, [%x[lhs]], #8\n"
+      "umull v13.8h, v8.8b, v9.8b\n"
+      "umull v14.8h, v8.8b, v10.8b\n"
+      "umull v15.8h, v8.8b, v11.8b\n"
+      "umull v16.8h, v8.8b, v12.8b\n"
+      "ld1 {v9.2s, v10.2s, v11.2s, v12.2s}, [%x[rhs]], #32\n"
+      "uadalp v0.4s, v13.8h\n"
+      "uadalp v1.4s, v14.8h\n"
+      "uadalp v2.4s, v15.8h\n"
+      "uadalp v3.4s, v16.8h\n"
+      "prfm pldl1keep, [%x[rhs], #256]\n"
+      "umull v17.8h, v8.8b, v9.8b\n"
+      "umull v13.8h, v8.8b, v10.8b\n"
+      "umull v14.8h, v8.8b, v11.8b\n"
+      "umull v15.8h, v8.8b, v12.8b\n"
+      "prfm pldl1keep, [%x[lhs], #32]\n"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "uadalp v4.4s, v17.8h\n"
+      "uadalp v5.4s, v13.8h\n"
+      "uadalp v6.4s, v14.8h\n"
+      "uadalp v7.4s, v15.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v8.4s}, [%x[lhs]], #16\n"
+      "ld1 {v9.4s, v10.4s}, [%x[rhs]], #32\n"
+      "dup v8.4s, v8.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v6.4s, v6.4s, v7.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v1.4s, v4.4s, v6.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v8.4s\n"
+      "add v1.4s, v1.4s, v8.4s\n"
+      "add v0.4s, v0.4s, v9.4s\n"
+      "add v1.4s, v1.4s, v10.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.4s, v1.4s}, [%x[result]], #32\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
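+// Editorial sketch (illustrative only): a scalar reference for what the 1xN
+// int32 kernels above compute. The function name and signature are
+// hypothetical and nothing in the library calls it; it assumes the layout the
+// assembly consumes: count is a positive multiple of 8, rhs packs its n
+// columns interleaved in 8-byte depth chunks, and one int32 correction term
+// per lhs row / rhs column follows the packed uint8 data.
+inline void ReferenceMulKernel1xNInt32Sketch(const uint8_t* lhs,
+                                             const uint8_t* rhs, int n,
+                                             int count, int32_t* result) {
+  // Correction terms appended after the packed 8-bit blocks.
+  const int32_t lhs_term = *reinterpret_cast<const int32_t*>(lhs + count);
+  const int32_t* rhs_terms =
+      reinterpret_cast<const int32_t*>(rhs + n * count);
+  for (int j = 0; j < n; ++j) {
+    int32_t acc = 0;
+    for (int k = 0; k < count; ++k) {
+      // rhs is packed as [chunk 0: col 0..n-1][chunk 1: col 0..n-1]...,
+      // 8 bytes per column per chunk.
+      const int chunk = k / 8;
+      const int lane = k % 8;
+      acc += static_cast<int32_t>(lhs[k]) *
+             static_cast<int32_t>(rhs[(chunk * n + j) * 8 + lane]);
+    }
+    result[j] = acc + lhs_term + rhs_terms[j];
+  }
+}
+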
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 1,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 1, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v2.2s, v3.2s}, [%x[lhs]], #16\n"
+      "ld1 {v4.2s}, [%x[rhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v5.8h, v4.8b, v2.8b\n"
+      "umull v6.8h, v4.8b, v3.8b\n"
+      "uadalp v0.4s, v5.8h\n"
+      "uadalp v1.4s, v6.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v2.4s, v4.s[0]\n"
+      "dup v4.4s, v4.s[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v1.4s, v1.4s, v1.4s\n"
+      "addp v1.4s, v1.4s, v1.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v2.4s\n"
+      "add v1.4s, v1.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "add v1.4s, v1.4s, v5.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.s}[0], [%x[result]], #4\n"
+      "st1 {v1.s}[0], [x0], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
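+// In the multi-row specializations (2xN and 3xN), such as the kernel above
+// and those that follow, RowMajorOutput::Prepare adds %x[stride] directly to
+// the result address to form the pointer to the second output row (x0) and,
+// for three-row kernels, the third (x1); the stride is therefore a byte
+// offset. The 1xN kernels list [stride] as an input operand but never
+// reference it.
+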
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 2,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 2, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v4.2s, v5.2s}, [%x[lhs]], #16\n"
+      "ld1 {v6.2s, v7.2s}, [%x[rhs]], #16\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v8.8h, v6.8b, v4.8b\n"
+      "umull v9.8h, v7.8b, v4.8b\n"
+      "umull v10.8h, v6.8b, v5.8b\n"
+      "umull v11.8h, v7.8b, v5.8b\n"
+      "uadalp v0.4s, v8.8h\n"
+      "uadalp v1.4s, v9.8h\n"
+      "uadalp v2.4s, v10.8h\n"
+      "uadalp v3.4s, v11.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, v4.s[0]\n"
+      "dup v4.4s, v4.s[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v6.4s\n"
+      "add v2.4s, v2.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "add v2.4s, v2.4s, v5.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      "st1 {v2.2s}, [x0], #8\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 3,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 3, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v6.2s, v7.2s}, [%x[lhs]], #16\n"
+      "ld1 {v8.2s, v9.2s, v10.2s}, [%x[rhs]], #24\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v11.8h, v8.8b, v6.8b\n"
+      "umull v12.8h, v9.8b, v6.8b\n"
+      "umull v13.8h, v10.8b, v6.8b\n"
+      "umull v14.8h, v8.8b, v7.8b\n"
+      "umull v15.8h, v9.8b, v7.8b\n"
+      "umull v16.8h, v10.8b, v7.8b\n"
+      "uadalp v0.4s, v11.8h\n"
+      "uadalp v1.4s, v12.8h\n"
+      "uadalp v2.4s, v13.8h\n"
+      "uadalp v3.4s, v14.8h\n"
+      "uadalp v4.4s, v15.8h\n"
+      "uadalp v5.4s, v16.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v6.4s}, [%x[lhs]], #16\n"
+      "ld1 {v7.4s}, [%x[rhs]], #16\n"
+      "dup v8.4s, v6.s[0]\n"
+      "dup v6.4s, v6.s[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v3.4s, v3.4s, v4.4s\n"
+      "addp v5.4s, v5.4s, v5.4s\n"
+      "addp v3.4s, v3.4s, v5.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v8.4s\n"
+      "add v3.4s, v3.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v3.4s, v3.4s, v7.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      "st1 {v0.s}[2], [%x[result]], #4\n"
+      "st1 {v3.2s}, [x0], #8\n"
+      "st1 {v3.s}[2], [x0], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 4,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 2, 4, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+      "mov v6.16b, v3.16b\n"
+      "mov v7.16b, v4.16b\n"
+
+      // 2x4 lanes loop.
+      "1:"
+
+      "ld1 {v10.8b, v11.8b, v12.8b, v13.8b}, [%x[rhs]], #32\n"
+      "ld1 {v8.8b}, [%x[lhs]], #8\n"
+      "umull v14.8h, v8.8b, v10.8b\n"
+      "ld1 {v9.8b}, [%x[lhs]], #8\n"
+      "umull v15.8h, v8.8b, v11.8b\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v16.8h, v8.8b, v12.8b\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v17.8h, v8.8b, v13.8b\n"
+      "umull v18.8h, v9.8b, v10.8b\n"
+      "uadalp v0.4s, v14.8h\n"
+      "uadalp v1.4s, v15.8h\n"
+      "uadalp v2.4s, v16.8h\n"
+      "umull v14.8h, v9.8b, v11.8b\n"
+      "umull v15.8h, v9.8b, v12.8b\n"
+      "umull v16.8h, v9.8b, v13.8b\n"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "uadalp v3.4s, v17.8h\n"
+      "uadalp v4.4s, v18.8h\n"
+      "uadalp v5.4s, v14.8h\n"
+      "uadalp v6.4s, v15.8h\n"
+      "uadalp v7.4s, v16.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v8.4s}, [%x[lhs]], #16\n"
+      "ld1 {v9.4s}, [%x[rhs]], #16\n"
+      "dup v10.4s, v8.s[0]\n"
+      "dup v8.4s, v8.s[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v6.4s, v6.4s, v7.4s\n"
+      "addp v4.4s, v4.4s, v6.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v10.4s\n"
+      "add v4.4s, v4.4s, v8.4s\n"
+      "add v0.4s, v0.4s, v9.4s\n"
+      "add v4.4s, v4.4s, v9.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.4s}, [%x[result]], #16\n"
+      "st1 {v4.4s}, [x0], #16\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 3, 1,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 3, 1, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v3.2s, v4.2s, v5.2s}, [%x[lhs]], #24\n"
+      "ld1 {v6.2s}, [%x[rhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v7.8h, v6.8b, v3.8b\n"
+      "umull v8.8h, v6.8b, v4.8b\n"
+      "umull v9.8h, v6.8b, v5.8b\n"
+      "uadalp v0.4s, v7.8h\n"
+      "uadalp v1.4s, v8.8h\n"
+      "uadalp v2.4s, v9.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v3.4s, v4.s[0]\n"
+      "dup v6.4s, v4.s[1]\n"
+      "dup v4.4s, v4.s[2]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v1.4s, v1.4s, v1.4s\n"
+      "addp v1.4s, v1.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v3.4s\n"
+      "add v1.4s, v1.4s, v6.4s\n"
+      "add v2.4s, v2.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "add v1.4s, v1.4s, v5.4s\n"
+      "add v2.4s, v2.4s, v5.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.s}[0], [%x[result]], #4\n"
+      "st1 {v1.s}[0], [x0], #4\n"
+      "st1 {v2.s}[0], [x1], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 3, 2,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 3, 2, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v6.2s, v7.2s, v8.2s}, [%x[lhs]], #24\n"
+      "ld1 {v9.2s, v10.2s}, [%x[rhs]], #16\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v11.8h, v9.8b, v6.8b\n"
+      "umull v12.8h, v10.8b, v6.8b\n"
+      "umull v13.8h, v9.8b, v7.8b\n"
+      "umull v14.8h, v10.8b, v7.8b\n"
+      "umull v15.8h, v9.8b, v8.8b\n"
+      "umull v16.8h, v10.8b, v8.8b\n"
+      "uadalp v0.4s, v11.8h\n"
+      "uadalp v1.4s, v12.8h\n"
+      "uadalp v2.4s, v13.8h\n"
+      "uadalp v3.4s, v14.8h\n"
+      "uadalp v4.4s, v15.8h\n"
+      "uadalp v5.4s, v16.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v6.4s}, [%x[lhs]], #16\n"
+      "ld1 {v7.4s}, [%x[rhs]], #16\n"
+      "dup v8.4s, v6.s[0]\n"
+      "dup v9.4s, v6.s[1]\n"
+      "dup v6.4s, v6.s[2]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v4.4s, v4.4s, v4.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v8.4s\n"
+      "add v2.4s, v2.4s, v9.4s\n"
+      "add v4.4s, v4.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v2.4s, v2.4s, v7.4s\n"
+      "add v4.4s, v4.4s, v7.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      "st1 {v2.2s}, [x0], #8\n"
+      "st1 {v4.2s}, [x1], #8\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "v15", "v16", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, int32_t, QuantizedStaticPreprocessedAsInt32, RowMajor, 3, 3,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsInt32,
+                                         RowMajor>& params,
+                 int32_t* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsInt32RowMajor<uint8_t, int32_t, "
+               "QuantizedStaticPreprocessedAsInt32, RowMajor, 3, 3, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+      "mov v6.16b, v3.16b\n"
+      "mov v7.16b, v4.16b\n"
+      "mov v8.16b, v5.16b\n"
+
+      // 3x3 lanes loop.
+      "1:"
+
+      "ld1 {v12.8b, v13.8b, v14.8b}, [%x[rhs]], #24\n"
+      "ld1 {v9.8b}, [%x[lhs]], #8\n"
+      "umull v15.8h, v9.8b, v12.8b\n"
+      "ld1 {v10.8b}, [%x[lhs]], #8\n"
+      "umull v16.8h, v9.8b, v13.8b\n"
+      "ld1 {v11.8b}, [%x[lhs]], #8\n"
+      "umull v17.8h, v9.8b, v14.8b\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v18.8h, v10.8b, v12.8b\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "uadalp v0.4s, v15.8h\n"
+      "uadalp v1.4s, v16.8h\n"
+      "uadalp v2.4s, v17.8h\n"
+      "uadalp v3.4s, v18.8h\n"
+      "umull v15.8h, v10.8b, v13.8b\n"
+      "umull v16.8h, v10.8b, v14.8b\n"
+      "umull v17.8h, v11.8b, v12.8b\n"
+      "umull v18.8h, v11.8b, v13.8b\n"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "umull v9.8h, v11.8b, v14.8b\n"
+      "uadalp v4.4s, v15.8h\n"
+      "uadalp v5.4s, v16.8h\n"
+      "uadalp v6.4s, v17.8h\n"
+      "uadalp v7.4s, v18.8h\n"
+      "uadalp v8.4s, v9.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationInt32::Prepare
+      "ld1 {v9.4s}, [%x[lhs]], #16\n"
+      "ld1 {v10.4s}, [%x[rhs]], #16\n"
+      "dup v11.4s, v9.s[0]\n"
+      "dup v12.4s, v9.s[1]\n"
+      "dup v9.4s, v9.s[2]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v3.4s, v3.4s, v4.4s\n"
+      "addp v5.4s, v5.4s, v5.4s\n"
+      "addp v3.4s, v3.4s, v5.4s\n"
+      "addp v6.4s, v6.4s, v7.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v6.4s, v6.4s, v8.4s\n"
+
+      // StaticQuantizationInt32::Transform
+      "add v0.4s, v0.4s, v11.4s\n"
+      "add v3.4s, v3.4s, v12.4s\n"
+      "add v6.4s, v6.4s, v9.4s\n"
+      "add v0.4s, v0.4s, v10.4s\n"
+      "add v3.4s, v3.4s, v10.4s\n"
+      "add v6.4s, v6.4s, v10.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      "st1 {v0.s}[2], [%x[result]], #4\n"
+      "st1 {v3.2s}, [x0], #8\n"
+      "st1 {v3.s}[2], [x0], #4\n"
+      "st1 {v6.2s}, [x1], #8\n"
+      "st1 {v6.s}[2], [x1], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "cc",
+        "memory");
+}
+
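+// The kernels below are the float-output variants
+// (QuantizedStaticPreprocessedAsFloat). The uint8 accumulation loop and the
+// addition of the int32 correction terms are the same as above; the final
+// transform then converts the sums to float with scvtf and multiplies by
+// params.kernel.scale, broadcast into a vector register via
+// "dup v_.4s, %w[scale]", before the row-major store.
+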
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 1,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 1, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v1.2s}, [%x[lhs]], #8\n"
+      "ld1 {v2.2s}, [%x[rhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v3.8h, v2.8b, v1.8b\n"
+      "uadalp v0.4s, v3.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[scale]\n"
+      "dup v4.4s, v4.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.s}[0], [%x[result]], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 2,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 2, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v2.2s}, [%x[lhs]], #8\n"
+      "ld1 {v3.2s, v4.2s}, [%x[rhs]], #16\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v5.8h, v3.8b, v2.8b\n"
+      "umull v6.8h, v4.8b, v2.8b\n"
+      "uadalp v0.4s, v5.8h\n"
+      "uadalp v1.4s, v6.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[scale]\n"
+      "dup v4.4s, v4.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 3,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 3, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v3.2s}, [%x[lhs]], #8\n"
+      "ld1 {v4.2s, v5.2s, v6.2s}, [%x[rhs]], #24\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v7.8h, v4.8b, v3.8b\n"
+      "umull v8.8h, v5.8b, v3.8b\n"
+      "umull v9.8h, v6.8b, v3.8b\n"
+      "uadalp v0.4s, v7.8h\n"
+      "uadalp v1.4s, v8.8h\n"
+      "uadalp v2.4s, v9.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[scale]\n"
+      "dup v4.4s, v4.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      "st1 {v0.s}[2], [%x[result]], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "cc",
+        "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 4,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 4, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v4.2s}, [%x[lhs]], #8\n"
+      "ld1 {v5.2s, v6.2s, v7.2s, v8.2s}, [%x[rhs]], #32\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v9.8h, v5.8b, v4.8b\n"
+      "umull v10.8h, v6.8b, v4.8b\n"
+      "umull v11.8h, v7.8b, v4.8b\n"
+      "umull v12.8h, v8.8b, v4.8b\n"
+      "uadalp v0.4s, v9.8h\n"
+      "uadalp v1.4s, v10.8h\n"
+      "uadalp v2.4s, v11.8h\n"
+      "uadalp v3.4s, v12.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[scale]\n"
+      "dup v4.4s, v4.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.4s}, [%x[result]], #16\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 5,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 5, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v5.2s, v6.2s, v7.2s, v8.2s}, [%x[rhs]], #32\n"
+      "ld1 {v9.2s}, [%x[lhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v10.8h, v5.8b, v9.8b\n"
+      "umull v11.8h, v6.8b, v9.8b\n"
+      "umull v12.8h, v7.8b, v9.8b\n"
+      "umull v13.8h, v8.8b, v9.8b\n"
+      "ld1 {v5.2s}, [%x[rhs]], #8\n"
+      "prfm pldl1keep, [%x[rhs], #128]\n"
+      "uadalp v0.4s, v10.8h\n"
+      "uadalp v1.4s, v11.8h\n"
+      "uadalp v2.4s, v12.8h\n"
+      "uadalp v3.4s, v13.8h\n"
+      "umull v10.8h, v5.8b, v9.8b\n"
+      "uadalp v4.4s, v10.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v5.4s}, [%x[lhs]], #16\n"
+      "ld1 {v6.4s, v7.4s}, [%x[rhs]], #32\n"
+      "dup v8.4s, %w[scale]\n"
+      "dup v5.4s, v5.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v4.4s, v4.4s, v4.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v1.4s, v4.4s, v4.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v5.4s\n"
+      "add v1.4s, v1.4s, v5.4s\n"
+      "add v0.4s, v0.4s, v6.4s\n"
+      "add v1.4s, v1.4s, v7.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.4s}, [%x[result]], #16\n"
+      "st1 {v1.s}[0], [%x[result]], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 6,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 6, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v6.2s, v7.2s, v8.2s, v9.2s}, [%x[rhs]], #32\n"
+      "ld1 {v10.2s}, [%x[lhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v11.8h, v6.8b, v10.8b\n"
+      "umull v12.8h, v7.8b, v10.8b\n"
+      "umull v13.8h, v8.8b, v10.8b\n"
+      "umull v14.8h, v9.8b, v10.8b\n"
+      "ld1 {v6.2s, v7.2s}, [%x[rhs]], #16\n"
+      "prfm pldl1keep, [%x[rhs], #128]\n"
+      "uadalp v0.4s, v11.8h\n"
+      "uadalp v1.4s, v12.8h\n"
+      "uadalp v2.4s, v13.8h\n"
+      "uadalp v3.4s, v14.8h\n"
+      "umull v11.8h, v6.8b, v10.8b\n"
+      "umull v12.8h, v7.8b, v10.8b\n"
+      "uadalp v4.4s, v11.8h\n"
+      "uadalp v5.4s, v12.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v6.4s}, [%x[lhs]], #16\n"
+      "ld1 {v7.4s, v8.4s}, [%x[rhs]], #32\n"
+      "dup v9.4s, %w[scale]\n"
+      "dup v6.4s, v6.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v1.4s, v4.4s, v4.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v6.4s\n"
+      "add v1.4s, v1.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v1.4s, v1.4s, v8.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.4s}, [%x[result]], #16\n"
+      "st1 {v1.2s}, [%x[result]], #8\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 7,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 7, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+      "mov v6.16b, v3.16b\n"
+
+      // General 1xM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v7.2s, v8.2s, v9.2s, v10.2s}, [%x[rhs]], #32\n"
+      "ld1 {v11.2s}, [%x[lhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v12.8h, v7.8b, v11.8b\n"
+      "umull v13.8h, v8.8b, v11.8b\n"
+      "umull v14.8h, v9.8b, v11.8b\n"
+      "umull v15.8h, v10.8b, v11.8b\n"
+      "ld1 {v7.2s, v8.2s, v9.2s}, [%x[rhs]], #24\n"
+      "prfm pldl1keep, [%x[rhs], #128]\n"
+      "uadalp v0.4s, v12.8h\n"
+      "uadalp v1.4s, v13.8h\n"
+      "uadalp v2.4s, v14.8h\n"
+      "uadalp v3.4s, v15.8h\n"
+      "umull v12.8h, v7.8b, v11.8b\n"
+      "umull v13.8h, v8.8b, v11.8b\n"
+      "umull v14.8h, v9.8b, v11.8b\n"
+      "uadalp v4.4s, v12.8h\n"
+      "uadalp v5.4s, v13.8h\n"
+      "uadalp v6.4s, v14.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v7.4s}, [%x[lhs]], #16\n"
+      "ld1 {v8.4s, v9.4s}, [%x[rhs]], #32\n"
+      "dup v10.4s, %w[scale]\n"
+      "dup v7.4s, v7.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v6.4s, v6.4s, v6.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v1.4s, v4.4s, v6.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v1.4s, v1.4s, v7.4s\n"
+      "add v0.4s, v0.4s, v8.4s\n"
+      "add v1.4s, v1.4s, v9.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fmul v0.4s, v0.4s, v10.4s\n"
+      "fmul v1.4s, v1.4s, v10.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.4s}, [%x[result]], #16\n"
+      "st1 {v1.2s}, [%x[result]], #8\n"
+      "st1 {v1.s}[2], [%x[result]], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 8,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 1, 8, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+      "mov v6.16b, v3.16b\n"
+      "mov v7.16b, v4.16b\n"
+
+      // 1x8 lanes loop.
+      "1:"
+
+      "ld1 {v9.2s, v10.2s, v11.2s, v12.2s}, [%x[rhs]], #32\n"
+      "ld1 {v8.2s}, [%x[lhs]], #8\n"
+      "umull v13.8h, v8.8b, v9.8b\n"
+      "umull v14.8h, v8.8b, v10.8b\n"
+      "umull v15.8h, v8.8b, v11.8b\n"
+      "umull v16.8h, v8.8b, v12.8b\n"
+      "ld1 {v9.2s, v10.2s, v11.2s, v12.2s}, [%x[rhs]], #32\n"
+      "uadalp v0.4s, v13.8h\n"
+      "uadalp v1.4s, v14.8h\n"
+      "uadalp v2.4s, v15.8h\n"
+      "uadalp v3.4s, v16.8h\n"
+      "prfm pldl1keep, [%x[rhs], #256]\n"
+      "umull v17.8h, v8.8b, v9.8b\n"
+      "umull v13.8h, v8.8b, v10.8b\n"
+      "umull v14.8h, v8.8b, v11.8b\n"
+      "umull v15.8h, v8.8b, v12.8b\n"
+      "prfm pldl1keep, [%x[lhs], #32]\n"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "uadalp v4.4s, v17.8h\n"
+      "uadalp v5.4s, v13.8h\n"
+      "uadalp v6.4s, v14.8h\n"
+      "uadalp v7.4s, v15.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v8.4s}, [%x[lhs]], #16\n"
+      "ld1 {v9.4s, v10.4s}, [%x[rhs]], #32\n"
+      "dup v11.4s, %w[scale]\n"
+      "dup v8.4s, v8.s[0]\n"
+
+      // RowMajorOutput::Prepare
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v6.4s, v6.4s, v7.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v1.4s, v4.4s, v6.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v8.4s\n"
+      "add v1.4s, v1.4s, v8.4s\n"
+      "add v0.4s, v0.4s, v9.4s\n"
+      "add v1.4s, v1.4s, v10.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fmul v0.4s, v0.4s, v11.4s\n"
+      "fmul v1.4s, v1.4s, v11.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.4s, v1.4s}, [%x[result]], #32\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 1,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 1, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v2.2s, v3.2s}, [%x[lhs]], #16\n"
+      "ld1 {v4.2s}, [%x[rhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v5.8h, v4.8b, v2.8b\n"
+      "umull v6.8h, v4.8b, v3.8b\n"
+      "uadalp v0.4s, v5.8h\n"
+      "uadalp v1.4s, v6.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[scale]\n"
+      "dup v2.4s, v4.s[0]\n"
+      "dup v4.4s, v4.s[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v1.4s, v1.4s, v1.4s\n"
+      "addp v1.4s, v1.4s, v1.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v2.4s\n"
+      "add v1.4s, v1.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "add v1.4s, v1.4s, v5.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.s}[0], [%x[result]], #4\n"
+      "st1 {v1.s}[0], [x0], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 2,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 2, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v4.2s, v5.2s}, [%x[lhs]], #16\n"
+      "ld1 {v6.2s, v7.2s}, [%x[rhs]], #16\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v8.8h, v6.8b, v4.8b\n"
+      "umull v9.8h, v7.8b, v4.8b\n"
+      "umull v10.8h, v6.8b, v5.8b\n"
+      "umull v11.8h, v7.8b, v5.8b\n"
+      "uadalp v0.4s, v8.8h\n"
+      "uadalp v1.4s, v9.8h\n"
+      "uadalp v2.4s, v10.8h\n"
+      "uadalp v3.4s, v11.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[scale]\n"
+      "dup v7.4s, v4.s[0]\n"
+      "dup v4.4s, v4.s[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v2.4s, v2.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "add v2.4s, v2.4s, v5.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      "st1 {v2.2s}, [x0], #8\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 3,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 3, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v6.2s, v7.2s}, [%x[lhs]], #16\n"
+      "ld1 {v8.2s, v9.2s, v10.2s}, [%x[rhs]], #24\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v11.8h, v8.8b, v6.8b\n"
+      "umull v12.8h, v9.8b, v6.8b\n"
+      "umull v13.8h, v10.8b, v6.8b\n"
+      "umull v14.8h, v8.8b, v7.8b\n"
+      "umull v15.8h, v9.8b, v7.8b\n"
+      "umull v16.8h, v10.8b, v7.8b\n"
+      "uadalp v0.4s, v11.8h\n"
+      "uadalp v1.4s, v12.8h\n"
+      "uadalp v2.4s, v13.8h\n"
+      "uadalp v3.4s, v14.8h\n"
+      "uadalp v4.4s, v15.8h\n"
+      "uadalp v5.4s, v16.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v6.4s}, [%x[lhs]], #16\n"
+      "ld1 {v7.4s}, [%x[rhs]], #16\n"
+      "dup v8.4s, %w[scale]\n"
+      "dup v9.4s, v6.s[0]\n"
+      "dup v6.4s, v6.s[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v3.4s, v3.4s, v4.4s\n"
+      "addp v5.4s, v5.4s, v5.4s\n"
+      "addp v3.4s, v3.4s, v5.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v9.4s\n"
+      "add v3.4s, v3.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v3.4s, v3.4s, v7.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      "st1 {v0.s}[2], [%x[result]], #4\n"
+      "st1 {v3.2s}, [x0], #8\n"
+      "st1 {v3.s}[2], [x0], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 4,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 2, 4, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+      "mov v6.16b, v3.16b\n"
+      "mov v7.16b, v4.16b\n"
+
+      // 2x4 lanes loop.
+      "1:"
+
+      "ld1 {v10.8b, v11.8b, v12.8b, v13.8b}, [%x[rhs]], #32\n"
+      "ld1 {v8.8b}, [%x[lhs]], #8\n"
+      "umull v14.8h, v8.8b, v10.8b\n"
+      "ld1 {v9.8b}, [%x[lhs]], #8\n"
+      "umull v15.8h, v8.8b, v11.8b\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v16.8h, v8.8b, v12.8b\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v17.8h, v8.8b, v13.8b\n"
+      "umull v18.8h, v9.8b, v10.8b\n"
+      "uadalp v0.4s, v14.8h\n"
+      "uadalp v1.4s, v15.8h\n"
+      "uadalp v2.4s, v16.8h\n"
+      "umull v14.8h, v9.8b, v11.8b\n"
+      "umull v15.8h, v9.8b, v12.8b\n"
+      "umull v16.8h, v9.8b, v13.8b\n"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "uadalp v3.4s, v17.8h\n"
+      "uadalp v4.4s, v18.8h\n"
+      "uadalp v5.4s, v14.8h\n"
+      "uadalp v6.4s, v15.8h\n"
+      "uadalp v7.4s, v16.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v8.4s}, [%x[lhs]], #16\n"
+      "ld1 {v9.4s}, [%x[rhs]], #16\n"
+      "dup v10.4s, %w[scale]\n"
+      "dup v11.4s, v8.s[0]\n"
+      "dup v8.4s, v8.s[1]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v6.4s, v6.4s, v7.4s\n"
+      "addp v4.4s, v4.4s, v6.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v11.4s\n"
+      "add v4.4s, v4.4s, v8.4s\n"
+      "add v0.4s, v0.4s, v9.4s\n"
+      "add v4.4s, v4.4s, v9.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v10.4s\n"
+      "fmul v4.4s, v4.4s, v10.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.4s}, [%x[result]], #16\n"
+      "st1 {v4.4s}, [x0], #16\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 3, 1,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 3, 1, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v3.2s, v4.2s, v5.2s}, [%x[lhs]], #24\n"
+      "ld1 {v6.2s}, [%x[rhs]], #8\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v7.8h, v6.8b, v3.8b\n"
+      "umull v8.8h, v6.8b, v4.8b\n"
+      "umull v9.8h, v6.8b, v5.8b\n"
+      "uadalp v0.4s, v7.8h\n"
+      "uadalp v1.4s, v8.8h\n"
+      "uadalp v2.4s, v9.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v4.4s}, [%x[lhs]], #16\n"
+      "ld1 {v5.4s}, [%x[rhs]], #16\n"
+      "dup v6.4s, %w[scale]\n"
+      "dup v3.4s, v4.s[0]\n"
+      "dup v7.4s, v4.s[1]\n"
+      "dup v4.4s, v4.s[2]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v1.4s, v1.4s, v1.4s\n"
+      "addp v1.4s, v1.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v3.4s\n"
+      "add v1.4s, v1.4s, v7.4s\n"
+      "add v2.4s, v2.4s, v4.4s\n"
+      "add v0.4s, v0.4s, v5.4s\n"
+      "add v1.4s, v1.4s, v5.4s\n"
+      "add v2.4s, v2.4s, v5.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.s}[0], [%x[result]], #4\n"
+      "st1 {v1.s}[0], [x0], #4\n"
+      "st1 {v2.s}[0], [x1], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 3, 2,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 3, 2, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+
+      // General NxM lanes loop.
+      "1:"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "ld1 {v6.2s, v7.2s, v8.2s}, [%x[lhs]], #24\n"
+      "ld1 {v9.2s, v10.2s}, [%x[rhs]], #16\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "umull v11.8h, v9.8b, v6.8b\n"
+      "umull v12.8h, v10.8b, v6.8b\n"
+      "umull v13.8h, v9.8b, v7.8b\n"
+      "umull v14.8h, v10.8b, v7.8b\n"
+      "umull v15.8h, v9.8b, v8.8b\n"
+      "umull v16.8h, v10.8b, v8.8b\n"
+      "uadalp v0.4s, v11.8h\n"
+      "uadalp v1.4s, v12.8h\n"
+      "uadalp v2.4s, v13.8h\n"
+      "uadalp v3.4s, v14.8h\n"
+      "uadalp v4.4s, v15.8h\n"
+      "uadalp v5.4s, v16.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v6.4s}, [%x[lhs]], #16\n"
+      "ld1 {v7.4s}, [%x[rhs]], #16\n"
+      "dup v8.4s, %w[scale]\n"
+      "dup v9.4s, v6.s[0]\n"
+      "dup v10.4s, v6.s[1]\n"
+      "dup v6.4s, v6.s[2]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v0.4s, v0.4s, v0.4s\n"
+      "addp v2.4s, v2.4s, v3.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v4.4s, v4.4s, v5.4s\n"
+      "addp v4.4s, v4.4s, v4.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v9.4s\n"
+      "add v2.4s, v2.4s, v10.4s\n"
+      "add v4.4s, v4.4s, v6.4s\n"
+      "add v0.4s, v0.4s, v7.4s\n"
+      "add v2.4s, v2.4s, v7.4s\n"
+      "add v4.4s, v4.4s, v7.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v4.4s, v4.4s, v8.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      "st1 {v2.2s}, [x0], #8\n"
+      "st1 {v4.2s}, [x1], #8\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "v15", "v16", "cc", "memory");
+}
+
+template <>
+inline void MulKernel<
+    uint8_t, float, QuantizedStaticPreprocessedAsFloat, RowMajor, 3, 3,
+    8>::Multiply(const uint8_t* lhs, const uint8_t* rhs,
+                 const FusedKernelParams<QuantizedStaticPreprocessedAsFloat,
+                                         RowMajor>& params,
+                 float* result) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") QuantizedStaticPreprocessedAsFloatRowMajor<uint8_t, float, "
+               "QuantizedStaticPreprocessedAsFloat, RowMajor, 3, 3, "
+               "8>::Multiply()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  asm volatile(
+      "prfm pldl1keep, [%x[lhs]]\n"
+      "prfm pldl1keep, [%x[rhs]]\n"
+
+      // Clear aggregators.
+      "movi v0.4s, #0\n"
+      "movi v1.4s, #0\n"
+      "movi v2.4s, #0\n"
+      "mov v3.16b, v0.16b\n"
+      "mov v4.16b, v1.16b\n"
+      "mov v5.16b, v2.16b\n"
+      "mov v6.16b, v3.16b\n"
+      "mov v7.16b, v4.16b\n"
+      "mov v8.16b, v5.16b\n"
+
+      // 3x3 lanes loop.
+      "1:"
+
+      "ld1 {v12.8b, v13.8b, v14.8b}, [%x[rhs]], #24\n"
+      "ld1 {v9.8b}, [%x[lhs]], #8\n"
+      "umull v15.8h, v9.8b, v12.8b\n"
+      "ld1 {v10.8b}, [%x[lhs]], #8\n"
+      "umull v16.8h, v9.8b, v13.8b\n"
+      "ld1 {v11.8b}, [%x[lhs]], #8\n"
+      "umull v17.8h, v9.8b, v14.8b\n"
+      "prfm pldl1keep, [%x[lhs], #64]\n"
+      "umull v18.8h, v10.8b, v12.8b\n"
+      "prfm pldl1keep, [%x[rhs], #64]\n"
+      "uadalp v0.4s, v15.8h\n"
+      "uadalp v1.4s, v16.8h\n"
+      "uadalp v2.4s, v17.8h\n"
+      "uadalp v3.4s, v18.8h\n"
+      "umull v15.8h, v10.8b, v13.8b\n"
+      "umull v16.8h, v10.8b, v14.8b\n"
+      "umull v17.8h, v11.8b, v12.8b\n"
+      "umull v18.8h, v11.8b, v13.8b\n"
+
+      // Subtract counter.
+      "subs %x[count], %x[count], #8\n"
+
+      "umull v9.8h, v11.8b, v14.8b\n"
+      "uadalp v4.4s, v15.8h\n"
+      "uadalp v5.4s, v16.8h\n"
+      "uadalp v6.4s, v17.8h\n"
+      "uadalp v7.4s, v18.8h\n"
+      "uadalp v8.4s, v9.8h\n"
+
+      // Loop break.
+      "bgt 1b\n"
+
+      // StaticQuantizationFloat::Prepare
+      "ld1 {v9.4s}, [%x[lhs]], #16\n"
+      "ld1 {v10.4s}, [%x[rhs]], #16\n"
+      "dup v11.4s, %w[scale]\n"
+      "dup v12.4s, v9.s[0]\n"
+      "dup v13.4s, v9.s[1]\n"
+      "dup v9.4s, v9.s[2]\n"
+
+      // RowMajorOutput::Prepare
+      "add x0, %x[result], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+
+      // Reduce aggregators.
+      "addp v0.4s, v0.4s, v1.4s\n"
+      "addp v2.4s, v2.4s, v2.4s\n"
+      "addp v0.4s, v0.4s, v2.4s\n"
+      "addp v3.4s, v3.4s, v4.4s\n"
+      "addp v5.4s, v5.4s, v5.4s\n"
+      "addp v3.4s, v3.4s, v5.4s\n"
+      "addp v6.4s, v6.4s, v7.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v6.4s, v6.4s, v8.4s\n"
+
+      // StaticQuantizationFloat::Transform
+      "add v0.4s, v0.4s, v12.4s\n"
+      "add v3.4s, v3.4s, v13.4s\n"
+      "add v6.4s, v6.4s, v9.4s\n"
+      "add v0.4s, v0.4s, v10.4s\n"
+      "add v3.4s, v3.4s, v10.4s\n"
+      "add v6.4s, v6.4s, v10.4s\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v11.4s\n"
+      "fmul v3.4s, v3.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+
+      // RowMajorOutput::Output
+      "st1 {v0.2s}, [%x[result]], #8\n"
+      "st1 {v0.s}[2], [%x[result]], #4\n"
+      "st1 {v3.2s}, [x0], #8\n"
+      "st1 {v3.s}[2], [x0], #4\n"
+      "st1 {v6.2s}, [x1], #8\n"
+      "st1 {v6.s}[2], [x1], #4\n"
+      : [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
+      : [count] "r"(params.kernel.count),
+        [stride] "r"(params.output_stream.stride),
+        [scale] "r"(params.kernel.scale)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "cc",
+        "memory");
+}
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#else
+#warning "Meta gemm for arm64 requires: GEMMLOWP_NEON_64!"
+#endif
+
+#endif  // GEMMLOWP_META_QUANTIZED_MUL_KERNELS_ARM_64_H_
diff --git a/meta/single_thread_gemm.h b/meta/single_thread_gemm.h
index 4c21fb0..258de69 100644
--- a/meta/single_thread_gemm.h
+++ b/meta/single_thread_gemm.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -11,38093 +11,678 @@
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.
-//
-// single_thread_gemm.h: programatically generated GEMM library header.
 
 #ifndef GEMMLOWP_META_SINGLE_THREAD_GEMM_H_
 #define GEMMLOWP_META_SINGLE_THREAD_GEMM_H_
 
-#ifdef GEMMLOWP_NEON_32
-
-#include <cassert>
+#include <iostream>
+#include "base.h"
 
 namespace gemmlowp {
 namespace meta {
+
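+// Single-threaded GEMM entry point, parameterized by the executor strategy
+// and the kernel tile sizes (kernel_m x kernel_n x kernel_k).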
+template <typename Executor, typename Params, int kernel_m, int kernel_n,
+          int kernel_k>
+void Gemm(const Params& params);
+
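+// Executor that packs the whole RHS into scratch memory up front, then packs
+// the LHS one chunk at a time and multiplies it against every packed RHS
+// chunk. Leftover rows/columns that do not fill a whole kernel tile are
+// handled by the KernelFL/KernelLF/KernelLL specializations.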
+class GemmExecutorPackRHS {
+ public:
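+  // Scratch memory needed: one packed LHS chunk plus every packed RHS chunk,
+  // rounded up to a multiple of 64KB.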
+  template <typename P>
+  static int EstimateScratchSize(const P& params, int kernel_m, int kernel_n,
+                                 int kernel_k) {
+    const int lhs_scratch =
+        StreamUtil<typename P::InType, typename P::LeftStream>::Scratch(
+            params.left_stream, kernel_m, kernel_k);
+    const int rhs_chunks = ((params.n + kernel_n - 1) / kernel_n);
+    const int rhs_scratch =
+        rhs_chunks *
+        StreamUtil<typename P::InType, typename P::RightStream>::Scratch(
+            params.right_stream, kernel_n, kernel_k);
+    return AlignTo<64 * 1024>(lhs_scratch + rhs_scratch);
+  }
+
+  template <typename P, int m, int n, int k, int m_leftovers, int n_leftovers,
+            int k_leftovers>
+  static void ExecuteDispatch3D(const P& params) {
+    // Shorthand typedefs for streams and multiply kernels.
+    typedef typename P::InType InType;
+    typedef typename P::OutType OutType;
+
+    typedef Stream<typename P::InType, m, k, k_leftovers,
+                   typename P::LeftStream>
+        LeftStreamF;
+    typedef Stream<typename P::InType, m_leftovers, k, k_leftovers,
+                   typename P::LeftStream>
+        LeftStreamL;
+
+    typedef Stream<typename P::InType, n, k, k_leftovers,
+                   typename P::RightStream>
+        RightStreamF;
+    typedef Stream<typename P::InType, n_leftovers, k, k_leftovers,
+                   typename P::RightStream>
+        RightStreamL;
+
+    typedef Stream<typename P::OutType, m, n, 0, typename P::OutputStream>
+        OutputStreamFF;
+    typedef Stream<typename P::OutType, m_leftovers, n, 0,
+                   typename P::OutputStream>
+        OutputStreamLF;
+
+    typedef MulKernel<typename P::InType, typename P::OutType,
+                      typename P::Kernel, typename P::OutputStream, m, n, k>
+        KernelFF;
+    typedef MulKernel<typename P::InType, typename P::OutType,
+                      typename P::Kernel, typename P::OutputStream, m,
+                      n_leftovers, k>
+        KernelFL;
+    typedef MulKernel<typename P::InType, typename P::OutType,
+                      typename P::Kernel, typename P::OutputStream, m_leftovers,
+                      n, k>
+        KernelLF;
+    typedef MulKernel<typename P::InType, typename P::OutType,
+                      typename P::Kernel, typename P::OutputStream, m_leftovers,
+                      n_leftovers, k>
+        KernelLL;
+
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "GemmExecutor(" << typeid(P).name() << "): " << m << "x" << n
+              << "x" << k << " -- " << m_leftovers << "x" << n_leftovers << "x"
+              << k_leftovers << " -- " << params.m << "x" << params.n << "x"
+              << params.k << std::endl;
+    LeftStreamF::Debug(params.left_stream);
+    LeftStreamL::Debug(params.left_stream);
+
+    RightStreamF::Debug(params.right_stream);
+    RightStreamL::Debug(params.right_stream);
+
+    OutputStreamFF::Debug(params.fused_kernel.output_stream);
+    OutputStreamLF::Debug(params.fused_kernel.output_stream);
+
+    KernelFF::Debug(params.fused_kernel);
+    KernelFL::Debug(params.fused_kernel);
+    KernelLF::Debug(params.fused_kernel);
+    KernelLL::Debug(params.fused_kernel);
+#endif
+#endif
+
+    int lhs_chunks = params.m / m;
+    int rhs_chunks = params.n / n;
+
+    // Scratch memory for packed LHS & RHS chunks.
+
+    std::uint8_t* packed_lhs = params.scratch;
+    std::uint8_t* packed_rhs =
+        params.scratch + LeftStreamF::Scratch(params.left_stream);
+
+    // Pack full RHS first.
+
+    std::uint8_t* packed_rhs_chunk = packed_rhs;
+    const int packed_rhs_chunk_size =
+        RightStreamF::PackedStride(params.right_stream);
+
+    {
+      const std::uint8_t* rhs_chunk =
+          reinterpret_cast<const std::uint8_t*>(params.rhs);
+      const int rhs_chunk_size =
+          RightStreamF::UnpackedStride(params.right_stream);
+
+      for (int i = 0; i < rhs_chunks; ++i) {
+        RightStreamF::Pack(reinterpret_cast<const InType*>(rhs_chunk),
+                           params.right_stream,
+                           reinterpret_cast<InType*>(packed_rhs_chunk));
+
+        rhs_chunk += rhs_chunk_size;
+        packed_rhs_chunk += packed_rhs_chunk_size;
+      }
+
+      RightStreamL::Pack(reinterpret_cast<const InType*>(rhs_chunk),
+                         params.right_stream,
+                         reinterpret_cast<InType*>(packed_rhs_chunk));
+    }
+
+    // Multiply RHS by LHS one LHS chunk at a time.
+
+    const std::uint8_t* lhs_chunk =
+        reinterpret_cast<const std::uint8_t*>(params.lhs);
+    std::uint8_t* result_strip = reinterpret_cast<std::uint8_t*>(params.result);
+    std::uint8_t* result_chunk = result_strip;
+
+    {
+      const int lhs_chunk_size =
+          LeftStreamF::UnpackedStride(params.left_stream);
+      const int result_strip_size =
+          OutputStreamFF::UnpackedStride(params.fused_kernel.output_stream);
+      const int result_chunk_size =
+          OutputStreamFF::UnpackedAdvance(params.fused_kernel.output_stream);
+
+      for (int i = 0; i < lhs_chunks; ++i) {
+        LeftStreamF::Pack(reinterpret_cast<const InType*>(lhs_chunk),
+                          params.left_stream,
+                          reinterpret_cast<InType*>(packed_lhs));
+
+        result_chunk = result_strip;
+        packed_rhs_chunk = packed_rhs;
+
+        for (int j = 0; j < rhs_chunks; ++j) {
+          KernelFF::Multiply(reinterpret_cast<const InType*>(packed_lhs),
+                             reinterpret_cast<const InType*>(packed_rhs_chunk),
+                             params.fused_kernel,
+                             reinterpret_cast<OutType*>(result_chunk));
+
+          result_chunk += result_chunk_size;
+          packed_rhs_chunk += packed_rhs_chunk_size;
+        }
+
+        KernelFL::Multiply(reinterpret_cast<const InType*>(packed_lhs),
+                           reinterpret_cast<const InType*>(packed_rhs_chunk),
+                           params.fused_kernel,
+                           reinterpret_cast<OutType*>(result_chunk));
+
+        lhs_chunk += lhs_chunk_size;
+        result_strip += result_strip_size;
+      }
+    }
+
+    // Leftover LHS chunk.
+    if (m_leftovers > 0) {  // static if
+      const int result_chunk_size =
+          OutputStreamLF::UnpackedAdvance(params.fused_kernel.output_stream);
+
+      LeftStreamL::Pack(reinterpret_cast<const InType*>(lhs_chunk),
+                        params.left_stream,
+                        reinterpret_cast<InType*>(packed_lhs));
+
+      result_chunk = result_strip;
+      packed_rhs_chunk = packed_rhs;
+
+      for (int i = 0; i < rhs_chunks; ++i) {
+        KernelLF::Multiply(reinterpret_cast<const InType*>(packed_lhs),
+                           reinterpret_cast<const InType*>(packed_rhs_chunk),
+                           params.fused_kernel,
+                           reinterpret_cast<OutType*>(result_chunk));
+
+        result_chunk += result_chunk_size;
+        packed_rhs_chunk += packed_rhs_chunk_size;
+      }
+
+      KernelLL::Multiply(reinterpret_cast<const InType*>(packed_lhs),
+                         reinterpret_cast<const InType*>(packed_rhs_chunk),
+                         params.fused_kernel,
+                         reinterpret_cast<OutType*>(result_chunk));
+    }
+  }
+};
+
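+// Mirror of GemmExecutorPackRHS: packs the whole LHS into scratch memory up
+// front, then packs and multiplies the RHS one chunk at a time.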
+class GemmExecutorPackLHS {
+ public:
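+  // Scratch memory needed: every packed LHS chunk plus one packed RHS chunk,
+  // rounded up to a multiple of 64KB.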
+  template <typename P>
+  static int EstimateScratchSize(const P& params, int kernel_m, int kernel_n,
+                                 int kernel_k) {
+    const int lhs_chunks = ((params.m + kernel_m - 1) / kernel_m);
+    const int lhs_scratch =
+        lhs_chunks *
+        StreamUtil<typename P::InType, typename P::LeftStream>::Scratch(
+            params.left_stream, kernel_m, kernel_k);
+    const int rhs_scratch =
+        StreamUtil<typename P::InType, typename P::RightStream>::Scratch(
+            params.right_stream, kernel_n, kernel_k);
+    return AlignTo<64 * 1024>(lhs_scratch + rhs_scratch);
+  }
+
+  template <typename P, int m, int n, int k, int m_leftovers, int n_leftovers,
+            int k_leftovers>
+  static void ExecuteDispatch3D(const P& params) {
+    // Shorthand typedefs for streams and multiply kernels.
+    typedef typename P::InType InType;
+    typedef typename P::OutType OutType;
+
+    typedef Stream<typename P::InType, m, k, k_leftovers,
+                   typename P::LeftStream>
+        LeftStreamF;
+    typedef Stream<typename P::InType, m_leftovers, k, k_leftovers,
+                   typename P::LeftStream>
+        LeftStreamL;
+
+    typedef Stream<typename P::InType, n, k, k_leftovers,
+                   typename P::RightStream>
+        RightStreamF;
+    typedef Stream<typename P::InType, n_leftovers, k, k_leftovers,
+                   typename P::RightStream>
+        RightStreamL;
+
+    typedef Stream<typename P::OutType, m, n, 0, typename P::OutputStream>
+        OutputStreamFF;
+    typedef Stream<typename P::OutType, m, n_leftovers, 0,
+                   typename P::OutputStream>
+        OutputStreamFL;
+
+    typedef MulKernel<typename P::InType, typename P::OutType,
+                      typename P::Kernel, typename P::OutputStream, m, n, k>
+        KernelFF;
+    typedef MulKernel<typename P::InType, typename P::OutType,
+                      typename P::Kernel, typename P::OutputStream, m,
+                      n_leftovers, k>
+        KernelFL;
+    typedef MulKernel<typename P::InType, typename P::OutType,
+                      typename P::Kernel, typename P::OutputStream, m_leftovers,
+                      n, k>
+        KernelLF;
+    typedef MulKernel<typename P::InType, typename P::OutType,
+                      typename P::Kernel, typename P::OutputStream, m_leftovers,
+                      n_leftovers, k>
+        KernelLL;
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "GemmExecutor(" << typeid(P).name() << "): " << m << "x" << n
+              << "x" << k << " -- " << m_leftovers << "x" << n_leftovers << "x"
+              << k_leftovers << " -- " << params.m << "x" << params.n << "x"
+              << params.k << std::endl;
+    LeftStreamF::Debug(params.left_stream);
+    LeftStreamL::Debug(params.left_stream);
+
+    RightStreamF::Debug(params.right_stream);
+    RightStreamL::Debug(params.right_stream);
+
+    OutputStreamFF::Debug(params.fused_kernel.output_stream);
+    OutputStreamFL::Debug(params.fused_kernel.output_stream);
+
+    KernelFF::Debug(params.fused_kernel);
+    KernelFL::Debug(params.fused_kernel);
+    KernelLF::Debug(params.fused_kernel);
+    KernelLL::Debug(params.fused_kernel);
+#endif
+#endif
+
+    int lhs_chunks = params.m / m;
+    int rhs_chunks = params.n / n;
+
+    // Scratch memory for packed LHS & RHS chunks.
+    std::uint8_t* packed_rhs = params.scratch;
+    std::uint8_t* packed_lhs =
+        params.scratch + RightStreamF::Scratch(params.right_stream);
+
+    // Pack full LHS first.
+
+    std::uint8_t* packed_lhs_chunk = packed_lhs;
+    const int packed_lhs_chunk_size =
+        LeftStreamF::PackedStride(params.left_stream);
+
+    {
+      const std::uint8_t* lhs_chunk =
+          reinterpret_cast<const std::uint8_t*>(params.lhs);
+      const int lhs_chunk_size =
+          LeftStreamF::UnpackedStride(params.left_stream);
+
+      for (int i = 0; i < lhs_chunks; ++i) {
+        LeftStreamF::Pack(reinterpret_cast<const InType*>(lhs_chunk),
+                          params.left_stream,
+                          reinterpret_cast<InType*>(packed_lhs_chunk));
+
+        lhs_chunk += lhs_chunk_size;
+        packed_lhs_chunk += packed_lhs_chunk_size;
+      }
+
+      LeftStreamL::Pack(reinterpret_cast<const InType*>(lhs_chunk),
+                        params.left_stream,
+                        reinterpret_cast<InType*>(packed_lhs_chunk));
+    }
+
+    // Multiply RHS by LHS one RHS chunk at a time.
+
+    const std::uint8_t* rhs_chunk =
+        reinterpret_cast<const std::uint8_t*>(params.rhs);
+    std::uint8_t* result_strip = reinterpret_cast<std::uint8_t*>(params.result);
+    std::uint8_t* result_chunk = result_strip;
+
+    {
+      const int rhs_chunk_size =
+          RightStreamF::UnpackedStride(params.right_stream);
+      const int result_strip_size =
+          OutputStreamFF::UnpackedAdvance(params.fused_kernel.output_stream);
+      const int result_chunk_size =
+          OutputStreamFF::UnpackedStride(params.fused_kernel.output_stream);
+
+      for (int i = 0; i < rhs_chunks; ++i) {
+        RightStreamF::Pack(reinterpret_cast<const InType*>(rhs_chunk),
+                           params.right_stream,
+                           reinterpret_cast<InType*>(packed_rhs));
+
+        result_chunk = result_strip;
+        packed_lhs_chunk = packed_lhs;
+
+        for (int j = 0; j < lhs_chunks; ++j) {
+          KernelFF::Multiply(reinterpret_cast<const InType*>(packed_lhs_chunk),
+                             reinterpret_cast<const InType*>(packed_rhs),
+                             params.fused_kernel,
+                             reinterpret_cast<OutType*>(result_chunk));
+
+          result_chunk += result_chunk_size;
+          packed_lhs_chunk += packed_lhs_chunk_size;
+        }
+
+        KernelLF::Multiply(reinterpret_cast<const InType*>(packed_lhs_chunk),
+                           reinterpret_cast<const InType*>(packed_rhs),
+                           params.fused_kernel,
+                           reinterpret_cast<OutType*>(result_chunk));
+
+        rhs_chunk += rhs_chunk_size;
+        result_strip += result_strip_size;
+      }
+    }
+
+    // Leftover RHS chunk.
+    if (n_leftovers > 0) {  // static if
+      const int result_chunk_size =
+          OutputStreamFL::UnpackedStride(params.fused_kernel.output_stream);
+
+      RightStreamL::Pack(reinterpret_cast<const InType*>(rhs_chunk),
+                         params.right_stream,
+                         reinterpret_cast<InType*>(packed_rhs));
+
+      result_chunk = result_strip;
+      packed_lhs_chunk = packed_lhs;
+
+      for (int i = 0; i < lhs_chunks; ++i) {
+        KernelFL::Multiply(reinterpret_cast<const InType*>(packed_lhs_chunk),
+                           reinterpret_cast<const InType*>(packed_rhs),
+                           params.fused_kernel,
+                           reinterpret_cast<OutType*>(result_chunk));
+
+        result_chunk += result_chunk_size;
+        packed_lhs_chunk += packed_lhs_chunk_size;
+      }
+
+      KernelLL::Multiply(reinterpret_cast<const InType*>(packed_lhs_chunk),
+                         reinterpret_cast<const InType*>(packed_rhs),
+                         params.fused_kernel,
+                         reinterpret_cast<OutType*>(result_chunk));
+    }
+  }
+};
+
 namespace internal {
 
-void zip_1x8_aligned(const std::uint8_t* source, std::int32_t count,
-                     std::int32_t stride, std::uint8_t* destination,
-                     std::int32_t multiplicative_offset,
-                     std::int32_t additive_offset) {
-  asm volatile(
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
+inline int CalculateCacheFriendlyTasksCount(int cache_size, int constant_memory,
+                                            int per_chunk_memory, int total_dim,
+                                            int chunk_dim) {
+  assert(constant_memory + per_chunk_memory < cache_size);
+  const int available_cache = cache_size - constant_memory;
+  const int available_chunks = available_cache / per_chunk_memory;
+  const int chunks_count = (total_dim + chunk_dim - 1) / chunk_dim;
+  return (chunks_count + available_chunks - 1) / available_chunks;
 }
 
-void zip_1x8_1_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #1\n"
-      "vmov.i16 q2, #0\n"
+template <typename Params>
+inline void UpdateCacheFriendlyTask(int m_offset, int m, int n_offset, int n,
+                                    const Params& params, Params* task_params) {
+  task_params->m = m;
+  task_params->lhs =
+      StreamUtil<typename Params::InType, typename Params::LeftStream>::Offset(
+          params.left_stream, params.lhs, m_offset, 0);
 
-      "1:"
-      "subs %[count], %[count], #8\n"
+  task_params->n = n;
+  task_params->rhs =
+      StreamUtil<typename Params::InType, typename Params::RightStream>::Offset(
+          params.right_stream, params.rhs, n_offset, 0);
 
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.8 {d0[0]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_2_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #2\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.16 {d0[0]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_3_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #3\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.16 {d0[0]}, [%[source]]!\n"
-      "vld1.8 {d0[2]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_4_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #4\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_5_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #5\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.8 {d0[4]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_6_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #6\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.16 {d0[2]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_7_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #7\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.16 {d0[2]}, [%[source]]!\n"
-      "vld1.8 {d0[6]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_2x8_aligned(const std::uint8_t* source, std::int32_t count,
-                     std::int32_t stride, std::uint8_t* destination,
-                     std::int32_t multiplicative_offset,
-                     std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_1_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #1\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.8 {d0[0]}, [%[source]]\n"
-      "vld1.8 {d1[0]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_2_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #2\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.16 {d0[0]}, [%[source]]\n"
-      "vld1.16 {d1[0]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_3_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #3\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.16 {d0[0]}, [%[source]]!\n"
-      "vld1.16 {d1[0]}, [r0]!\n"
-      "vld1.8 {d0[2]}, [%[source]]\n"
-      "vld1.8 {d1[2]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_4_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #4\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]\n"
-      "vld1.32 {d1[0]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_5_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #5\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.32 {d1[0]}, [r0]!\n"
-      "vld1.8 {d0[4]}, [%[source]]\n"
-      "vld1.8 {d1[4]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_6_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #6\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.32 {d1[0]}, [r0]!\n"
-      "vld1.16 {d0[2]}, [%[source]]\n"
-      "vld1.16 {d1[2]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_7_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #7\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.32 {d1[0]}, [r0]!\n"
-      "vld1.16 {d0[2]}, [%[source]]!\n"
-      "vld1.16 {d1[2]}, [r0]!\n"
-      "vld1.8 {d0[6]}, [%[source]]\n"
-      "vld1.8 {d1[6]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_3x8_aligned(const std::uint8_t* source, std::int32_t count,
-                     std::int32_t stride, std::uint8_t* destination,
-                     std::int32_t multiplicative_offset,
-                     std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vld1.8 {d2}, [r1:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_1_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #1\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vld1.8 {d2}, [r1:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.8 {d0[0]}, [%[source]]\n"
-      "vld1.8 {d1[0]}, [r0]\n"
-      "vld1.8 {d2[0]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_2_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #2\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vld1.8 {d2}, [r1:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.16 {d0[0]}, [%[source]]\n"
-      "vld1.16 {d1[0]}, [r0]\n"
-      "vld1.16 {d2[0]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_3_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #3\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vld1.8 {d2}, [r1:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.16 {d0[0]}, [%[source]]!\n"
-      "vld1.16 {d1[0]}, [r0]!\n"
-      "vld1.16 {d2[0]}, [r1]!\n"
-      "vld1.8 {d0[2]}, [%[source]]\n"
-      "vld1.8 {d1[2]}, [r0]\n"
-      "vld1.8 {d2[2]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_4_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #4\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vld1.8 {d2}, [r1:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]\n"
-      "vld1.32 {d1[0]}, [r0]\n"
-      "vld1.32 {d2[0]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_5_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #5\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vld1.8 {d2}, [r1:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.32 {d1[0]}, [r0]!\n"
-      "vld1.32 {d2[0]}, [r1]!\n"
-      "vld1.8 {d0[4]}, [%[source]]\n"
-      "vld1.8 {d1[4]}, [r0]\n"
-      "vld1.8 {d2[4]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_6_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #6\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vld1.8 {d2}, [r1:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.32 {d1[0]}, [r0]!\n"
-      "vld1.32 {d2[0]}, [r1]!\n"
-      "vld1.16 {d0[2]}, [%[source]]\n"
-      "vld1.16 {d1[2]}, [r0]\n"
-      "vld1.16 {d2[2]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_7_aligned(const std::uint8_t* source, std::int32_t count,
-                       std::int32_t stride, std::uint8_t* destination,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #7\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]:64]!\n"
-      "vld1.8 {d1}, [r0:64]!\n"
-      "vld1.8 {d2}, [r1:64]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.32 {d1[0]}, [r0]!\n"
-      "vld1.32 {d2[0]}, [r1]!\n"
-      "vld1.16 {d0[2]}, [%[source]]!\n"
-      "vld1.16 {d1[2]}, [r0]!\n"
-      "vld1.16 {d2[2]}, [r1]!\n"
-      "vld1.8 {d0[6]}, [%[source]]\n"
-      "vld1.8 {d1[6]}, [r0]\n"
-      "vld1.8 {d2[6]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_1x8(const std::uint8_t* source, std::int32_t count,
-             std::int32_t stride, std::uint8_t* destination,
-             std::int32_t multiplicative_offset, std::int32_t additive_offset) {
-  asm volatile(
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_1(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #1\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.8 {d0[0]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_2(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #2\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.16 {d0[0]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_3(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #3\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.16 {d0[0]}, [%[source]]!\n"
-      "vld1.8 {d0[2]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_4(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #4\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_5(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #5\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.8 {d0[4]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_6(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #6\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.16 {d0[2]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_1x8_7(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "sub %[count], %[count], #7\n"
-      "vmov.i16 q2, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.16 {d0[2]}, [%[source]]!\n"
-      "vld1.8 {d0[6]}, [%[source]]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vst1.8 {d0}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d1[0], %[multiplicative_offset]\n"
-      "vdup.32 q1, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpadd.u32 d6, d4, d5\n"
-      "vpadd.u32 d8, d6, d6\n"
-      "vmul.i32 q4, q4, d1[0]\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vst1.32 {d8[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "cc", "memory");
-}
-
-void zip_2x8(const std::uint8_t* source, std::int32_t count,
-             std::int32_t stride, std::uint8_t* destination,
-             std::int32_t multiplicative_offset, std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_1(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #1\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.8 {d0[0]}, [%[source]]\n"
-      "vld1.8 {d1[0]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_2(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #2\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.16 {d0[0]}, [%[source]]\n"
-      "vld1.16 {d1[0]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_3(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #3\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.16 {d0[0]}, [%[source]]!\n"
-      "vld1.16 {d1[0]}, [r0]!\n"
-      "vld1.8 {d0[2]}, [%[source]]\n"
-      "vld1.8 {d1[2]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_4(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #4\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]\n"
-      "vld1.32 {d1[0]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_5(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #5\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.32 {d1[0]}, [r0]!\n"
-      "vld1.8 {d0[4]}, [%[source]]\n"
-      "vld1.8 {d1[4]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_6(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #6\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.32 {d1[0]}, [r0]!\n"
-      "vld1.16 {d0[2]}, [%[source]]\n"
-      "vld1.16 {d1[2]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_2x8_7(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "sub %[count], %[count], #7\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.32 {d1[0]}, [r0]!\n"
-      "vld1.16 {d0[2]}, [%[source]]!\n"
-      "vld1.16 {d1[2]}, [r0]!\n"
-      "vld1.8 {d0[6]}, [%[source]]\n"
-      "vld1.8 {d1[6]}, [r0]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vst1.8 {d0, d1}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d2[0], %[multiplicative_offset]\n"
-      "vdup.32 q4, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpadd.u32 d3, d4, d5\n"
-      "vpadd.u32 d10, d6, d7\n"
-      "vpadd.u32 d12, d3, d10\n"
-      "vmul.i32 q6, q6, d2[0]\n"
-      "vadd.i32 q6, q6, q4\n"
-      "vst1.32 {d12}, [%[destination]:64]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d12", "d13", "cc", "memory");
-}
-
-void zip_3x8(const std::uint8_t* source, std::int32_t count,
-             std::int32_t stride, std::uint8_t* destination,
-             std::int32_t multiplicative_offset, std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vld1.8 {d2}, [r1]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_1(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #1\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vld1.8 {d2}, [r1]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.8 {d0[0]}, [%[source]]\n"
-      "vld1.8 {d1[0]}, [r0]\n"
-      "vld1.8 {d2[0]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_2(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #2\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vld1.8 {d2}, [r1]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.16 {d0[0]}, [%[source]]\n"
-      "vld1.16 {d1[0]}, [r0]\n"
-      "vld1.16 {d2[0]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_3(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #3\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vld1.8 {d2}, [r1]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.16 {d0[0]}, [%[source]]!\n"
-      "vld1.16 {d1[0]}, [r0]!\n"
-      "vld1.16 {d2[0]}, [r1]!\n"
-      "vld1.8 {d0[2]}, [%[source]]\n"
-      "vld1.8 {d1[2]}, [r0]\n"
-      "vld1.8 {d2[2]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_4(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #4\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vld1.8 {d2}, [r1]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]\n"
-      "vld1.32 {d1[0]}, [r0]\n"
-      "vld1.32 {d2[0]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_5(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #5\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vld1.8 {d2}, [r1]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.32 {d1[0]}, [r0]!\n"
-      "vld1.32 {d2[0]}, [r1]!\n"
-      "vld1.8 {d0[4]}, [%[source]]\n"
-      "vld1.8 {d1[4]}, [r0]\n"
-      "vld1.8 {d2[4]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_6(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #6\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vld1.8 {d2}, [r1]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.32 {d1[0]}, [r0]!\n"
-      "vld1.32 {d2[0]}, [r1]!\n"
-      "vld1.16 {d0[2]}, [%[source]]\n"
-      "vld1.16 {d1[2]}, [r0]\n"
-      "vld1.16 {d2[2]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-void zip_3x8_7(const std::uint8_t* source, std::int32_t count,
-               std::int32_t stride, std::uint8_t* destination,
-               std::int32_t multiplicative_offset,
-               std::int32_t additive_offset) {
-  asm volatile(
-      "add r0, %[source], %[stride]\n"
-      "add r1, r0, %[stride]\n"
-      "sub %[count], %[count], #7\n"
-      "vmov.i16 q2, #0\n"
-      "vmov.i16 q3, #0\n"
-      "vmov.i16 q4, #0\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-
-      // Load Aggregate Store.
-      "vld1.8 {d0}, [%[source]]!\n"
-      "vld1.8 {d1}, [r0]!\n"
-      "vld1.8 {d2}, [r1]!\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-
-      // Leftover Load Aggregate Store.
-      "vmov.i8 d0, #0\n"
-      "vmov.i8 d1, #0\n"
-      "vmov.i8 d2, #0\n"
-      "vld1.32 {d0[0]}, [%[source]]!\n"
-      "vld1.32 {d1[0]}, [r0]!\n"
-      "vld1.32 {d2[0]}, [r1]!\n"
-      "vld1.16 {d0[2]}, [%[source]]!\n"
-      "vld1.16 {d1[2]}, [r0]!\n"
-      "vld1.16 {d2[2]}, [r1]!\n"
-      "vld1.8 {d0[6]}, [%[source]]\n"
-      "vld1.8 {d1[6]}, [r0]\n"
-      "vld1.8 {d2[6]}, [r1]\n"
-      "vaddw.u8 q2, q2, d0\n"
-      "vaddw.u8 q3, q3, d1\n"
-      "vaddw.u8 q4, q4, d2\n"
-      "vst1.8 {d0, d1, d2}, [%[destination]:64]!\n"
-
-      // Aggregator Reduction.
-      "vmov.32 d3[0], %[multiplicative_offset]\n"
-      "vdup.32 q5, %[additive_offset]\n"
-      "vpaddl.u16 q2, q2\n"
-      "vpaddl.u16 q3, q3\n"
-      "vpaddl.u16 q4, q4\n"
-      "vpadd.u32 d12, d4, d5\n"
-      "vpadd.u32 d13, d6, d7\n"
-      "vpadd.u32 d14, d8, d9\n"
-      "vpadd.u32 d16, d12, d13\n"
-      "vpadd.u32 d17, d14, d14\n"
-      "vmul.i32 q8, q8, d3[0]\n"
-      "vadd.i32 q8, q8, q5\n"
-      "vst1.32 {d16}, [%[destination]:64]!\n"
-      "vst1.32 {d17[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [additive_offset] "+r"(additive_offset), [stride] "+r"(stride),
-        [destination] "+r"(destination), [source] "+r"(source)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d16", "d17", "cc", "memory");
-}
-
-inline void mul_1x8_1x8_int32_rhsadd(const std::uint8_t* lhs,
-                                     const std::uint8_t* rhs,
-                                     std::int32_t count, std::int32_t* result,
-                                     std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d2}, [%[lhs]:64]!\n"
-      "vld1.8 {d3}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q2, d3, d2\n"
-      "vpadal.u16 q0, q2\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[rhs]:64]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d0\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d8\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0[0]}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d8", "cc", "memory");
-}
-
-inline void mul_1x8_2x8_int32_rhsadd(const std::uint8_t* lhs,
-                                     const std::uint8_t* rhs,
-                                     std::int32_t count, std::int32_t* result,
-                                     std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d4}, [%[lhs]:64]!\n"
-      "vld1.8 {d5, d6}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q4, d5, d4\n"
-      "vmull.u8 q5, d6, d4\n"
-      "vpadal.u16 q0, q4\n"
-      "vpadal.u16 q1, q5\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[rhs]:64]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d8\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "d10", "d11",
-        "cc", "memory");
-}
-
-inline void mul_1x8_3x8_int32_rhsadd(const std::uint8_t* lhs,
-                                     const std::uint8_t* rhs,
-                                     std::int32_t count, std::int32_t* result,
-                                     std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d6}, [%[lhs]:64]!\n"
-      "vld1.8 {d7, d8, d9}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q5, d7, d6\n"
-      "vmull.u8 q6, d8, d6\n"
-      "vmull.u8 q7, d9, d6\n"
-      "vpadal.u16 q0, q5\n"
-      "vpadal.u16 q1, q6\n"
-      "vpadal.u16 q2, q7\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {q4}, [%[rhs]:64]\n"
-
-      // Change stride because storing in two ops.
-      "sub %[result_stride], %[result_stride], #8\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d1, d4, d4\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 q0, q0, q4\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]]!\n"
-      "vst1.32 {d1[0]}, [%[result]], %[result_stride]\n"
-
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "cc", "memory");
-}
-
-inline void mul_2x8_1x8_int32_rhsadd(const std::uint8_t* lhs,
-                                     const std::uint8_t* rhs,
-                                     std::int32_t count, std::int32_t* result,
-                                     std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d4, d5}, [%[lhs]:64]!\n"
-      "vld1.8 {d6}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q4, d6, d4\n"
-      "vmull.u8 q5, d6, d5\n"
-      "vpadal.u16 q0, q4\n"
-      "vpadal.u16 q1, q5\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[rhs]:64]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d0\n"
-      "vpadd.u32 d2, d2, d2\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d8\n"
-      "vadd.s32 d2, d2, d8\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0[0]}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d2[0]}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "d10", "d11",
-        "cc", "memory");
-}
-
-inline void mul_2x8_2x8_int32_rhsadd(const std::uint8_t* lhs,
-                                     const std::uint8_t* rhs,
-                                     std::int32_t count, std::int32_t* result,
-                                     std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-      "vmov.i32 q3, q0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d8, d9}, [%[lhs]:64]!\n"
-      "vld1.8 {d10, d11}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q6, d10, d8\n"
-      "vmull.u8 q7, d11, d8\n"
-      "vmull.u8 q8, d10, d9\n"
-      "vmull.u8 q9, d11, d9\n"
-      "vpadal.u16 q0, q6\n"
-      "vpadal.u16 q1, q7\n"
-      "vpadal.u16 q2, q8\n"
-      "vpadal.u16 q3, q9\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[rhs]:64]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-      "vpadd.u32 d6, d6, d7\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d4, d4, d6\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d8\n"
-      "vadd.s32 d4, d4, d8\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d4}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "cc",
-        "memory");
-}
-
-inline void mul_2x8_3x8_int32_rhsadd(const std::uint8_t* lhs,
-                                     const std::uint8_t* rhs,
-                                     std::int32_t count, std::int32_t* result,
-                                     std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-      "vmov.i32 q3, q0\n"
-      "vmov.i32 q4, q1\n"
-      "vmov.i32 q5, q2\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d12, d13}, [%[lhs]:64]!\n"
-      "vld1.8 {d14, d15, d16}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q9, d14, d12\n"
-      "vmull.u8 q10, d15, d12\n"
-      "vmull.u8 q11, d16, d12\n"
-      "vmull.u8 q12, d14, d13\n"
-      "vmull.u8 q13, d15, d13\n"
-      "vmull.u8 q14, d16, d13\n"
-      "vpadal.u16 q0, q9\n"
-      "vpadal.u16 q1, q10\n"
-      "vpadal.u16 q2, q11\n"
-      "vpadal.u16 q3, q12\n"
-      "vpadal.u16 q4, q13\n"
-      "vpadal.u16 q5, q14\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {q6}, [%[rhs]:64]\n"
-
-      // Change stride because storing in two ops.
-      "sub %[result_stride], %[result_stride], #8\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-      "vpadd.u32 d6, d6, d7\n"
-      "vpadd.u32 d8, d8, d9\n"
-      "vpadd.u32 d10, d10, d11\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d1, d4, d4\n"
-      "vpadd.u32 d6, d6, d8\n"
-      "vpadd.u32 d7, d10, d10\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 q0, q0, q6\n"
-      "vadd.s32 q3, q3, q6\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]]!\n"
-      "vst1.32 {d1[0]}, [%[result]], %[result_stride]\n"
-
-      "vst1.32 {d6}, [%[result]]!\n"
-      "vst1.32 {d7[0]}, [%[result]], %[result_stride]\n"
-
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "d16", "d18", "d19", "d20", "d21",
-        "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "cc", "memory");
-}
-
-inline void mul_3x8_1x8_int32_rhsadd(const std::uint8_t* lhs,
-                                     const std::uint8_t* rhs,
-                                     std::int32_t count, std::int32_t* result,
-                                     std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d6, d7, d8}, [%[lhs]:64]!\n"
-      "vld1.8 {d9}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q5, d9, d6\n"
-      "vmull.u8 q6, d9, d7\n"
-      "vmull.u8 q7, d9, d8\n"
-      "vpadal.u16 q0, q5\n"
-      "vpadal.u16 q1, q6\n"
-      "vpadal.u16 q2, q7\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[rhs]:64]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d0\n"
-      "vpadd.u32 d2, d2, d2\n"
-      "vpadd.u32 d4, d4, d4\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d8\n"
-      "vadd.s32 d2, d2, d8\n"
-      "vadd.s32 d4, d4, d8\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0[0]}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d2[0]}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d4[0]}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "cc", "memory");
-}
-
-inline void mul_3x8_2x8_int32_rhsadd(const std::uint8_t* lhs,
-                                     const std::uint8_t* rhs,
-                                     std::int32_t count, std::int32_t* result,
-                                     std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-      "vmov.i32 q3, q0\n"
-      "vmov.i32 q4, q1\n"
-      "vmov.i32 q5, q2\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d12, d13, d14}, [%[lhs]:64]!\n"
-      "vld1.8 {d15, d16}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q9, d15, d12\n"
-      "vmull.u8 q10, d16, d12\n"
-      "vmull.u8 q11, d15, d13\n"
-      "vmull.u8 q12, d16, d13\n"
-      "vmull.u8 q13, d15, d14\n"
-      "vmull.u8 q14, d16, d14\n"
-      "vpadal.u16 q0, q9\n"
-      "vpadal.u16 q1, q10\n"
-      "vpadal.u16 q2, q11\n"
-      "vpadal.u16 q3, q12\n"
-      "vpadal.u16 q4, q13\n"
-      "vpadal.u16 q5, q14\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d12}, [%[rhs]:64]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-      "vpadd.u32 d6, d6, d7\n"
-      "vpadd.u32 d8, d8, d9\n"
-      "vpadd.u32 d10, d10, d11\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d4, d4, d6\n"
-      "vpadd.u32 d8, d8, d10\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d12\n"
-      "vadd.s32 d4, d4, d12\n"
-      "vadd.s32 d8, d8, d12\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d4}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d8}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "d16", "d18", "d19", "d20", "d21",
-        "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "cc", "memory");
-}
-
-inline void mul_3x8_3x8_int32_rhsadd(const std::uint8_t* lhs,
-                                     const std::uint8_t* rhs,
-                                     std::int32_t count, std::int32_t* result,
-                                     std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-      "vmov.i32 q3, q0\n"
-      "vmov.i32 q4, q1\n"
-      "vmov.i32 q5, q2\n"
-      "vmov.i32 q6, q3\n"
-      "vmov.i32 q7, q4\n"
-      "vmov.i32 q8, q5\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // 3x3 lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d18, d19, d20}, [%[lhs]:64]!\n"
-      "vld1.8 {d21, d22, d23}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q12, d18, d21\n"
-      "vmull.u8 q13, d18, d22\n"
-      "vmull.u8 q14, d18, d23\n"
-      "vmull.u8 q15, d19, d21\n"
-      "vpadal.u16 q0, q12\n"
-      "vpadal.u16 q1, q13\n"
-      "vpadal.u16 q2, q14\n"
-      "vpadal.u16 q3, q15\n"
-      "vmull.u8 q12, d19, d22\n"
-      "vmull.u8 q13, d19, d23\n"
-      "vmull.u8 q14, d20, d21\n"
-      "vmull.u8 q15, d20, d22\n"
-      "vmull.u8 q9, d20, d23\n"
-      "vpadal.u16 q4, q12\n"
-      "vpadal.u16 q5, q13\n"
-      "vpadal.u16 q6, q14\n"
-      "vpadal.u16 q7, q15\n"
-      "vpadal.u16 q8, q9\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {q9}, [%[rhs]:64]\n"
-
-      // Change stride because storing in two ops.
-      "sub %[result_stride], %[result_stride], #8\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-      "vpadd.u32 d6, d6, d7\n"
-      "vpadd.u32 d8, d8, d9\n"
-      "vpadd.u32 d10, d10, d11\n"
-      "vpadd.u32 d12, d12, d13\n"
-      "vpadd.u32 d14, d14, d15\n"
-      "vpadd.u32 d16, d16, d17\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d1, d4, d4\n"
-      "vpadd.u32 d6, d6, d8\n"
-      "vpadd.u32 d7, d10, d10\n"
-      "vpadd.u32 d12, d12, d14\n"
-      "vpadd.u32 d13, d16, d16\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 q0, q0, q9\n"
-      "vadd.s32 q3, q3, q9\n"
-      "vadd.s32 q6, q6, q9\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]]!\n"
-      "vst1.32 {d1[0]}, [%[result]], %[result_stride]\n"
-
-      "vst1.32 {d6}, [%[result]]!\n"
-      "vst1.32 {d7[0]}, [%[result]], %[result_stride]\n"
-
-      "vst1.32 {d12}, [%[result]]!\n"
-      "vst1.32 {d13[0]}, [%[result]], %[result_stride]\n"
-
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
-        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30",
-        "d31", "cc", "memory");
-}
-
-inline void mul_1x8_1x8_int32_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count,
-                                            std::int32_t* result,
-                                            std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d2}, [%[lhs]:64]!\n"
-      "vld1.8 {d3}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q2, d3, d2\n"
-      "vpadal.u16 q0, q2\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[lhs]:64]\n"
-      "vdup.32 d4, d8[0]\n"
-      "vld1.32 {d9}, [%[rhs]:64]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d0\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 d0, d0, d4\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d9\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0[0]}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d8", "d9", "cc", "memory");
-}
-
-inline void mul_1x8_2x8_int32_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count,
-                                            std::int32_t* result,
-                                            std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d4}, [%[lhs]:64]!\n"
-      "vld1.8 {d5, d6}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q4, d5, d4\n"
-      "vmull.u8 q5, d6, d4\n"
-      "vpadal.u16 q0, q4\n"
-      "vpadal.u16 q1, q5\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[lhs]:64]\n"
-      "vdup.32 d4, d8[0]\n"
-      "vld1.32 {d9}, [%[rhs]:64]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 d0, d0, d4\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d9\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "d10", "d11",
-        "cc", "memory");
-}
-
-inline void mul_1x8_3x8_int32_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count,
-                                            std::int32_t* result,
-                                            std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d6}, [%[lhs]:64]!\n"
-      "vld1.8 {d7, d8, d9}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q5, d7, d6\n"
-      "vmull.u8 q6, d8, d6\n"
-      "vmull.u8 q7, d9, d6\n"
-      "vpadal.u16 q0, q5\n"
-      "vpadal.u16 q1, q6\n"
-      "vpadal.u16 q2, q7\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[lhs]:64]\n"
-      "vdup.32 q5, d8[0]\n"
-      "vld1.32 {q6}, [%[rhs]:64]\n"
-
-      // Change stride because storing in two ops.
-      "sub %[result_stride], %[result_stride], #8\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d1, d4, d4\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 q0, q0, q5\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 q0, q0, q6\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]]!\n"
-      "vst1.32 {d1[0]}, [%[result]], %[result_stride]\n"
-
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "cc", "memory");
-}
-
-inline void mul_2x8_1x8_int32_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count,
-                                            std::int32_t* result,
-                                            std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d4, d5}, [%[lhs]:64]!\n"
-      "vld1.8 {d6}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q4, d6, d4\n"
-      "vmull.u8 q5, d6, d5\n"
-      "vpadal.u16 q0, q4\n"
-      "vpadal.u16 q1, q5\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[lhs]:64]\n"
-      "vdup.32 d4, d8[0]\n"
-      "vdup.32 d5, d8[1]\n"
-      "vld1.32 {d9}, [%[rhs]:64]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d0\n"
-      "vpadd.u32 d2, d2, d2\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 d0, d0, d4\n"
-      "vadd.s32 d2, d2, d5\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d9\n"
-      "vadd.s32 d2, d2, d9\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0[0]}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d2[0]}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "d10", "d11",
-        "cc", "memory");
-}
-
-inline void mul_2x8_2x8_int32_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count,
-                                            std::int32_t* result,
-                                            std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-      "vmov.i32 q3, q0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d8, d9}, [%[lhs]:64]!\n"
-      "vld1.8 {d10, d11}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q6, d10, d8\n"
-      "vmull.u8 q7, d11, d8\n"
-      "vmull.u8 q8, d10, d9\n"
-      "vmull.u8 q9, d11, d9\n"
-      "vpadal.u16 q0, q6\n"
-      "vpadal.u16 q1, q7\n"
-      "vpadal.u16 q2, q8\n"
-      "vpadal.u16 q3, q9\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[lhs]:64]\n"
-      "vdup.32 d9, d8[0]\n"
-      "vdup.32 d10, d8[1]\n"
-      "vld1.32 {d11}, [%[rhs]:64]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-      "vpadd.u32 d6, d6, d7\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d4, d4, d6\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 d0, d0, d9\n"
-      "vadd.s32 d4, d4, d10\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d11\n"
-      "vadd.s32 d4, d4, d11\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d4}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "cc",
-        "memory");
-}
-
-inline void mul_2x8_3x8_int32_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count,
-                                            std::int32_t* result,
-                                            std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-      "vmov.i32 q3, q0\n"
-      "vmov.i32 q4, q1\n"
-      "vmov.i32 q5, q2\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d12, d13}, [%[lhs]:64]!\n"
-      "vld1.8 {d14, d15, d16}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q9, d14, d12\n"
-      "vmull.u8 q10, d15, d12\n"
-      "vmull.u8 q11, d16, d12\n"
-      "vmull.u8 q12, d14, d13\n"
-      "vmull.u8 q13, d15, d13\n"
-      "vmull.u8 q14, d16, d13\n"
-      "vpadal.u16 q0, q9\n"
-      "vpadal.u16 q1, q10\n"
-      "vpadal.u16 q2, q11\n"
-      "vpadal.u16 q3, q12\n"
-      "vpadal.u16 q4, q13\n"
-      "vpadal.u16 q5, q14\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d12}, [%[lhs]:64]\n"
-      "vdup.32 q7, d12[0]\n"
-      "vdup.32 q8, d12[1]\n"
-      "vld1.32 {q9}, [%[rhs]:64]\n"
-
-      // Change stride because storing in two ops.
-      "sub %[result_stride], %[result_stride], #8\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-      "vpadd.u32 d6, d6, d7\n"
-      "vpadd.u32 d8, d8, d9\n"
-      "vpadd.u32 d10, d10, d11\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d1, d4, d4\n"
-      "vpadd.u32 d6, d6, d8\n"
-      "vpadd.u32 d7, d10, d10\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 q0, q0, q7\n"
-      "vadd.s32 q3, q3, q8\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 q0, q0, q9\n"
-      "vadd.s32 q3, q3, q9\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]]!\n"
-      "vst1.32 {d1[0]}, [%[result]], %[result_stride]\n"
-
-      "vst1.32 {d6}, [%[result]]!\n"
-      "vst1.32 {d7[0]}, [%[result]], %[result_stride]\n"
-
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
-        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "cc",
-        "memory");
-}
-
-inline void mul_3x8_1x8_int32_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count,
-                                            std::int32_t* result,
-                                            std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d6, d7, d8}, [%[lhs]:64]!\n"
-      "vld1.8 {d9}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q5, d9, d6\n"
-      "vmull.u8 q6, d9, d7\n"
-      "vmull.u8 q7, d9, d8\n"
-      "vpadal.u16 q0, q5\n"
-      "vpadal.u16 q1, q6\n"
-      "vpadal.u16 q2, q7\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {q4}, [%[lhs]:64]\n"
-      "vdup.32 d6, d8[0]\n"
-      "vdup.32 d7, d8[1]\n"
-      "vdup.32 d10, d9[0]\n"
-      "vld1.32 {d11}, [%[rhs]:64]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d0\n"
-      "vpadd.u32 d2, d2, d2\n"
-      "vpadd.u32 d4, d4, d4\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 d0, d0, d6\n"
-      "vadd.s32 d2, d2, d7\n"
-      "vadd.s32 d4, d4, d10\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d11\n"
-      "vadd.s32 d2, d2, d11\n"
-      "vadd.s32 d4, d4, d11\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0[0]}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d2[0]}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d4[0]}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "cc", "memory");
-}
-
-inline void mul_3x8_2x8_int32_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count,
-                                            std::int32_t* result,
-                                            std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-      "vmov.i32 q3, q0\n"
-      "vmov.i32 q4, q1\n"
-      "vmov.i32 q5, q2\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d12, d13, d14}, [%[lhs]:64]!\n"
-      "vld1.8 {d15, d16}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q9, d15, d12\n"
-      "vmull.u8 q10, d16, d12\n"
-      "vmull.u8 q11, d15, d13\n"
-      "vmull.u8 q12, d16, d13\n"
-      "vmull.u8 q13, d15, d14\n"
-      "vmull.u8 q14, d16, d14\n"
-      "vpadal.u16 q0, q9\n"
-      "vpadal.u16 q1, q10\n"
-      "vpadal.u16 q2, q11\n"
-      "vpadal.u16 q3, q12\n"
-      "vpadal.u16 q4, q13\n"
-      "vpadal.u16 q5, q14\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {q6}, [%[lhs]:64]\n"
-      "vdup.32 d14, d12[0]\n"
-      "vdup.32 d15, d12[1]\n"
-      "vdup.32 d16, d13[0]\n"
-      "vld1.32 {d17}, [%[rhs]:64]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-      "vpadd.u32 d6, d6, d7\n"
-      "vpadd.u32 d8, d8, d9\n"
-      "vpadd.u32 d10, d10, d11\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d4, d4, d6\n"
-      "vpadd.u32 d8, d8, d10\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 d0, d0, d14\n"
-      "vadd.s32 d4, d4, d15\n"
-      "vadd.s32 d8, d8, d16\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d17\n"
-      "vadd.s32 d4, d4, d17\n"
-      "vadd.s32 d8, d8, d17\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d4}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d8}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
-        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "cc",
-        "memory");
-}
-
-inline void mul_3x8_3x8_int32_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count,
-                                            std::int32_t* result,
-                                            std::int32_t result_stride) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-      "vmov.i32 q3, q0\n"
-      "vmov.i32 q4, q1\n"
-      "vmov.i32 q5, q2\n"
-      "vmov.i32 q6, q3\n"
-      "vmov.i32 q7, q4\n"
-      "vmov.i32 q8, q5\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // 3x3 lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d18, d19, d20}, [%[lhs]:64]!\n"
-      "vld1.8 {d21, d22, d23}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q12, d18, d21\n"
-      "vmull.u8 q13, d18, d22\n"
-      "vmull.u8 q14, d18, d23\n"
-      "vmull.u8 q15, d19, d21\n"
-      "vpadal.u16 q0, q12\n"
-      "vpadal.u16 q1, q13\n"
-      "vpadal.u16 q2, q14\n"
-      "vpadal.u16 q3, q15\n"
-      "vmull.u8 q12, d19, d22\n"
-      "vmull.u8 q13, d19, d23\n"
-      "vmull.u8 q14, d20, d21\n"
-      "vmull.u8 q15, d20, d22\n"
-      "vmull.u8 q9, d20, d23\n"
-      "vpadal.u16 q4, q12\n"
-      "vpadal.u16 q5, q13\n"
-      "vpadal.u16 q6, q14\n"
-      "vpadal.u16 q7, q15\n"
-      "vpadal.u16 q8, q9\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {q9}, [%[lhs]:64]\n"
-      "vdup.32 q10, d18[0]\n"
-      "vdup.32 q11, d18[1]\n"
-      "vdup.32 q12, d19[0]\n"
-      "vld1.32 {q13}, [%[rhs]:64]\n"
-
-      // Change stride because storing in two ops.
-      "sub %[result_stride], %[result_stride], #8\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-      "vpadd.u32 d6, d6, d7\n"
-      "vpadd.u32 d8, d8, d9\n"
-      "vpadd.u32 d10, d10, d11\n"
-      "vpadd.u32 d12, d12, d13\n"
-      "vpadd.u32 d14, d14, d15\n"
-      "vpadd.u32 d16, d16, d17\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d1, d4, d4\n"
-      "vpadd.u32 d6, d6, d8\n"
-      "vpadd.u32 d7, d10, d10\n"
-      "vpadd.u32 d12, d12, d14\n"
-      "vpadd.u32 d13, d16, d16\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 q0, q0, q10\n"
-      "vadd.s32 q3, q3, q11\n"
-      "vadd.s32 q6, q6, q12\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 q0, q0, q13\n"
-      "vadd.s32 q3, q3, q13\n"
-      "vadd.s32 q6, q6, q13\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]]!\n"
-      "vst1.32 {d1[0]}, [%[result]], %[result_stride]\n"
-
-      "vst1.32 {d6}, [%[result]]!\n"
-      "vst1.32 {d7[0]}, [%[result]], %[result_stride]\n"
-
-      "vst1.32 {d12}, [%[result]]!\n"
-      "vst1.32 {d13[0]}, [%[result]], %[result_stride]\n"
-
-      : [count] "+r"(count), [result_stride] "+r"(result_stride),
-        [rhs] "+r"(rhs), [lhs] "+r"(lhs), [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
-        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30",
-        "d31", "cc", "memory");
-}
-
-inline void mul_1x8_1x8_float_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count, float* result,
-                                            std::int32_t result_stride,
-                                            float result_scale) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d2}, [%[lhs]:64]!\n"
-      "vld1.8 {d3}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q2, d3, d2\n"
-      "vpadal.u16 q0, q2\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[lhs]:64]\n"
-      "vdup.32 d4, d8[0]\n"
-      "vld1.32 {d9}, [%[rhs]:64]\n"
-      "vdup.32 d5, %[result_scale]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d0\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 d0, d0, d4\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d9\n"
-
-      // Convert to float. Multiply by result scale.
-      "vcvt.f32.s32 d0, d0\n"
-      "vmul.f32 d0, d0, d5\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0[0]}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_scale] "+r"(result_scale),
-        [result_stride] "+r"(result_stride), [rhs] "+r"(rhs), [lhs] "+r"(lhs),
-        [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d8", "d9", "cc", "memory");
-}
-
-inline void mul_1x8_2x8_float_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count, float* result,
-                                            std::int32_t result_stride,
-                                            float result_scale) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d4}, [%[lhs]:64]!\n"
-      "vld1.8 {d5, d6}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q4, d5, d4\n"
-      "vmull.u8 q5, d6, d4\n"
-      "vpadal.u16 q0, q4\n"
-      "vpadal.u16 q1, q5\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[lhs]:64]\n"
-      "vdup.32 d4, d8[0]\n"
-      "vld1.32 {d9}, [%[rhs]:64]\n"
-      "vdup.32 d5, %[result_scale]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 d0, d0, d4\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d9\n"
-
-      // Convert to float. Multiply by result scale.
-      "vcvt.f32.s32 d0, d0\n"
-      "vmul.f32 d0, d0, d5\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_scale] "+r"(result_scale),
-        [result_stride] "+r"(result_stride), [rhs] "+r"(rhs), [lhs] "+r"(lhs),
-        [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "d10", "d11",
-        "cc", "memory");
-}
-
-inline void mul_1x8_3x8_float_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count, float* result,
-                                            std::int32_t result_stride,
-                                            float result_scale) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d6}, [%[lhs]:64]!\n"
-      "vld1.8 {d7, d8, d9}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q5, d7, d6\n"
-      "vmull.u8 q6, d8, d6\n"
-      "vmull.u8 q7, d9, d6\n"
-      "vpadal.u16 q0, q5\n"
-      "vpadal.u16 q1, q6\n"
-      "vpadal.u16 q2, q7\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[lhs]:64]\n"
-      "vdup.32 q5, d8[0]\n"
-      "vld1.32 {q6}, [%[rhs]:64]\n"
-      "vdup.32 q7, %[result_scale]\n"
-
-      // Change stride because storing in two ops.
-      "sub %[result_stride], %[result_stride], #8\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d1, d4, d4\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 q0, q0, q5\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 q0, q0, q6\n"
-
-      // Convert to float. Multiply by result scale.
-      "vcvt.f32.s32 q0, q0\n"
-      "vmul.f32 q0, q0, q7\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]]!\n"
-      "vst1.32 {d1[0]}, [%[result]], %[result_stride]\n"
-
-      : [count] "+r"(count), [result_scale] "+r"(result_scale),
-        [result_stride] "+r"(result_stride), [rhs] "+r"(rhs), [lhs] "+r"(lhs),
-        [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "cc", "memory");
-}
-
-inline void mul_2x8_1x8_float_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count, float* result,
-                                            std::int32_t result_stride,
-                                            float result_scale) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d4, d5}, [%[lhs]:64]!\n"
-      "vld1.8 {d6}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q4, d6, d4\n"
-      "vmull.u8 q5, d6, d5\n"
-      "vpadal.u16 q0, q4\n"
-      "vpadal.u16 q1, q5\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[lhs]:64]\n"
-      "vdup.32 d4, d8[0]\n"
-      "vdup.32 d5, d8[1]\n"
-      "vld1.32 {d9}, [%[rhs]:64]\n"
-      "vdup.32 d6, %[result_scale]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d0\n"
-      "vpadd.u32 d2, d2, d2\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 d0, d0, d4\n"
-      "vadd.s32 d2, d2, d5\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d9\n"
-      "vadd.s32 d2, d2, d9\n"
-
-      // Convert to float. Multiply by result scale.
-      "vcvt.f32.s32 d0, d0\n"
-      "vcvt.f32.s32 d2, d2\n"
-      "vmul.f32 d0, d0, d6\n"
-      "vmul.f32 d2, d2, d6\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0[0]}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d2[0]}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_scale] "+r"(result_scale),
-        [result_stride] "+r"(result_stride), [rhs] "+r"(rhs), [lhs] "+r"(lhs),
-        [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d8", "d9", "d10", "d11",
-        "cc", "memory");
-}
-
-inline void mul_2x8_2x8_float_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count, float* result,
-                                            std::int32_t result_stride,
-                                            float result_scale) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-      "vmov.i32 q3, q0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d8, d9}, [%[lhs]:64]!\n"
-      "vld1.8 {d10, d11}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q6, d10, d8\n"
-      "vmull.u8 q7, d11, d8\n"
-      "vmull.u8 q8, d10, d9\n"
-      "vmull.u8 q9, d11, d9\n"
-      "vpadal.u16 q0, q6\n"
-      "vpadal.u16 q1, q7\n"
-      "vpadal.u16 q2, q8\n"
-      "vpadal.u16 q3, q9\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d8}, [%[lhs]:64]\n"
-      "vdup.32 d9, d8[0]\n"
-      "vdup.32 d10, d8[1]\n"
-      "vld1.32 {d11}, [%[rhs]:64]\n"
-      "vdup.32 d12, %[result_scale]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-      "vpadd.u32 d6, d6, d7\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d4, d4, d6\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 d0, d0, d9\n"
-      "vadd.s32 d4, d4, d10\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d11\n"
-      "vadd.s32 d4, d4, d11\n"
-
-      // Convert to float. Multiply by result scale.
-      "vcvt.f32.s32 d0, d0\n"
-      "vcvt.f32.s32 d4, d4\n"
-      "vmul.f32 d0, d0, d12\n"
-      "vmul.f32 d4, d4, d12\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d4}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_scale] "+r"(result_scale),
-        [result_stride] "+r"(result_stride), [rhs] "+r"(rhs), [lhs] "+r"(lhs),
-        [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "cc",
-        "memory");
-}
-
-inline void mul_2x8_3x8_float_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count, float* result,
-                                            std::int32_t result_stride,
-                                            float result_scale) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-      "vmov.i32 q3, q0\n"
-      "vmov.i32 q4, q1\n"
-      "vmov.i32 q5, q2\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d12, d13}, [%[lhs]:64]!\n"
-      "vld1.8 {d14, d15, d16}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q9, d14, d12\n"
-      "vmull.u8 q10, d15, d12\n"
-      "vmull.u8 q11, d16, d12\n"
-      "vmull.u8 q12, d14, d13\n"
-      "vmull.u8 q13, d15, d13\n"
-      "vmull.u8 q14, d16, d13\n"
-      "vpadal.u16 q0, q9\n"
-      "vpadal.u16 q1, q10\n"
-      "vpadal.u16 q2, q11\n"
-      "vpadal.u16 q3, q12\n"
-      "vpadal.u16 q4, q13\n"
-      "vpadal.u16 q5, q14\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {d12}, [%[lhs]:64]\n"
-      "vdup.32 q7, d12[0]\n"
-      "vdup.32 q8, d12[1]\n"
-      "vld1.32 {q9}, [%[rhs]:64]\n"
-      "vdup.32 q10, %[result_scale]\n"
-
-      // Change stride because storing in two ops.
-      "sub %[result_stride], %[result_stride], #8\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-      "vpadd.u32 d6, d6, d7\n"
-      "vpadd.u32 d8, d8, d9\n"
-      "vpadd.u32 d10, d10, d11\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d1, d4, d4\n"
-      "vpadd.u32 d6, d6, d8\n"
-      "vpadd.u32 d7, d10, d10\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 q0, q0, q7\n"
-      "vadd.s32 q3, q3, q8\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 q0, q0, q9\n"
-      "vadd.s32 q3, q3, q9\n"
-
-      // Convert to float. Multiply by result scale.
-      "vcvt.f32.s32 q0, q0\n"
-      "vcvt.f32.s32 q3, q3\n"
-      "vmul.f32 q0, q0, q10\n"
-      "vmul.f32 q3, q3, q10\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]]!\n"
-      "vst1.32 {d1[0]}, [%[result]], %[result_stride]\n"
-
-      "vst1.32 {d6}, [%[result]]!\n"
-      "vst1.32 {d7[0]}, [%[result]], %[result_stride]\n"
-
-      : [count] "+r"(count), [result_scale] "+r"(result_scale),
-        [result_stride] "+r"(result_stride), [rhs] "+r"(rhs), [lhs] "+r"(lhs),
-        [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
-        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "cc",
-        "memory");
-}
-
-inline void mul_3x8_1x8_float_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count, float* result,
-                                            std::int32_t result_stride,
-                                            float result_scale) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d6, d7, d8}, [%[lhs]:64]!\n"
-      "vld1.8 {d9}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q5, d9, d6\n"
-      "vmull.u8 q6, d9, d7\n"
-      "vmull.u8 q7, d9, d8\n"
-      "vpadal.u16 q0, q5\n"
-      "vpadal.u16 q1, q6\n"
-      "vpadal.u16 q2, q7\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {q4}, [%[lhs]:64]\n"
-      "vdup.32 d6, d8[0]\n"
-      "vdup.32 d7, d8[1]\n"
-      "vdup.32 d10, d9[0]\n"
-      "vld1.32 {d11}, [%[rhs]:64]\n"
-      "vdup.32 d12, %[result_scale]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d0\n"
-      "vpadd.u32 d2, d2, d2\n"
-      "vpadd.u32 d4, d4, d4\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 d0, d0, d6\n"
-      "vadd.s32 d2, d2, d7\n"
-      "vadd.s32 d4, d4, d10\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d11\n"
-      "vadd.s32 d2, d2, d11\n"
-      "vadd.s32 d4, d4, d11\n"
-
-      // Convert to float. Multiply by result scale.
-      "vcvt.f32.s32 d0, d0\n"
-      "vcvt.f32.s32 d2, d2\n"
-      "vcvt.f32.s32 d4, d4\n"
-      "vmul.f32 d0, d0, d12\n"
-      "vmul.f32 d2, d2, d12\n"
-      "vmul.f32 d4, d4, d12\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0[0]}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d2[0]}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d4[0]}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_scale] "+r"(result_scale),
-        [result_stride] "+r"(result_stride), [rhs] "+r"(rhs), [lhs] "+r"(lhs),
-        [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "cc", "memory");
-}
-
-inline void mul_3x8_2x8_float_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count, float* result,
-                                            std::int32_t result_stride,
-                                            float result_scale) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-      "vmov.i32 q3, q0\n"
-      "vmov.i32 q4, q1\n"
-      "vmov.i32 q5, q2\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // General NxM lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d12, d13, d14}, [%[lhs]:64]!\n"
-      "vld1.8 {d15, d16}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q9, d15, d12\n"
-      "vmull.u8 q10, d16, d12\n"
-      "vmull.u8 q11, d15, d13\n"
-      "vmull.u8 q12, d16, d13\n"
-      "vmull.u8 q13, d15, d14\n"
-      "vmull.u8 q14, d16, d14\n"
-      "vpadal.u16 q0, q9\n"
-      "vpadal.u16 q1, q10\n"
-      "vpadal.u16 q2, q11\n"
-      "vpadal.u16 q3, q12\n"
-      "vpadal.u16 q4, q13\n"
-      "vpadal.u16 q5, q14\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {q6}, [%[lhs]:64]\n"
-      "vdup.32 d14, d12[0]\n"
-      "vdup.32 d15, d12[1]\n"
-      "vdup.32 d16, d13[0]\n"
-      "vld1.32 {d17}, [%[rhs]:64]\n"
-      "vdup.32 d18, %[result_scale]\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-      "vpadd.u32 d6, d6, d7\n"
-      "vpadd.u32 d8, d8, d9\n"
-      "vpadd.u32 d10, d10, d11\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d4, d4, d6\n"
-      "vpadd.u32 d8, d8, d10\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 d0, d0, d14\n"
-      "vadd.s32 d4, d4, d15\n"
-      "vadd.s32 d8, d8, d16\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 d0, d0, d17\n"
-      "vadd.s32 d4, d4, d17\n"
-      "vadd.s32 d8, d8, d17\n"
-
-      // Convert to float. Multiply by result scale.
-      "vcvt.f32.s32 d0, d0\n"
-      "vcvt.f32.s32 d4, d4\n"
-      "vcvt.f32.s32 d8, d8\n"
-      "vmul.f32 d0, d0, d18\n"
-      "vmul.f32 d4, d4, d18\n"
-      "vmul.f32 d8, d8, d18\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d4}, [%[result]], %[result_stride]\n"
-      "vst1.32 {d8}, [%[result]], %[result_stride]\n"
-      : [count] "+r"(count), [result_scale] "+r"(result_scale),
-        [result_stride] "+r"(result_stride), [rhs] "+r"(rhs), [lhs] "+r"(lhs),
-        [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
-        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "cc",
-        "memory");
-}
-
-inline void mul_3x8_3x8_float_lhsadd_rhsadd(const std::uint8_t* lhs,
-                                            const std::uint8_t* rhs,
-                                            std::int32_t count, float* result,
-                                            std::int32_t result_stride,
-                                            float result_scale) {
-  asm volatile(
-      // Clear aggregators.
-      "vmov.i32 q0, #0\n"
-      "vmov.i32 q1, #0\n"
-      "vmov.i32 q2, #0\n"
-      "vmov.i32 q3, q0\n"
-      "vmov.i32 q4, q1\n"
-      "vmov.i32 q5, q2\n"
-      "vmov.i32 q6, q3\n"
-      "vmov.i32 q7, q4\n"
-      "vmov.i32 q8, q5\n"
-
-      "pld [%[lhs]]\n"
-      "pld [%[rhs]]\n"
-      // 3x3 lanes loop.
-      "1:"
-
-      // Subtract counter.
-      "subs %[count], %[count], #8\n"
-
-      "vld1.8 {d18, d19, d20}, [%[lhs]:64]!\n"
-      "vld1.8 {d21, d22, d23}, [%[rhs]:64]!\n"
-      "pld [%[lhs], #64]\n"
-      "pld [%[rhs], #64]\n"
-      "vmull.u8 q12, d18, d21\n"
-      "vmull.u8 q13, d18, d22\n"
-      "vmull.u8 q14, d18, d23\n"
-      "vmull.u8 q15, d19, d21\n"
-      "vpadal.u16 q0, q12\n"
-      "vpadal.u16 q1, q13\n"
-      "vpadal.u16 q2, q14\n"
-      "vpadal.u16 q3, q15\n"
-      "vmull.u8 q12, d19, d22\n"
-      "vmull.u8 q13, d19, d23\n"
-      "vmull.u8 q14, d20, d21\n"
-      "vmull.u8 q15, d20, d22\n"
-      "vmull.u8 q9, d20, d23\n"
-      "vpadal.u16 q4, q12\n"
-      "vpadal.u16 q5, q13\n"
-      "vpadal.u16 q6, q14\n"
-      "vpadal.u16 q7, q15\n"
-      "vpadal.u16 q8, q9\n"
-
-      // Loop break.
-      "bne 1b\n"
-
-      "vld1.32 {q9}, [%[lhs]:64]\n"
-      "vdup.32 q10, d18[0]\n"
-      "vdup.32 q11, d18[1]\n"
-      "vdup.32 q12, d19[0]\n"
-      "vld1.32 {q13}, [%[rhs]:64]\n"
-      "vdup.32 q14, %[result_scale]\n"
-
-      // Change stride because storing in two ops.
-      "sub %[result_stride], %[result_stride], #8\n"
-
-      // Horizontal reduce aggregators.
-      "vpadd.u32 d0, d0, d1\n"
-      "vpadd.u32 d2, d2, d3\n"
-      "vpadd.u32 d4, d4, d5\n"
-      "vpadd.u32 d6, d6, d7\n"
-      "vpadd.u32 d8, d8, d9\n"
-      "vpadd.u32 d10, d10, d11\n"
-      "vpadd.u32 d12, d12, d13\n"
-      "vpadd.u32 d14, d14, d15\n"
-      "vpadd.u32 d16, d16, d17\n"
-
-      // Reduce rows.
-      "vpadd.u32 d0, d0, d2\n"
-      "vpadd.u32 d1, d4, d4\n"
-      "vpadd.u32 d6, d6, d8\n"
-      "vpadd.u32 d7, d10, d10\n"
-      "vpadd.u32 d12, d12, d14\n"
-      "vpadd.u32 d13, d16, d16\n"
-
-      // Add lhs offsets to aggregated rows.
-      "vadd.s32 q0, q0, q10\n"
-      "vadd.s32 q3, q3, q11\n"
-      "vadd.s32 q6, q6, q12\n"
-
-      // Add rhs offset to aggregated rows.
-      "vadd.s32 q0, q0, q13\n"
-      "vadd.s32 q3, q3, q13\n"
-      "vadd.s32 q6, q6, q13\n"
-
-      // Convert to float. Multiply by result scale.
-      "vcvt.f32.s32 q0, q0\n"
-      "vcvt.f32.s32 q3, q3\n"
-      "vcvt.f32.s32 q6, q6\n"
-      "vmul.f32 q0, q0, q14\n"
-      "vmul.f32 q3, q3, q14\n"
-      "vmul.f32 q6, q6, q14\n"
-
-      // Store reduced rows.
-      "vst1.32 {d0}, [%[result]]!\n"
-      "vst1.32 {d1[0]}, [%[result]], %[result_stride]\n"
-
-      "vst1.32 {d6}, [%[result]]!\n"
-      "vst1.32 {d7[0]}, [%[result]], %[result_stride]\n"
-
-      "vst1.32 {d12}, [%[result]]!\n"
-      "vst1.32 {d13[0]}, [%[result]], %[result_stride]\n"
-
-      : [count] "+r"(count), [result_scale] "+r"(result_scale),
-        [result_stride] "+r"(result_stride), [rhs] "+r"(rhs), [lhs] "+r"(lhs),
-        [result] "+r"(result)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20",
-        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30",
-        "d31", "cc", "memory");
-}
-
-void qnt_1x8_aligned(const std::int32_t* source, std::int32_t count,
-                     std::int32_t stride, const std::int32_t* offsets,
-                     std::uint8_t* destination, std::int32_t destination_stride,
-                     std::int32_t multiplicative_offset,
-                     std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_1_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #1\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8[0]}, [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_2_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #2\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8}, [%[source]:64]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.16 {d12[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_3_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #3\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8}, [%[source]:64]!\n"
-      "vld1.32 {d9[0]}, [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.16 {d12[0]}, [%[destination]]!\n"
-      "vst1.8 {d12[2]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_4_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #4\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8, d9}, [%[source]:64]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.32 {d12[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_5_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #5\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8, d9}, [%[source]:64]!\n"
-      "vld1.32 {d10[0]}, [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.32 {d12[0]}, [%[destination]]!\n"
-      "vst1.8 {d12[4]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_6_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #6\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8, d9, d10}, [%[source]:64]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.32 {d12[0]}, [%[destination]]!\n"
-      "vst1.16 {d12[2]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_7_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #7\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8, d9, d10}, [%[source]:64]!\n"
-      "vld1.32 {d11[0]}, [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.32 {d12[0]}, [%[destination]]!\n"
-      "vst1.16 {d12[2]}, [%[destination]]!\n"
-      "vst1.8 {d12[6]}, [%[destination]]!\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_2x8_aligned(const std::int32_t* source, std::int32_t count,
-                     std::int32_t stride, const std::int32_t* offsets,
-                     std::uint8_t* destination, std::int32_t destination_stride,
-                     std::int32_t multiplicative_offset,
-                     std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]:64]!\n"
-      "vst1.8 {d20}, [r1:64]!\n"
-
-      "bne 1b\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_1_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #1\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]:64]!\n"
-      "vst1.8 {d20}, [r1:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10[0]}, [%[source]]\n"
-      "vld1.32 {d14[0]}, [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18[0]}, [%[destination]]\n"
-      "vst1.8 {d20[0]}, [r1]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_2_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #2\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]:64]!\n"
-      "vst1.8 {d20}, [r1:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10}, [%[source]:64]\n"
-      "vld1.32 {d14}, [r0:64]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.16 {d18[0]}, [%[destination]]\n"
-      "vst1.16 {d20[0]}, [r1]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_3_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #3\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]:64]!\n"
-      "vst1.8 {d20}, [r1:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10}, [%[source]:64]!\n"
-      "vld1.32 {d14}, [r0:64]!\n"
-      "vld1.32 {d11[0]}, [%[source]]\n"
-      "vld1.32 {d15[0]}, [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.16 {d18[0]}, [%[destination]]!\n"
-      "vst1.16 {d20[0]}, [r1]!\n"
-      "vst1.8 {d18[2]}, [%[destination]]\n"
-      "vst1.8 {d20[2]}, [r1]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_4_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #4\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]:64]!\n"
-      "vst1.8 {d20}, [r1:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10, d11}, [%[source]:64]\n"
-      "vld1.32 {d14, d15}, [r0:64]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.32 {d18[0]}, [%[destination]]\n"
-      "vst1.32 {d20[0]}, [r1]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_5_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #5\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]:64]!\n"
-      "vst1.8 {d20}, [r1:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10, d11}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15}, [r0:64]!\n"
-      "vld1.32 {d12[0]}, [%[source]]\n"
-      "vld1.32 {d16[0]}, [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.32 {d18[0]}, [%[destination]]!\n"
-      "vst1.32 {d20[0]}, [r1]!\n"
-      "vst1.8 {d18[4]}, [%[destination]]\n"
-      "vst1.8 {d20[4]}, [r1]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_6_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #6\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]:64]!\n"
-      "vst1.8 {d20}, [r1:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10, d11, d12}, [%[source]:64]\n"
-      "vld1.32 {d14, d15, d16}, [r0:64]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.32 {d18[0]}, [%[destination]]!\n"
-      "vst1.32 {d20[0]}, [r1]!\n"
-      "vst1.16 {d18[2]}, [%[destination]]\n"
-      "vst1.16 {d20[2]}, [r1]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_7_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #7\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]:64]!\n"
-      "vst1.8 {d20}, [r1:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10, d11, d12}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16}, [r0:64]!\n"
-      "vld1.32 {d13[0]}, [%[source]]\n"
-      "vld1.32 {d17[0]}, [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.32 {d18[0]}, [%[destination]]!\n"
-      "vst1.32 {d20[0]}, [r1]!\n"
-      "vst1.16 {d18[2]}, [%[destination]]!\n"
-      "vst1.16 {d20[2]}, [r1]!\n"
-      "vst1.8 {d18[6]}, [%[destination]]!\n"
-      "vst1.8 {d20[6]}, [r1]!\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_3x8_aligned(const std::int32_t* source, std::int32_t count,
-                     std::int32_t stride, const std::int32_t* offsets,
-                     std::uint8_t* destination, std::int32_t destination_stride,
-                     std::int32_t multiplicative_offset,
-                     std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]:64]!\n"
-      "vst1.8 {d26}, [r1:64]!\n"
-      "vst1.8 {d28}, [r3:64]!\n"
-
-      "bne 1b\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_1_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #1\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]:64]!\n"
-      "vst1.8 {d26}, [r1:64]!\n"
-      "vst1.8 {d28}, [r3:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12[0]}, [%[source]]\n"
-      "vld1.32 {d16[0]}, [r0]\n"
-      "vld1.32 {d20[0]}, [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24[0]}, [%[destination]]\n"
-      "vst1.8 {d26[0]}, [r1]\n"
-      "vst1.8 {d28[0]}, [r3]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_2_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #2\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]:64]!\n"
-      "vst1.8 {d26}, [r1:64]!\n"
-      "vst1.8 {d28}, [r3:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12}, [%[source]:64]\n"
-      "vld1.32 {d16}, [r0:64]\n"
-      "vld1.32 {d20}, [r2:64]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.16 {d24[0]}, [%[destination]]\n"
-      "vst1.16 {d26[0]}, [r1]\n"
-      "vst1.16 {d28[0]}, [r3]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_3_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #3\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]:64]!\n"
-      "vst1.8 {d26}, [r1:64]!\n"
-      "vst1.8 {d28}, [r3:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12}, [%[source]:64]!\n"
-      "vld1.32 {d16}, [r0:64]!\n"
-      "vld1.32 {d20}, [r2:64]!\n"
-      "vld1.32 {d13[0]}, [%[source]]\n"
-      "vld1.32 {d17[0]}, [r0]\n"
-      "vld1.32 {d21[0]}, [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.16 {d24[0]}, [%[destination]]!\n"
-      "vst1.16 {d26[0]}, [r1]!\n"
-      "vst1.16 {d28[0]}, [r3]!\n"
-      "vst1.8 {d24[2]}, [%[destination]]\n"
-      "vst1.8 {d26[2]}, [r1]\n"
-      "vst1.8 {d28[2]}, [r3]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_4_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #4\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]:64]!\n"
-      "vst1.8 {d26}, [r1:64]!\n"
-      "vst1.8 {d28}, [r3:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12, d13}, [%[source]:64]\n"
-      "vld1.32 {d16, d17}, [r0:64]\n"
-      "vld1.32 {d20, d21}, [r2:64]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.32 {d24[0]}, [%[destination]]\n"
-      "vst1.32 {d26[0]}, [r1]\n"
-      "vst1.32 {d28[0]}, [r3]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_5_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #5\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]:64]!\n"
-      "vst1.8 {d26}, [r1:64]!\n"
-      "vst1.8 {d28}, [r3:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17}, [r0:64]!\n"
-      "vld1.32 {d20, d21}, [r2:64]!\n"
-      "vld1.32 {d14[0]}, [%[source]]\n"
-      "vld1.32 {d18[0]}, [r0]\n"
-      "vld1.32 {d22[0]}, [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.32 {d24[0]}, [%[destination]]!\n"
-      "vst1.32 {d26[0]}, [r1]!\n"
-      "vst1.32 {d28[0]}, [r3]!\n"
-      "vst1.8 {d24[4]}, [%[destination]]\n"
-      "vst1.8 {d26[4]}, [r1]\n"
-      "vst1.8 {d28[4]}, [r3]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_6_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #6\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]:64]!\n"
-      "vst1.8 {d26}, [r1:64]!\n"
-      "vst1.8 {d28}, [r3:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12, d13, d14}, [%[source]:64]\n"
-      "vld1.32 {d16, d17, d18}, [r0:64]\n"
-      "vld1.32 {d20, d21, d22}, [r2:64]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.32 {d24[0]}, [%[destination]]!\n"
-      "vst1.32 {d26[0]}, [r1]!\n"
-      "vst1.32 {d28[0]}, [r3]!\n"
-      "vst1.16 {d24[2]}, [%[destination]]\n"
-      "vst1.16 {d26[2]}, [r1]\n"
-      "vst1.16 {d28[2]}, [r3]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_7_aligned(const std::int32_t* source, std::int32_t count,
-                       std::int32_t stride, const std::int32_t* offsets,
-                       std::uint8_t* destination,
-                       std::int32_t destination_stride,
-                       std::int32_t multiplicative_offset,
-                       std::int32_t rounding_offset, std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #7\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]:64]!\n"
-      "vst1.8 {d26}, [r1:64]!\n"
-      "vst1.8 {d28}, [r3:64]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12, d13, d14}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22}, [r2:64]!\n"
-      "vld1.32 {d15[0]}, [%[source]]\n"
-      "vld1.32 {d19[0]}, [r0]\n"
-      "vld1.32 {d23[0]}, [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.32 {d24[0]}, [%[destination]]!\n"
-      "vst1.32 {d26[0]}, [r1]!\n"
-      "vst1.32 {d28[0]}, [r3]!\n"
-      "vst1.16 {d24[2]}, [%[destination]]!\n"
-      "vst1.16 {d26[2]}, [r1]!\n"
-      "vst1.16 {d28[2]}, [r3]!\n"
-      "vst1.8 {d24[6]}, [%[destination]]!\n"
-      "vst1.8 {d26[6]}, [r1]!\n"
-      "vst1.8 {d28[6]}, [r3]!\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_1x8(const std::int32_t* source, std::int32_t count,
-             std::int32_t stride, const std::int32_t* offsets,
-             std::uint8_t* destination, std::int32_t destination_stride,
-             std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-             std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]]!\n"
-
-      "bne 1b\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_1(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #1\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8[0]}, [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_2(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #2\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8}, [%[source]:64]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.16 {d12[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_3(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #3\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8}, [%[source]:64]!\n"
-      "vld1.32 {d9[0]}, [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.16 {d12[0]}, [%[destination]]!\n"
-      "vst1.8 {d12[2]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_4(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #4\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8, d9}, [%[source]:64]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.32 {d12[0]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_5(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #5\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8, d9}, [%[source]:64]!\n"
-      "vld1.32 {d10[0]}, [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.32 {d12[0]}, [%[destination]]!\n"
-      "vst1.8 {d12[4]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_6(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #6\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8, d9, d10}, [%[source]:64]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.32 {d12[0]}, [%[destination]]!\n"
-      "vst1.16 {d12[2]}, [%[destination]]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_1x8_7(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "subs %[count], %[count], #7\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d8, d9, d10, d11}, [%[source]:64]!\n"
-      "pld [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.8 {d12}, [%[destination]]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d8, d9, d10}, [%[source]:64]!\n"
-      "vld1.32 {d11[0]}, [%[source]]\n"
-      "vadd.i32 q4, q4, q3\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vmul.i32 q4, q4, q0\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vadd.i32 q4, q4, q1\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vshl.s32 q4, q4, q2\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vqmovn.s32 d12, q4\n"
-      "vqmovn.s32 d13, q5\n"
-      "vqmovun.s16 d12, q6\n"
-      "vst1.32 {d12[0]}, [%[destination]]!\n"
-      "vst1.16 {d12[2]}, [%[destination]]!\n"
-      "vst1.8 {d12[6]}, [%[destination]]!\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
-        "d11", "d12", "d13", "cc", "memory");
-}
-
-void qnt_2x8(const std::int32_t* source, std::int32_t count,
-             std::int32_t stride, const std::int32_t* offsets,
-             std::uint8_t* destination, std::int32_t destination_stride,
-             std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-             std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]]!\n"
-      "vst1.8 {d20}, [r1]!\n"
-
-      "bne 1b\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_1(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #1\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]]!\n"
-      "vst1.8 {d20}, [r1]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10[0]}, [%[source]]\n"
-      "vld1.32 {d14[0]}, [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18[0]}, [%[destination]]\n"
-      "vst1.8 {d20[0]}, [r1]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_2(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #2\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]]!\n"
-      "vst1.8 {d20}, [r1]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10}, [%[source]:64]\n"
-      "vld1.32 {d14}, [r0:64]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.16 {d18[0]}, [%[destination]]\n"
-      "vst1.16 {d20[0]}, [r1]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_3(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #3\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]]!\n"
-      "vst1.8 {d20}, [r1]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10}, [%[source]:64]!\n"
-      "vld1.32 {d14}, [r0:64]!\n"
-      "vld1.32 {d11[0]}, [%[source]]\n"
-      "vld1.32 {d15[0]}, [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.16 {d18[0]}, [%[destination]]!\n"
-      "vst1.16 {d20[0]}, [r1]!\n"
-      "vst1.8 {d18[2]}, [%[destination]]\n"
-      "vst1.8 {d20[2]}, [r1]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_4(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #4\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]]!\n"
-      "vst1.8 {d20}, [r1]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10, d11}, [%[source]:64]\n"
-      "vld1.32 {d14, d15}, [r0:64]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.32 {d18[0]}, [%[destination]]\n"
-      "vst1.32 {d20[0]}, [r1]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_5(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #5\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]]!\n"
-      "vst1.8 {d20}, [r1]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10, d11}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15}, [r0:64]!\n"
-      "vld1.32 {d12[0]}, [%[source]]\n"
-      "vld1.32 {d16[0]}, [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.32 {d18[0]}, [%[destination]]!\n"
-      "vst1.32 {d20[0]}, [r1]!\n"
-      "vst1.8 {d18[4]}, [%[destination]]\n"
-      "vst1.8 {d20[4]}, [r1]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_6(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #6\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]]!\n"
-      "vst1.8 {d20}, [r1]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10, d11, d12}, [%[source]:64]\n"
-      "vld1.32 {d14, d15, d16}, [r0:64]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.32 {d18[0]}, [%[destination]]!\n"
-      "vst1.32 {d20[0]}, [r1]!\n"
-      "vst1.16 {d18[2]}, [%[destination]]\n"
-      "vst1.16 {d20[2]}, [r1]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_2x8_7(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "subs %[count], %[count], #7\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d10, d11, d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16, d17}, [r0:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.8 {d18}, [%[destination]]!\n"
-      "vst1.8 {d20}, [r1]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d10, d11, d12}, [%[source]:64]!\n"
-      "vld1.32 {d14, d15, d16}, [r0:64]!\n"
-      "vld1.32 {d13[0]}, [%[source]]\n"
-      "vld1.32 {d17[0]}, [r0]\n"
-      "vadd.i32 q5, q5, q3\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q4\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vmul.i32 q5, q5, q0\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vadd.i32 q5, q5, q1\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vshl.s32 q5, q5, q2\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vqmovn.s32 d18, q5\n"
-      "vqmovn.s32 d19, q6\n"
-      "vqmovn.s32 d20, q7\n"
-      "vqmovn.s32 d21, q8\n"
-      "vqmovun.s16 d18, q9\n"
-      "vqmovun.s16 d20, q10\n"
-      "vst1.32 {d18[0]}, [%[destination]]!\n"
-      "vst1.32 {d20[0]}, [r1]!\n"
-      "vst1.16 {d18[2]}, [%[destination]]!\n"
-      "vst1.16 {d20[2]}, [r1]!\n"
-      "vst1.8 {d18[6]}, [%[destination]]!\n"
-      "vst1.8 {d20[6]}, [r1]!\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
-        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
-        "d20", "d21", "cc", "memory");
-}
-
-void qnt_3x8(const std::int32_t* source, std::int32_t count,
-             std::int32_t stride, const std::int32_t* offsets,
-             std::uint8_t* destination, std::int32_t destination_stride,
-             std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-             std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]]!\n"
-      "vst1.8 {d26}, [r1]!\n"
-      "vst1.8 {d28}, [r3]!\n"
-
-      "bne 1b\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_1(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #1\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]]!\n"
-      "vst1.8 {d26}, [r1]!\n"
-      "vst1.8 {d28}, [r3]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12[0]}, [%[source]]\n"
-      "vld1.32 {d16[0]}, [r0]\n"
-      "vld1.32 {d20[0]}, [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24[0]}, [%[destination]]\n"
-      "vst1.8 {d26[0]}, [r1]\n"
-      "vst1.8 {d28[0]}, [r3]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_2(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #2\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]]!\n"
-      "vst1.8 {d26}, [r1]!\n"
-      "vst1.8 {d28}, [r3]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12}, [%[source]:64]\n"
-      "vld1.32 {d16}, [r0:64]\n"
-      "vld1.32 {d20}, [r2:64]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.16 {d24[0]}, [%[destination]]\n"
-      "vst1.16 {d26[0]}, [r1]\n"
-      "vst1.16 {d28[0]}, [r3]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_3(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #3\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]]!\n"
-      "vst1.8 {d26}, [r1]!\n"
-      "vst1.8 {d28}, [r3]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12}, [%[source]:64]!\n"
-      "vld1.32 {d16}, [r0:64]!\n"
-      "vld1.32 {d20}, [r2:64]!\n"
-      "vld1.32 {d13[0]}, [%[source]]\n"
-      "vld1.32 {d17[0]}, [r0]\n"
-      "vld1.32 {d21[0]}, [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.16 {d24[0]}, [%[destination]]!\n"
-      "vst1.16 {d26[0]}, [r1]!\n"
-      "vst1.16 {d28[0]}, [r3]!\n"
-      "vst1.8 {d24[2]}, [%[destination]]\n"
-      "vst1.8 {d26[2]}, [r1]\n"
-      "vst1.8 {d28[2]}, [r3]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_4(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #4\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]]!\n"
-      "vst1.8 {d26}, [r1]!\n"
-      "vst1.8 {d28}, [r3]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12, d13}, [%[source]:64]\n"
-      "vld1.32 {d16, d17}, [r0:64]\n"
-      "vld1.32 {d20, d21}, [r2:64]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.32 {d24[0]}, [%[destination]]\n"
-      "vst1.32 {d26[0]}, [r1]\n"
-      "vst1.32 {d28[0]}, [r3]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_5(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #5\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]]!\n"
-      "vst1.8 {d26}, [r1]!\n"
-      "vst1.8 {d28}, [r3]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12, d13}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17}, [r0:64]!\n"
-      "vld1.32 {d20, d21}, [r2:64]!\n"
-      "vld1.32 {d14[0]}, [%[source]]\n"
-      "vld1.32 {d18[0]}, [r0]\n"
-      "vld1.32 {d22[0]}, [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.32 {d24[0]}, [%[destination]]!\n"
-      "vst1.32 {d26[0]}, [r1]!\n"
-      "vst1.32 {d28[0]}, [r3]!\n"
-      "vst1.8 {d24[4]}, [%[destination]]\n"
-      "vst1.8 {d26[4]}, [r1]\n"
-      "vst1.8 {d28[4]}, [r3]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_6(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #6\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]]!\n"
-      "vst1.8 {d26}, [r1]!\n"
-      "vst1.8 {d28}, [r3]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12, d13, d14}, [%[source]:64]\n"
-      "vld1.32 {d16, d17, d18}, [r0:64]\n"
-      "vld1.32 {d20, d21, d22}, [r2:64]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.32 {d24[0]}, [%[destination]]!\n"
-      "vst1.32 {d26[0]}, [r1]!\n"
-      "vst1.32 {d28[0]}, [r3]!\n"
-      "vst1.16 {d24[2]}, [%[destination]]\n"
-      "vst1.16 {d26[2]}, [r1]\n"
-      "vst1.16 {d28[2]}, [r3]\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
-
-void qnt_3x8_7(const std::int32_t* source, std::int32_t count,
-               std::int32_t stride, const std::int32_t* offsets,
-               std::uint8_t* destination, std::int32_t destination_stride,
-               std::int32_t multiplicative_offset, std::int32_t rounding_offset,
-               std::int32_t shift) {
-  asm volatile(
-      "vdup.32 q0, %[multiplicative_offset]\n"
-      "vdup.32 q1, %[rounding_offset]\n"
-      "vdup.32 q2, %[shift]\n"
-      "vld1.32 {d6[], d7[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d8[], d9[]}, [%[offsets]:32]!\n"
-      "vld1.32 {d10[], d11[]}, [%[offsets]:32]!\n"
-      "add r0, %[source], %[stride]\n"
-      "add r1, %[destination], %[destination_stride]\n"
-      "add r2, r0, %[stride]\n"
-      "add r3, r1, %[destination_stride]\n"
-      "subs %[count], %[count], #7\n"
-      "beq 2f\n"
-
-      "1:"
-      "subs %[count], %[count], #8\n"
-      "vld1.32 {d12, d13, d14, d15}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18, d19}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22, d23}, [r2:64]!\n"
-      "pld [%[source]]\n"
-      "pld [r0]\n"
-      "pld [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.8 {d24}, [%[destination]]!\n"
-      "vst1.8 {d26}, [r1]!\n"
-      "vst1.8 {d28}, [r3]!\n"
-
-      "bne 1b\n"
-      "2:"
-      "vld1.32 {d12, d13, d14}, [%[source]:64]!\n"
-      "vld1.32 {d16, d17, d18}, [r0:64]!\n"
-      "vld1.32 {d20, d21, d22}, [r2:64]!\n"
-      "vld1.32 {d15[0]}, [%[source]]\n"
-      "vld1.32 {d19[0]}, [r0]\n"
-      "vld1.32 {d23[0]}, [r2]\n"
-      "vadd.i32 q6, q6, q3\n"
-      "vadd.i32 q7, q7, q3\n"
-      "vadd.i32 q8, q8, q4\n"
-      "vadd.i32 q9, q9, q4\n"
-      "vadd.i32 q10, q10, q5\n"
-      "vadd.i32 q11, q11, q5\n"
-      "vmul.i32 q6, q6, q0\n"
-      "vmul.i32 q7, q7, q0\n"
-      "vmul.i32 q8, q8, q0\n"
-      "vmul.i32 q9, q9, q0\n"
-      "vmul.i32 q10, q10, q0\n"
-      "vmul.i32 q11, q11, q0\n"
-      "vadd.i32 q6, q6, q1\n"
-      "vadd.i32 q7, q7, q1\n"
-      "vadd.i32 q8, q8, q1\n"
-      "vadd.i32 q9, q9, q1\n"
-      "vadd.i32 q10, q10, q1\n"
-      "vadd.i32 q11, q11, q1\n"
-      "vshl.s32 q6, q6, q2\n"
-      "vshl.s32 q7, q7, q2\n"
-      "vshl.s32 q8, q8, q2\n"
-      "vshl.s32 q9, q9, q2\n"
-      "vshl.s32 q10, q10, q2\n"
-      "vshl.s32 q11, q11, q2\n"
-      "vqmovn.s32 d24, q6\n"
-      "vqmovn.s32 d25, q7\n"
-      "vqmovn.s32 d26, q8\n"
-      "vqmovn.s32 d27, q9\n"
-      "vqmovn.s32 d28, q10\n"
-      "vqmovn.s32 d29, q11\n"
-      "vqmovun.s16 d24, q12\n"
-      "vqmovun.s16 d26, q13\n"
-      "vqmovun.s16 d28, q14\n"
-      "vst1.32 {d24[0]}, [%[destination]]!\n"
-      "vst1.32 {d26[0]}, [r1]!\n"
-      "vst1.32 {d28[0]}, [r3]!\n"
-      "vst1.16 {d24[2]}, [%[destination]]!\n"
-      "vst1.16 {d26[2]}, [r1]!\n"
-      "vst1.16 {d28[2]}, [r3]!\n"
-      "vst1.8 {d24[6]}, [%[destination]]!\n"
-      "vst1.8 {d26[6]}, [r1]!\n"
-      "vst1.8 {d28[6]}, [r3]!\n"
-      : [count] "+r"(count),
-        [multiplicative_offset] "+r"(multiplicative_offset),
-        [stride] "+r"(stride), [shift] "+r"(shift),
-        [destination] "+r"(destination), [offsets] "+r"(offsets),
-        [source] "+r"(source), [destination_stride] "+r"(destination_stride),
-        [rounding_offset] "+r"(rounding_offset)
-      :
-      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
-        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
-        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
-        "d28", "d29", "cc", "memory");
-}
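
For readers skimming the removal, every qnt_* kernel above performs the same per-value arithmetic; the sketch below is illustrative only and not part of the diff (`quantize_one` is a made-up name). It mirrors one NEON lane: add the per-row offset, multiply by multiplicative_offset, add rounding_offset, shift with vshl.s32 (the callers pass a negative shift, i.e. an arithmetic right shift), then saturate to uint8 via vqmovn.s32/vqmovun.s16.

```
#include <algorithm>
#include <cstdint>

// Hypothetical scalar equivalent of one lane of the qnt_* kernels. The NEON
// code works in 32-bit lanes; a 64-bit intermediate is used here only to keep
// the sketch free of signed-overflow concerns.
inline std::uint8_t quantize_one(std::int32_t value, std::int32_t row_offset,
                                 std::int32_t multiplicative_offset,
                                 std::int32_t rounding_offset,
                                 std::int32_t shift) {
  std::int64_t x = static_cast<std::int64_t>(value) + row_offset;
  x = x * multiplicative_offset + rounding_offset;
  // vshl.s32 with a negative count is an arithmetic right shift.
  x = (shift >= 0) ? (x << shift) : (x >> -shift);
  // vqmovn.s32 followed by vqmovun.s16 amounts to clamping into [0, 255].
  return static_cast<std::uint8_t>(
      std::min<std::int64_t>(255, std::max<std::int64_t>(0, x)));
}
```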
-
-void multi_qnt_1x8_aligned(const std::int32_t* source, std::int32_t count,
-                           std::int32_t stride, const std::int32_t* offsets,
-                           std::uint8_t* destination,
-                           std::int32_t destination_stride,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t rounding_offset, std::int32_t shift) {
-  switch (count % 8) {
-    case 0:
-      qnt_1x8_aligned(source, count, stride, offsets, destination,
-                      destination_stride, multiplicative_offset,
-                      rounding_offset, shift);
-      break;
-    case 1:
-      qnt_1x8_1_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 2:
-      qnt_1x8_2_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 3:
-      qnt_1x8_3_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 4:
-      qnt_1x8_4_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 5:
-      qnt_1x8_5_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 6:
-      qnt_1x8_6_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 7:
-      qnt_1x8_7_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-  }
-}
-
-void multi_qnt_2x8_aligned(const std::int32_t* source, std::int32_t count,
-                           std::int32_t stride, const std::int32_t* offsets,
-                           std::uint8_t* destination,
-                           std::int32_t destination_stride,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t rounding_offset, std::int32_t shift) {
-  switch (count % 8) {
-    case 0:
-      qnt_2x8_aligned(source, count, stride, offsets, destination,
-                      destination_stride, multiplicative_offset,
-                      rounding_offset, shift);
-      break;
-    case 1:
-      qnt_2x8_1_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 2:
-      qnt_2x8_2_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 3:
-      qnt_2x8_3_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 4:
-      qnt_2x8_4_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 5:
-      qnt_2x8_5_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 6:
-      qnt_2x8_6_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 7:
-      qnt_2x8_7_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-  }
-}
-
-void multi_qnt_3x8_aligned(const std::int32_t* source, std::int32_t count,
-                           std::int32_t stride, const std::int32_t* offsets,
-                           std::uint8_t* destination,
-                           std::int32_t destination_stride,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t rounding_offset, std::int32_t shift) {
-  switch (count % 8) {
-    case 0:
-      qnt_3x8_aligned(source, count, stride, offsets, destination,
-                      destination_stride, multiplicative_offset,
-                      rounding_offset, shift);
-      break;
-    case 1:
-      qnt_3x8_1_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 2:
-      qnt_3x8_2_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 3:
-      qnt_3x8_3_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 4:
-      qnt_3x8_4_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 5:
-      qnt_3x8_5_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 6:
-      qnt_3x8_6_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-    case 7:
-      qnt_3x8_7_aligned(source, count, stride, offsets, destination,
-                        destination_stride, multiplicative_offset,
-                        rounding_offset, shift);
-      break;
-  }
-}
-
-void multi_qnt_1x8(const std::int32_t* source, std::int32_t count,
-                   std::int32_t stride, const std::int32_t* offsets,
-                   std::uint8_t* destination, std::int32_t destination_stride,
-                   std::int32_t multiplicative_offset,
-                   std::int32_t rounding_offset, std::int32_t shift) {
-  switch (count % 8) {
-    case 0:
-      qnt_1x8(source, count, stride, offsets, destination, destination_stride,
-              multiplicative_offset, rounding_offset, shift);
-      break;
-    case 1:
-      qnt_1x8_1(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 2:
-      qnt_1x8_2(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 3:
-      qnt_1x8_3(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 4:
-      qnt_1x8_4(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 5:
-      qnt_1x8_5(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 6:
-      qnt_1x8_6(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 7:
-      qnt_1x8_7(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-  }
-}
-
-void multi_qnt_2x8(const std::int32_t* source, std::int32_t count,
-                   std::int32_t stride, const std::int32_t* offsets,
-                   std::uint8_t* destination, std::int32_t destination_stride,
-                   std::int32_t multiplicative_offset,
-                   std::int32_t rounding_offset, std::int32_t shift) {
-  switch (count % 8) {
-    case 0:
-      qnt_2x8(source, count, stride, offsets, destination, destination_stride,
-              multiplicative_offset, rounding_offset, shift);
-      break;
-    case 1:
-      qnt_2x8_1(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 2:
-      qnt_2x8_2(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 3:
-      qnt_2x8_3(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 4:
-      qnt_2x8_4(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 5:
-      qnt_2x8_5(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 6:
-      qnt_2x8_6(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 7:
-      qnt_2x8_7(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-  }
-}
-
-void multi_qnt_3x8(const std::int32_t* source, std::int32_t count,
-                   std::int32_t stride, const std::int32_t* offsets,
-                   std::uint8_t* destination, std::int32_t destination_stride,
-                   std::int32_t multiplicative_offset,
-                   std::int32_t rounding_offset, std::int32_t shift) {
-  switch (count % 8) {
-    case 0:
-      qnt_3x8(source, count, stride, offsets, destination, destination_stride,
-              multiplicative_offset, rounding_offset, shift);
-      break;
-    case 1:
-      qnt_3x8_1(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 2:
-      qnt_3x8_2(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 3:
-      qnt_3x8_3(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 4:
-      qnt_3x8_4(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 5:
-      qnt_3x8_5(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 6:
-      qnt_3x8_6(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-    case 7:
-      qnt_3x8_7(source, count, stride, offsets, destination, destination_stride,
-                multiplicative_offset, rounding_offset, shift);
-      break;
-  }
-}
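
The multi_qnt_* helpers above only select the specialization whose unrolled tail matches `count % 8`. As a hypothetical scalar reference (assuming, as the assembly does, that both strides are byte counts added directly to the pointers), the 3-row case behaves like the plain loop below; `multi_qnt_3x8_reference` is an illustrative name, not a function from this diff.

```
#include <algorithm>
#include <cstdint>

// Hypothetical scalar reference for multi_qnt_3x8: quantize `count` int32
// values in each of 3 rows. Both strides are in bytes, matching how the
// assembly adds them directly to the source/destination pointers.
void multi_qnt_3x8_reference(const std::int32_t* source, std::int32_t count,
                             std::int32_t stride, const std::int32_t* offsets,
                             std::uint8_t* destination,
                             std::int32_t destination_stride,
                             std::int32_t multiplicative_offset,
                             std::int32_t rounding_offset, std::int32_t shift) {
  for (int row = 0; row < 3; ++row) {
    const std::int32_t* src = reinterpret_cast<const std::int32_t*>(
        reinterpret_cast<const std::uint8_t*>(source) + row * stride);
    std::uint8_t* dst = destination + row * destination_stride;
    for (std::int32_t i = 0; i < count; ++i) {
      std::int64_t x = static_cast<std::int64_t>(src[i]) + offsets[row];
      x = x * multiplicative_offset + rounding_offset;
      x = (shift >= 0) ? (x << shift) : (x >> -shift);  // negative = right shift
      dst[i] = static_cast<std::uint8_t>(
          std::min<std::int64_t>(255, std::max<std::int64_t>(0, x)));
    }
  }
}
```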
-
-void gemm_q8_0_0_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
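
The scratch pointer in gemm_q8_0_0_0_aligned is carved into three regions whose offsets are computed at the top of the function: a zipped 3-row LHS chunk (its data followed by three per-row offset words), the zipped RHS for all n columns, and a 3-row int32 temporary that holds the unquantized products before multi_qnt_3x8_aligned writes the final uint8 result. The helper below is a hypothetical tally of that same layout (the name and the 3-row temporary size are assumptions, not taken from the diff).

```
#include <cstdint>

// Illustrative only: scratch bytes implied by the offsets computed in
// gemm_q8_0_0_0_aligned above.
std::int32_t gemm_q8_3x3_scratch_bytes(std::int32_t n, std::int32_t k) {
  const std::int32_t padded_k = ((k + 7) / 8) * 8;
  const std::int32_t zipped_lhs_bytes = (padded_k + 16) * 3;  // zipped_lhs chunk
  const std::int32_t zipped_rhs_bytes = (padded_k + 16) * n;  // zipped_rhs
  const std::int32_t temp_row_bytes = ((n * 4 + 7) / 8) * 8;  // mul_result_chunk_stride_bytes
  return zipped_lhs_bytes + zipped_rhs_bytes + 3 * temp_row_bytes;  // + temp_result
}
```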
-
-void gemm_q8_0_0_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_1_0_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_1_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                           const std::uint8_t* rhs, std::int32_t m,
-                           std::int32_t n, std::int32_t k,
-                           std::int32_t lhs_offset, std::int32_t rhs_offset,
-                           std::int32_t result_offset,
-                           std::int32_t multiplicative_offset,
-                           std::int32_t shift, std::uint8_t* result,
-                           std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                          zipped_lhs_3_offsets, result_chunk, result_stride,
-                          multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8_aligned(temp_result, n, mul_result_chunk_stride_bytes,
-                        zipped_lhs_2_offsets, result_chunk, result_stride,
-                        multiplicative_offset, rounding_offset, -shift);
-}
-
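-// gemm_q8_0_0_* variants: m and n are exact multiples of 3, so only the 3x8
-// packing, multiply and requantization kernels are needed; the trailing
-// digit picks the zip_3x8_<K> routine for the depth remainder (the plain
-// zip_3x8 when there is none).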
-void gemm_q8_0_0_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_0_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
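-// gemm_q8_0_1_* variants: one leftover RHS column, packed with the zip_1x8*
-// routines and multiplied with mul_3x8_1x8_int32_rhsadd after the full 3x8
-// column chunks; rows still come in complete chunks of 3.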
-void gemm_q8_0_1_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_1_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
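-// gemm_q8_0_2_* variants: two leftover RHS columns, packed with the zip_2x8*
-// routines and multiplied with mul_3x8_2x8_int32_rhsadd after the full 3x8
-// column chunks; rows again come in complete chunks of 3.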
-void gemm_q8_0_2_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_q8_0_2_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
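-// gemm_q8_1_0_<D> variants: one leftover LHS row (m % 3 == 1) and no leftover
-// RHS columns (n % 3 == 0). After the 3-row chunks, the final single row is
-// packed with the matching zip_1x8 variant, multiplied with
-// mul_1x8_3x8_int32_rhsadd, and requantized with multi_qnt_1x8 using the
-// per-row offsets stored right after the packed LHS row.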
-void gemm_q8_1_0_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_0_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
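-// gemm_q8_1_1_<D> variants: leftovers on both sides (m % 3 == 1, n % 3 == 1).
-// Each full 3-row pass ends with mul_3x8_1x8 against the single leftover RHS
-// chunk, and the final leftover LHS row finishes with mul_1x8_3x8 plus the 1x1
-// corner handled by mul_1x8_1x8_int32_rhsadd.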
-void gemm_q8_1_1_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_1_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_1_2_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_1x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_1_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_0_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_1_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_q8_2_2_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                   const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                   std::int32_t k, std::int32_t lhs_offset,
-                   std::int32_t rhs_offset, std::int32_t result_offset,
-                   std::int32_t multiplicative_offset, std::int32_t shift,
-                   std::uint8_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k + result_offset;
-  const std::int32_t rounding_offset = (1 << (shift - 1));
-  std::int32_t* temp_result = reinterpret_cast<std::int32_t*>(
-      scratch + zipped_chunk_size + zipped_rhs_size);
-  std::uint8_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = temp_result;
-  const std::int32_t mul_result_chunk_stride_bytes = ((n * 4 + 7) / 8) * 8;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = temp_result;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                               mul_result_chunk, mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    multi_qnt_3x8(temp_result, n, mul_result_chunk_stride_bytes,
-                  zipped_lhs_3_offsets, result_chunk, result_stride,
-                  multiplicative_offset, rounding_offset, -shift);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = temp_result;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                             mul_result_chunk, mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                           mul_result_chunk, mul_result_chunk_stride_bytes);
-  multi_qnt_2x8(temp_result, n, mul_result_chunk_stride_bytes,
-                zipped_lhs_2_offsets, result_chunk, result_stride,
-                multiplicative_offset, rounding_offset, -shift);
-}
-
-void gemm_i32_0_0_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_1_0_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_1_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_0_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_1_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                            const std::uint8_t* rhs, std::int32_t m,
-                            std::int32_t n, std::int32_t k,
-                            std::int32_t lhs_offset, std::int32_t rhs_offset,
-                            std::int32_t* result, std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_0_0_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_0_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_1_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_0_2_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_i32_1_0_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_0_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_1_1_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_1_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_1_2_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_0_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_0_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_i32_2_1_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_1_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_i32_2_2_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, std::int32_t* result,
-                    std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  std::int32_t* result_chunk = result;
-  std::int32_t* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                      mul_result_chunk,
-                                      mul_result_chunk_stride_bytes);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                    mul_result_chunk,
-                                    mul_result_chunk_stride_bytes);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_int32_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes);
-}
-
-void gemm_f_0_0_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_1_0_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_1_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_0_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_1_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_0_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_1_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_2_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_3_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_4_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_5_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_6_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_7_aligned(std::uint8_t* scratch, const std::uint8_t* lhs,
-                          const std::uint8_t* rhs, std::int32_t m,
-                          std::int32_t n, std::int32_t k,
-                          std::int32_t lhs_offset, std::int32_t rhs_offset,
-                          float result_scale, float* result,
-                          std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7_aligned(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7_aligned(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_0_0_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_0_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_1_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_0_2_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-}
-
-void gemm_f_1_0_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_0_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_1_1_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_1_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_1_2_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_1_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 1);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_1x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_1x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_1x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_0_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_0_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-}
-
-void gemm_f_2_1_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_1_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_1x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_1x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_1x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_0(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_1(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_1(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_1(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_2(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_2(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_2(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_3(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_3(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_3(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_4(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_4(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_4(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_5(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_5(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_5(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_6(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_6(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_6(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
-}
-
-void gemm_f_2_2_7(std::uint8_t* scratch, const std::uint8_t* lhs,
-                  const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                  std::int32_t k, std::int32_t lhs_offset,
-                  std::int32_t rhs_offset, float result_scale, float* result,
-                  std::int32_t result_stride) {
-  const std::int32_t row_chunks = m / 3;
-  const std::int32_t col_chunks = n / 3;
-  const std::int32_t padded_k = ((k + 7) / 8) * 8;
-  const std::int32_t chunk_size = k * 3;
-  const std::int32_t zipped_chunk_size = (padded_k + 16) * 3;
-  const std::int32_t zipped_rhs_size = (padded_k + 16) * n;
-  const std::uint8_t* lhs_chunk = lhs;
-  const std::uint8_t* rhs_chunk = rhs;
-  std::uint8_t* zipped_lhs = scratch;
-  std::int32_t* zipped_lhs_3_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 3);
-  std::int32_t* zipped_lhs_2_offsets =
-      reinterpret_cast<std::int32_t*>(zipped_lhs + padded_k * 2);
-  std::uint8_t* zipped_rhs = scratch + zipped_chunk_size;
-  std::uint8_t* zipped_rhs_chunk = zipped_rhs;
-  const std::int32_t result_chunk_stride = result_stride * 3;
-
-  const std::int32_t const_offset = lhs_offset * rhs_offset * k;
-  float* result_chunk = result;
-  float* mul_result_chunk = result;
-  const std::int32_t mul_result_chunk_stride_bytes = result_stride * 4;
-
-  for (int i = 0; i < col_chunks; ++i) {
-    zip_3x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-    rhs_chunk += chunk_size;
-    zipped_rhs_chunk += zipped_chunk_size;
-  }
-  zip_2x8_7(rhs_chunk, k, k, zipped_rhs_chunk, lhs_offset, 0);
-
-  for (int i = 0; i < row_chunks; ++i) {
-    zip_3x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-    zipped_rhs_chunk = zipped_rhs;
-    mul_result_chunk = result_chunk;
-    for (int j = 0; j < col_chunks; ++j) {
-      mul_3x8_3x8_float_lhsadd_rhsadd(
-          zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-          mul_result_chunk_stride_bytes, result_scale);
-      zipped_rhs_chunk += zipped_chunk_size;
-      mul_result_chunk += 3;
-    }
-    mul_3x8_2x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    lhs_chunk += chunk_size;
-    result_chunk += result_chunk_stride;
-  }
-
-  zip_2x8_7(lhs_chunk, k, k, zipped_lhs, rhs_offset, const_offset);
-  zipped_rhs_chunk = zipped_rhs;
-  mul_result_chunk = result_chunk;
-  for (int j = 0; j < col_chunks; ++j) {
-    mul_2x8_3x8_float_lhsadd_rhsadd(
-        zipped_lhs, zipped_rhs_chunk, padded_k, mul_result_chunk,
-        mul_result_chunk_stride_bytes, result_scale);
-    zipped_rhs_chunk += zipped_chunk_size;
-    mul_result_chunk += 3;
-  }
-  mul_2x8_2x8_float_lhsadd_rhsadd(zipped_lhs, zipped_rhs_chunk, padded_k,
-                                  mul_result_chunk,
-                                  mul_result_chunk_stride_bytes, result_scale);
+  task_params->result =
+      StreamUtil<typename Params::OutType, typename Params::OutputStream>::
+          Offset(params.fused_kernel.output_stream, params.result, m_offset,
+                 n_offset);
 }
 
 }  // namespace internal
 
-void gemm_q8_strided(std::uint8_t* scratch, const std::uint8_t* lhs,
-                     const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                     std::int32_t k, std::int32_t lhs_offset,
-                     std::int32_t rhs_offset, std::int32_t result_offset,
-                     std::int32_t multiplicative_offset, std::int32_t shift,
-                     std::uint8_t* result, std::int32_t result_stride) {
-  const bool lhs_aligned = ((reinterpret_cast<std::uintptr_t>(lhs) % 8) == 0);
-  const bool rhs_aligned = ((reinterpret_cast<std::uintptr_t>(rhs) % 8) == 0);
-  const bool k_aligned = ((k % 8) == 0);
-  const bool result_aligned =
-      ((reinterpret_cast<std::uintptr_t>(result) % 8) == 0);
-  const bool result_stride_aligned = ((result_stride % 8) == 0);
-  const bool aligned = lhs_aligned && rhs_aligned && result_aligned &&
-                       k_aligned && result_stride_aligned;
-  if (aligned) {
-    switch (m % 3) {
-      case 0:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_0_0_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_0_0_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_0_0_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_0_0_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_0_0_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_0_0_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_0_0_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_0_0_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_0_1_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_0_1_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_0_1_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_0_1_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_0_1_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_0_1_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_0_1_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_0_1_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_0_2_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_0_2_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_0_2_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_0_2_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_0_2_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_0_2_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_0_2_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_0_2_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-      case 1:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_1_0_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_1_0_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_1_0_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_1_0_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_1_0_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_1_0_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_1_0_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_1_0_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_1_1_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_1_1_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_1_1_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_1_1_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_1_1_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_1_1_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_1_1_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_1_1_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_1_2_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_1_2_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_1_2_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_1_2_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_1_2_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_1_2_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_1_2_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_1_2_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-      case 2:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_2_0_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_2_0_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_2_0_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_2_0_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_2_0_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_2_0_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_2_0_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_2_0_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_2_1_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_2_1_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_2_1_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_2_1_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_2_1_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_2_1_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_2_1_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_2_1_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_2_2_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_2_2_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_2_2_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_2_2_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_2_2_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_2_2_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_2_2_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_2_2_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_offset, multiplicative_offset, shift, result,
-                    result_stride);
-                break;
-            }
-            break;
-        }
-        break;
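+// Variant of GemmExecutorPackRHS that splits the RHS along the n dimension
+// into column chunks, sized so that the packed data handled at a time is
+// intended to fit within cache_size bytes; each chunk is delegated to
+// GemmExecutorPackRHS.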
+template <int cache_size = 256 * 1024>
+class GemmExecutorPackRHSCacheFriendly {
+ public:
+  template <typename P>
+  static int EstimateScratchSize(const P& params, int kernel_m, int kernel_n,
+                                 int kernel_k) {
+    return cache_size;
+  }
+
+  template <typename P, int m, int n, int k, int m_leftovers, int n_leftovers,
+            int k_leftovers>
+  static void ExecuteDispatch3D(const P& params) {
+    typedef Stream<typename P::InType, m, k, k_leftovers,
+                   typename P::LeftStream>
+        LeftStream;
+
+    typedef Stream<typename P::InType, n, k, k_leftovers,
+                   typename P::RightStream>
+        RightStream;
+
+    const int lhs_scratch = LeftStream::Scratch(params.left_stream);
+    const int rhs_scratch = RightStream::Scratch(params.right_stream);
+
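+    // How many column-wise sub-tasks to split params.n into; the sizing
+    // heuristic, based on the packed LHS/RHS scratch sizes, lives in
+    // CalculateCacheFriendlyTasksCount.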
+    const int cache_friendly_tasks_count =
+        internal::CalculateCacheFriendlyTasksCount(cache_size, lhs_scratch,
+                                                   rhs_scratch, params.n, n);
+
+    if (cache_friendly_tasks_count == 1) {
+      GemmExecutorPackRHS::ExecuteDispatch3D<P, m, n, k, m_leftovers,
+                                             n_leftovers, k_leftovers>(params);
+      return;
     }
-  } else {
-    switch (m % 3) {
-      case 0:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_0_0_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_0_0_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_0_0_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_0_0_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_0_0_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_0_0_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_0_0_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_0_0_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_0_1_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_0_1_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_0_1_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_0_1_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_0_1_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_0_1_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_0_1_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_0_1_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_0_2_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_0_2_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_0_2_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_0_2_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_0_2_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_0_2_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_0_2_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_0_2_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-      case 1:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_1_0_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_1_0_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_1_0_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_1_0_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_1_0_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_1_0_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_1_0_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_1_0_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_1_1_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_1_1_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_1_1_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_1_1_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_1_1_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_1_1_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_1_1_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_1_1_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_1_2_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_1_2_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_1_2_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_1_2_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_1_2_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_1_2_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_1_2_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_1_2_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-      case 2:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_2_0_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_2_0_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_2_0_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_2_0_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_2_0_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_2_0_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_2_0_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_2_0_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_2_1_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_2_1_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_2_1_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_2_1_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_2_1_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_2_1_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_2_1_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_2_1_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_q8_2_2_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 1:
-                internal::gemm_q8_2_2_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 2:
-                internal::gemm_q8_2_2_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 3:
-                internal::gemm_q8_2_2_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 4:
-                internal::gemm_q8_2_2_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 5:
-                internal::gemm_q8_2_2_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 6:
-                internal::gemm_q8_2_2_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-              case 7:
-                internal::gemm_q8_2_2_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                        rhs_offset, result_offset,
-                                        multiplicative_offset, shift, result,
-                                        result_stride);
-                break;
-            }
-            break;
-        }
-        break;
+
+    const int cache_friendly_dim = params.n / cache_friendly_tasks_count;
+
+    P task_params = params;
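+    // All chunks except the last cover cache_friendly_dim columns; the last
+    // chunk also absorbs the remainder of params.n.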
+    for (int i = 0; i < cache_friendly_tasks_count - 1; ++i) {
+      internal::UpdateCacheFriendlyTask(0, params.m, i * cache_friendly_dim,
+                                        cache_friendly_dim, params,
+                                        &task_params);
+      Gemm<GemmExecutorPackRHS, P, m, n, k>(task_params);
+    }
+    const int dim_sum = (cache_friendly_tasks_count - 1) * cache_friendly_dim;
+    internal::UpdateCacheFriendlyTask(0, params.m, dim_sum, params.n - dim_sum,
+                                      params, &task_params);
+    Gemm<GemmExecutorPackRHS, P, m, n, k>(task_params);
+  }
+};
+
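+// Mirror of GemmExecutorPackRHSCacheFriendly that splits the LHS along the
+// m dimension instead, delegating each row chunk to GemmExecutorPackLHS.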
+template <int cache_size = 256 * 1024>
+class GemmExecutorPackLHSCacheFriendly {
+ public:
+  template <typename P>
+  static int EstimateScratchSize(const P& params, int kernel_m, int kernel_n,
+                                 int kernel_k) {
+    return cache_size;
+  }
+
+  template <typename P, int m, int n, int k, int m_leftovers, int n_leftovers,
+            int k_leftovers>
+  static void ExecuteDispatch3D(const P& params) {
+    typedef Stream<typename P::InType, m, k, k_leftovers,
+                   typename P::LeftStream>
+        LeftStream;
+
+    typedef Stream<typename P::InType, n, k, k_leftovers,
+                   typename P::RightStream>
+        RightStream;
+
+    const int lhs_scratch = LeftStream::Scratch(params.left_stream);
+    const int rhs_scratch = RightStream::Scratch(params.right_stream);
+
+    const int cache_friendly_tasks_count =
+        internal::CalculateCacheFriendlyTasksCount(cache_size, rhs_scratch,
+                                                   lhs_scratch, params.m, m);
+
+    if (cache_friendly_tasks_count == 1) {
+      GemmExecutorPackLHS::ExecuteDispatch3D<P, m, n, k, m_leftovers,
+                                             n_leftovers, k_leftovers>(params);
+      return;
+    }
+
+    const int cache_friendly_dim = params.m / cache_friendly_tasks_count;
+
+    P task_params = params;
+    for (int i = 0; i < cache_friendly_tasks_count - 1; ++i) {
+      internal::UpdateCacheFriendlyTask(i * cache_friendly_dim,
+                                        cache_friendly_dim, 0, params.n, params,
+                                        &task_params);
+      Gemm<GemmExecutorPackLHS, P, m, n, k>(task_params);
+    }
+    const int dim_sum = (cache_friendly_tasks_count - 1) * cache_friendly_dim;
+    internal::UpdateCacheFriendlyTask(dim_sum, params.m - dim_sum, 0, params.n,
+                                      params, &task_params);
+    Gemm<GemmExecutorPackLHS, P, m, n, k>(task_params);
+  }
+};
+
+namespace internal {
+
+// Stage 3.
+
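+// Recursively matches the runtime k leftover against the compile-time
+// variable_k parameter: on a match it calls the executor's ExecuteDispatch3D
+// with fully resolved template arguments, otherwise it recurses with
+// variable_k - 1.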
+template <typename E, typename P, int dim_m, int dim_n, int dim_k, int fixed_m,
+          int fixed_n, int variable_k>
+struct Dispatch3DStage3 {
+  static void Execute(const P& params, int k) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "Dispatch(3): " << dim_m << "x" << dim_n << "x" << dim_k
+              << " : " << fixed_m << "x" << fixed_n << "x" << variable_k
+              << std::endl
+              << std::flush;
+#endif
+#endif
+    if (k == variable_k) {
+      E::template ExecuteDispatch3D<P, dim_m, dim_n, dim_k, fixed_m, fixed_n,
+                                    variable_k>(params);
+    } else {
+      Dispatch3DStage3<E, P, dim_m, dim_n, dim_k, fixed_m, fixed_n,
+                       variable_k - 1>::Execute(params, k);
     }
   }
-}
+};
 
-void gemm_i32_strided(std::uint8_t* scratch, const std::uint8_t* lhs,
-                      const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                      std::int32_t k, std::int32_t lhs_offset,
-                      std::int32_t rhs_offset, std::int32_t* result,
-                      std::int32_t result_stride) {
-  const bool lhs_aligned = ((reinterpret_cast<std::uintptr_t>(lhs) % 8) == 0);
-  const bool rhs_aligned = ((reinterpret_cast<std::uintptr_t>(rhs) % 8) == 0);
-  const bool k_aligned = ((k % 8) == 0);
-  const bool aligned = lhs_aligned && rhs_aligned && k_aligned;
-  if (aligned) {
-    switch (m % 3) {
-      case 0:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_0_0_0_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_0_0_1_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_0_0_2_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_0_0_3_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_0_0_4_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_0_0_5_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_0_0_6_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_0_0_7_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_0_1_0_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_0_1_1_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_0_1_2_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_0_1_3_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_0_1_4_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_0_1_5_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_0_1_6_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_0_1_7_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_0_2_0_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_0_2_1_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_0_2_2_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_0_2_3_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_0_2_4_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_0_2_5_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_0_2_6_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_0_2_7_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-      case 1:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_1_0_0_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_1_0_1_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_1_0_2_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_1_0_3_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_1_0_4_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_1_0_5_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_1_0_6_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_1_0_7_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_1_1_0_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_1_1_1_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_1_1_2_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_1_1_3_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_1_1_4_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_1_1_5_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_1_1_6_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_1_1_7_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_1_2_0_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_1_2_1_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_1_2_2_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_1_2_3_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_1_2_4_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_1_2_5_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_1_2_6_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_1_2_7_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-      case 2:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_2_0_0_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_2_0_1_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_2_0_2_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_2_0_3_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_2_0_4_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_2_0_5_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_2_0_6_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_2_0_7_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_2_1_0_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_2_1_1_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_2_1_2_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_2_1_3_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_2_1_4_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_2_1_5_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_2_1_6_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_2_1_7_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_2_2_0_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_2_2_1_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_2_2_2_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_2_2_3_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_2_2_4_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_2_2_5_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_2_2_6_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_2_2_7_aligned(scratch, lhs, rhs, m, n, k,
-                                                 lhs_offset, rhs_offset, result,
-                                                 result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-    }
-  } else {
-    switch (m % 3) {
-      case 0:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_0_0_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_0_0_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_0_0_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_0_0_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_0_0_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_0_0_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_0_0_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_0_0_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_0_1_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_0_1_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_0_1_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_0_1_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_0_1_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_0_1_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_0_1_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_0_1_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_0_2_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_0_2_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_0_2_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_0_2_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_0_2_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_0_2_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_0_2_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_0_2_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-      case 1:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_1_0_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_1_0_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_1_0_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_1_0_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_1_0_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_1_0_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_1_0_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_1_0_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_1_1_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_1_1_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_1_1_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_1_1_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_1_1_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_1_1_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_1_1_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_1_1_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_1_2_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_1_2_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_1_2_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_1_2_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_1_2_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_1_2_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_1_2_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_1_2_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-      case 2:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_2_0_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_2_0_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_2_0_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_2_0_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_2_0_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_2_0_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_2_0_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_2_0_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_2_1_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_2_1_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_2_1_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_2_1_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_2_1_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_2_1_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_2_1_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_2_1_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_i32_2_2_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_i32_2_2_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_i32_2_2_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_i32_2_2_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_i32_2_2_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_i32_2_2_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_i32_2_2_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_i32_2_2_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                         rhs_offset, result, result_stride);
-                break;
-            }
-            break;
-        }
-        break;
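+// Stage 3, base case: the candidate leftover depth has counted down to 0. A
+// runtime leftover k of 0 hands control to the executor with all three
+// leftover dimensions (fixed_m, fixed_n, 0) resolved to compile-time template
+// arguments; any other value means the dispatch ran out of cases.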
+template <typename E, typename P, int dim_m, int dim_n, int dim_k, int fixed_m,
+          int fixed_n>
+struct Dispatch3DStage3<E, P, dim_m, dim_n, dim_k, fixed_m, fixed_n, 0> {
+  static void Execute(const P& params, int k) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "Dispatch(3): " << dim_m << "x" << dim_n << "x" << dim_k
+              << " : " << fixed_m << "x" << fixed_n << "x" << 0 << std::endl
+              << std::flush;
+#endif
+#endif
+    if (k == 0) {
+      E::template ExecuteDispatch3D<P, dim_m, dim_n, dim_k, fixed_m, fixed_n,
+                                    0>(params);
+    } else {
+      std::cerr << "FATAL: dispatch3DStage3 failed: ran out of cases."
+                << std::endl
+                << std::flush;
+      std::exit(1);
     }
   }
-}
+};
 
-void gemm_f_strided(std::uint8_t* scratch, const std::uint8_t* lhs,
-                    const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-                    std::int32_t k, std::int32_t lhs_offset,
-                    std::int32_t rhs_offset, float result_scale, float* result,
-                    std::int32_t result_stride) {
-  const bool lhs_aligned = ((reinterpret_cast<std::uintptr_t>(lhs) % 8) == 0);
-  const bool rhs_aligned = ((reinterpret_cast<std::uintptr_t>(rhs) % 8) == 0);
-  const bool k_aligned = ((k % 8) == 0);
-  const bool aligned = lhs_aligned && rhs_aligned && k_aligned;
-  if (aligned) {
-    switch (m % 3) {
-      case 0:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_0_0_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_f_0_0_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_f_0_0_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_f_0_0_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_f_0_0_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_f_0_0_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_f_0_0_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_f_0_0_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_0_1_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_f_0_1_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_f_0_1_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_f_0_1_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_f_0_1_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_f_0_1_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_f_0_1_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_f_0_1_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_0_2_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_f_0_2_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_f_0_2_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_f_0_2_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_f_0_2_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_f_0_2_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_f_0_2_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_f_0_2_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-      case 1:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_1_0_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_f_1_0_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_f_1_0_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_f_1_0_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_f_1_0_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_f_1_0_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_f_1_0_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_f_1_0_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_1_1_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_f_1_1_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_f_1_1_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_f_1_1_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_f_1_1_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_f_1_1_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_f_1_1_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_f_1_1_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_1_2_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_f_1_2_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_f_1_2_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_f_1_2_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_f_1_2_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_f_1_2_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_f_1_2_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_f_1_2_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-      case 2:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_2_0_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_f_2_0_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_f_2_0_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_f_2_0_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_f_2_0_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_f_2_0_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_f_2_0_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_f_2_0_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_2_1_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_f_2_1_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_f_2_1_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_f_2_1_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_f_2_1_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_f_2_1_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_f_2_1_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_f_2_1_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_2_2_0_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 1:
-                internal::gemm_f_2_2_1_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 2:
-                internal::gemm_f_2_2_2_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 3:
-                internal::gemm_f_2_2_3_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 4:
-                internal::gemm_f_2_2_4_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 5:
-                internal::gemm_f_2_2_5_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 6:
-                internal::gemm_f_2_2_6_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-              case 7:
-                internal::gemm_f_2_2_7_aligned(
-                    scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                    result_scale, result, result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-    }
-  } else {
-    switch (m % 3) {
-      case 0:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_0_0_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 1:
-                internal::gemm_f_0_0_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 2:
-                internal::gemm_f_0_0_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 3:
-                internal::gemm_f_0_0_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 4:
-                internal::gemm_f_0_0_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 5:
-                internal::gemm_f_0_0_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 6:
-                internal::gemm_f_0_0_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 7:
-                internal::gemm_f_0_0_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_0_1_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 1:
-                internal::gemm_f_0_1_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 2:
-                internal::gemm_f_0_1_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 3:
-                internal::gemm_f_0_1_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 4:
-                internal::gemm_f_0_1_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 5:
-                internal::gemm_f_0_1_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 6:
-                internal::gemm_f_0_1_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 7:
-                internal::gemm_f_0_1_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_0_2_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 1:
-                internal::gemm_f_0_2_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 2:
-                internal::gemm_f_0_2_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 3:
-                internal::gemm_f_0_2_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 4:
-                internal::gemm_f_0_2_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 5:
-                internal::gemm_f_0_2_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 6:
-                internal::gemm_f_0_2_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 7:
-                internal::gemm_f_0_2_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-      case 1:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_1_0_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 1:
-                internal::gemm_f_1_0_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 2:
-                internal::gemm_f_1_0_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 3:
-                internal::gemm_f_1_0_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 4:
-                internal::gemm_f_1_0_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 5:
-                internal::gemm_f_1_0_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 6:
-                internal::gemm_f_1_0_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 7:
-                internal::gemm_f_1_0_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_1_1_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 1:
-                internal::gemm_f_1_1_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 2:
-                internal::gemm_f_1_1_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 3:
-                internal::gemm_f_1_1_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 4:
-                internal::gemm_f_1_1_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 5:
-                internal::gemm_f_1_1_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 6:
-                internal::gemm_f_1_1_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 7:
-                internal::gemm_f_1_1_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_1_2_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 1:
-                internal::gemm_f_1_2_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 2:
-                internal::gemm_f_1_2_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 3:
-                internal::gemm_f_1_2_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 4:
-                internal::gemm_f_1_2_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 5:
-                internal::gemm_f_1_2_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 6:
-                internal::gemm_f_1_2_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 7:
-                internal::gemm_f_1_2_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-            }
-            break;
-        }
-        break;
-      case 2:
-        switch (n % 3) {
-          case 0:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_2_0_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 1:
-                internal::gemm_f_2_0_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 2:
-                internal::gemm_f_2_0_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 3:
-                internal::gemm_f_2_0_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 4:
-                internal::gemm_f_2_0_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 5:
-                internal::gemm_f_2_0_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 6:
-                internal::gemm_f_2_0_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 7:
-                internal::gemm_f_2_0_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-            }
-            break;
-          case 1:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_2_1_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 1:
-                internal::gemm_f_2_1_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 2:
-                internal::gemm_f_2_1_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 3:
-                internal::gemm_f_2_1_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 4:
-                internal::gemm_f_2_1_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 5:
-                internal::gemm_f_2_1_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 6:
-                internal::gemm_f_2_1_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 7:
-                internal::gemm_f_2_1_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-            }
-            break;
-          case 2:
-            switch (k % 8) {
-              case 0:
-                internal::gemm_f_2_2_0(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 1:
-                internal::gemm_f_2_2_1(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 2:
-                internal::gemm_f_2_2_2(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 3:
-                internal::gemm_f_2_2_3(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 4:
-                internal::gemm_f_2_2_4(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 5:
-                internal::gemm_f_2_2_5(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 6:
-                internal::gemm_f_2_2_6(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-              case 7:
-                internal::gemm_f_2_2_7(scratch, lhs, rhs, m, n, k, lhs_offset,
-                                       rhs_offset, result_scale, result,
-                                       result_stride);
-                break;
-            }
-            break;
-        }
-        break;
+// Stage 2.
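+//
+// Resolves the runtime leftover n to a compile-time constant by recursing
+// from dim_n - 1 down to 0, then forwards to stage 3 to resolve the leftover k.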
+
+template <typename E, typename P, int dim_m, int dim_n, int dim_k, int fixed_m,
+          int variable_n>
+struct Dispatch3DStage2 {
+  static void Execute(const P& params, int n, int k) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "Dispatch(2): " << dim_m << "x" << dim_n << "x" << dim_k
+              << " : " << fixed_m << "x" << variable_n << std::endl
+              << std::flush;
+#endif
+#endif
+    if (n == variable_n) {
+      Dispatch3DStage3<E, P, dim_m, dim_n, dim_k, fixed_m, variable_n,
+                       dim_k - 1>::Execute(params, k);
+    } else {
+      Dispatch3DStage2<E, P, dim_m, dim_n, dim_k, fixed_m,
+                       variable_n - 1>::Execute(params, n, k);
     }
   }
-}
+};
 
-void gemm_q8(std::uint8_t* scratch, const std::uint8_t* lhs,
-             const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-             std::int32_t k, std::int32_t lhs_offset, std::int32_t rhs_offset,
-             std::int32_t result_offset, std::int32_t multiplicative_offset,
-             std::int32_t shift, std::uint8_t* result) {
-  gemm_q8_strided(scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                  result_offset, multiplicative_offset, shift, result, n);
-}
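+// Stage 2, base case: only n == 0 is left to match; anything else is fatal.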
+template <typename E, typename P, int dim_m, int dim_n, int dim_k, int fixed_m>
+struct Dispatch3DStage2<E, P, dim_m, dim_n, dim_k, fixed_m, 0> {
+  static void Execute(const P& params, int n, int k) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "Dispatch(2): " << dim_m << "x" << dim_n << "x" << dim_k
+              << " : " << fixed_m << "x" << 0 << std::endl
+              << std::flush;
+#endif
+#endif
+    if (n == 0) {
+      Dispatch3DStage3<E, P, dim_m, dim_n, dim_k, fixed_m, 0,
+                       dim_k - 1>::Execute(params, k);
+    } else {
+      std::cerr << "FATAL: dispatch3DStage2 failed: ran out of cases."
+                << std::endl
+                << std::flush;
+      std::exit(1);
+    }
+  }
+};
 
-void gemm_i32(std::uint8_t* scratch, const std::uint8_t* lhs,
-              const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-              std::int32_t k, std::int32_t lhs_offset, std::int32_t rhs_offset,
-              std::int32_t* result) {
-  gemm_i32_strided(scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset, result,
-                   n);
-}
+// Stage 1.
 
-void gemm_f(std::uint8_t* scratch, const std::uint8_t* lhs,
-            const std::uint8_t* rhs, std::int32_t m, std::int32_t n,
-            std::int32_t k, std::int32_t lhs_offset, std::int32_t rhs_offset,
-            float result_scale, float* result) {
-  gemm_f_strided(scratch, lhs, rhs, m, n, k, lhs_offset, rhs_offset,
-                 result_scale, result, n);
+template <typename E, typename P, int dim_m, int dim_n, int dim_k,
+          int variable_m>
+struct Dispatch3DStage1 {
+  static void Execute(const P& params, int m, int n, int k) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "Dispatch(1): " << dim_m << "x" << dim_n << "x" << dim_k
+              << " : " << variable_m << std::endl
+              << std::flush;
+#endif
+#endif
+    if (m == variable_m) {
+      Dispatch3DStage2<E, P, dim_m, dim_n, dim_k, variable_m,
+                       dim_n - 1>::Execute(params, n, k);
+    } else {
+      Dispatch3DStage1<E, P, dim_m, dim_n, dim_k, variable_m - 1>::Execute(
+          params, m, n, k);
+    }
+  }
+};
+
+template <typename E, typename P, int dim_m, int dim_n, int dim_k>
+struct Dispatch3DStage1<E, P, dim_m, dim_n, dim_k, 0> {
+  static void Execute(const P& params, int m, int n, int k) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "Dispatch(1): " << dim_m << "x" << dim_n << "x" << dim_k
+              << " : " << 0 << std::endl
+              << std::flush;
+#endif
+#endif
+    if (m == 0) {
+      Dispatch3DStage2<E, P, dim_m, dim_n, dim_k, 0, dim_n - 1>::Execute(params,
+                                                                         n, k);
+    } else {
+      std::cerr << "FATAL: dispatch3DStage1 failed: ran out of cases."
+                << std::endl
+                << std::flush;
+      std::exit(1);
+    }
+  }
+};
+
+}  // namespace internal
+
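+// Gemm() resolves the runtime leftover sizes (params.m % kernel_m,
+// params.n % kernel_n, params.k % kernel_k) to a fully specialized kernel:
+// each Dispatch3DStage peels one dimension, comparing the remainder against
+// a descending compile-time candidate until they match, so the final stage
+// invokes the executor with all three leftovers fixed at compile time.
+// Illustrative call (the executor, params type and 3x3x8 kernel shape are
+// hypothetical, not taken from this header):
+//
+//   Gemm<SomeExecutor, SomeParams, 3, 3, 8>(params);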
+template <typename Executor, typename Params, int kernel_m, int kernel_n,
+          int kernel_k>
+inline void Gemm(const Params& params) {
+  internal::Dispatch3DStage1<Executor, Params, kernel_m, kernel_n, kernel_k,
+                             kernel_m - 1>::Execute(params, params.m % kernel_m,
+                                                    params.n % kernel_n,
+                                                    params.k % kernel_k);
 }
 
 }  // namespace meta
-
 }  // namespace gemmlowp
 
-#else
-#warning "Meta gemm fast-path requires GEMMLOWP_NEON_32!"
-#endif
-
 #endif  // GEMMLOWP_META_SINGLE_THREAD_GEMM_H_
diff --git a/meta/single_thread_transform.h b/meta/single_thread_transform.h
new file mode 100644
index 0000000..6f79529
--- /dev/null
+++ b/meta/single_thread_transform.h
@@ -0,0 +1,90 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_SINGLE_THREAD_TRANSFORM_H_
+#define GEMMLOWP_META_SINGLE_THREAD_TRANSFORM_H_
+
+#include <iostream>
+#include "base.h"
+
+namespace gemmlowp {
+namespace meta {
+
+template <typename Params, int kernel_size>
+void Transform1D(const Params& params);
+
+namespace internal {
+
+class Transform1DExecutor {
+ public:
+  template <typename P, int kernel_size, int leftovers>
+  static void ExecuteDispatch1D(const P& params) {
+    Transform1DKernel<typename P::InType, typename P::OutType,
+                      typename P::Kernel, kernel_size,
+                      leftovers>::Transform(params.input, params.kernel,
+                                            params.output);
+  }
+};
+
+template <typename E, typename P, int kernel_size, int variable_leftovers>
+struct Dispatch1D {
+  static void Execute(const P& params, int leftovers) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "Dispatch(1): " << kernel_size << ":" << variable_leftovers
+              << std::endl
+              << std::flush;
+#endif
+#endif
+    if (leftovers == variable_leftovers) {
+      E::template ExecuteDispatch1D<P, kernel_size, variable_leftovers>(params);
+    } else {
+      Dispatch1D<E, P, kernel_size, variable_leftovers - 1>::Execute(params,
+                                                                     leftovers);
+    }
+  }
+};
+
+template <typename E, typename P, int kernel_size>
+struct Dispatch1D<E, P, kernel_size, 0> {
+  static void Execute(const P& params, int leftovers) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "Dispatch(1): " << kernel_size << ": 0" << std::endl
+              << std::flush;
+#endif
+#endif
+    if (leftovers == 0) {
+      E::template ExecuteDispatch1D<P, kernel_size, 0>(params);
+    } else {
+      std::cerr << "FATAL: dispatch1D failed: ran out of cases." << std::endl
+                << std::flush;
+      std::exit(1);
+    }
+  }
+};
+
+}  // namespace internal
+
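+// Transform1D() maps the runtime leftover count (params.kernel.count %
+// kernel_size) to a compile-time value by walking Dispatch1D down from
+// kernel_size - 1, then runs the matching Transform1DKernel specialization
+// through Transform1DExecutor.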
+template <typename Params, int kernel_size>
+inline void Transform1D(const Params& params) {
+  internal::Dispatch1D<internal::Transform1DExecutor, Params, kernel_size,
+                       kernel_size - 1>::Execute(params, params.kernel.count %
+                                                             kernel_size);
+}
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#endif  // GEMMLOWP_META_SINGLE_THREAD_TRANSFORM_H_
diff --git a/meta/streams.h b/meta/streams.h
new file mode 100644
index 0000000..90ae7bb
--- /dev/null
+++ b/meta/streams.h
@@ -0,0 +1,312 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_STREAMS_H_
+#define GEMMLOWP_META_STREAMS_H_
+
+#include <iostream>
+#include <typeinfo>
+#include "base.h"
+
+namespace gemmlowp {
+namespace meta {
+
+struct RowMajor {
+ public:
+  int count;
+  int stride;
+};
+
+struct RowMajorWithSum {
+ public:
+  int count;
+  int stride;
+  int multiplicative_sum_offset;
+  int additive_sum_offset;
+};
+
+struct ColumnMajorWithSum {
+ public:
+  int count;
+  int stride;
+  int multiplicative_sum_offset;
+  int additive_sum_offset;
+};
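+
+// Notes on the parameter structs above: 'count' is the number of elements
+// along the packed dimension and 'stride' is the byte distance between
+// consecutive lanes in the source. The *_sum_offset fields are consumed by
+// the Pack() specializations (e.g. in streams_arm_32.h), which append
+// per-lane sums computed as
+// sum * multiplicative_sum_offset + additive_sum_offset to the packed block.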
+
+template <typename InType>
+class StreamUtil<InType, RowMajor> {
+ public:
+  static const InType* Offset(const RowMajor& params, const InType* source,
+                              int offset_stride, int offset_advance) {
+    return reinterpret_cast<const InType*>(
+        reinterpret_cast<const std::uint8_t*>(source) +
+        offset_stride * params.stride + offset_advance * sizeof(InType));
+  }
+
+  static InType* Offset(const RowMajor& params, InType* source,
+                        int offset_stride, int offset_advance) {
+    return reinterpret_cast<InType*>(reinterpret_cast<std::uint8_t*>(source) +
+                                     offset_stride * params.stride +
+                                     offset_advance * sizeof(InType));
+  }
+
+  static int Scratch(const RowMajor& params, int lanes_count, int pack_size) {
+    return AlignTo<64>(lanes_count * AlignTo(pack_size, params.stride));
+  }
+};
+
+template <typename InType>
+class StreamUtil<InType, RowMajorWithSum> {
+ public:
+  static const InType* Offset(const RowMajorWithSum& params,
+                              const InType* source, int offset_stride,
+                              int offset_advance) {
+    return reinterpret_cast<const InType*>(
+        reinterpret_cast<const std::uint8_t*>(source) +
+        offset_stride * params.stride + offset_advance * sizeof(InType));
+  }
+
+  static InType* Offset(const RowMajorWithSum& params, InType* source,
+                        int offset_stride, int offset_advance) {
+    return reinterpret_cast<InType*>(reinterpret_cast<std::uint8_t*>(source) +
+                                     offset_stride * params.stride +
+                                     offset_advance * sizeof(InType));
+  }
+
+  static int Scratch(const RowMajorWithSum& params, int lanes_count,
+                     int pack_size) {
+    return 32 + AlignTo<32>(sizeof(InType) * lanes_count *
+                            AlignTo(pack_size, params.count));
+  }
+};
+
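+// For column-major sources, Offset() swaps the roles of the two offsets:
+// advancing along the packed dimension moves by whole strides, while moving
+// to the next lane steps by single elements.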
+template <typename InType>
+class StreamUtil<InType, ColumnMajorWithSum> {
+ public:
+  static const InType* Offset(const ColumnMajorWithSum& params,
+                              const InType* source, int offset_stride,
+                              int offset_advance) {
+    return reinterpret_cast<const InType*>(
+        reinterpret_cast<const std::uint8_t*>(source) +
+        params.stride * offset_advance + offset_stride * sizeof(InType));
+  }
+
+  static InType* Offset(const ColumnMajorWithSum& params, InType* source,
+                        int offset_stride, int offset_advance) {
+    return reinterpret_cast<InType*>(reinterpret_cast<std::uint8_t*>(source) +
+                                     params.stride * offset_advance +
+                                     offset_stride * sizeof(InType));
+  }
+
+  static int Scratch(const ColumnMajorWithSum& params, int lanes_count,
+                     int pack_size) {
+    return 32 + AlignTo<32>(sizeof(InType) * lanes_count *
+                            AlignTo(pack_size, params.count));
+  }
+};
+
+template <typename InType, int lanes_count, int pack_size, int leftovers>
+class Stream<InType, lanes_count, pack_size, leftovers, RowMajor> {
+ public:
+  static void Pack(const InType* in, const RowMajor& params, InType* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "RowMajor(" << std::string(typeid(InType).name())
+              << ")::Pack() -- " << lanes_count << "x" << pack_size << " + "
+              << leftovers << std::endl;
+#endif
+#else
+    if (lanes_count != 0) {
+      std::cerr << "FATAL: RowMajorWithSum::Pack not implemented." << std::endl;
+      std::exit(1);
+    }
+#endif
+  }
+
+  static int UnpackedAdvance(const RowMajor& params) {
+    return sizeof(InType) * pack_size;
+  }
+
+  static int PackedAdvance(const RowMajor& params) {
+    return sizeof(InType) * pack_size * lanes_count;
+  }
+
+  static int UnpackedStride(const RowMajor& params) {
+    return lanes_count * params.stride;
+  }
+
+  static int PackedStride(const RowMajor& params) {
+    return AlignTo<32>(lanes_count * AlignTo<pack_size>(params.stride));
+  }
+
+  static int Scratch(const RowMajor& params) { return PackedStride(params); }
+
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  static void Debug(const RowMajor& params) {
+    std::cout << "RowMajor(" << typeid(InType).name() << ")" << std::endl;
+    std::cout << "  dims: " << lanes_count << "x" << pack_size << " + "
+              << leftovers << std::endl;
+    std::cout << "  scratch: " << Scratch(params) << std::endl;
+    std::cout << "  unpacked advance: " << UnpackedAdvance(params) << std::endl;
+    std::cout << "  packed advance: " << PackedAdvance(params) << std::endl;
+    std::cout << "  unpacked stride: " << UnpackedStride(params) << std::endl;
+    std::cout << "  packed stride: " << PackedStride(params) << std::endl;
+    std::cout << "  params:" << std::endl;
+    std::cout << "    count: " << params.count << std::endl;
+    std::cout << "    stride: " << params.stride << std::endl;
+  }
+#endif
+#endif
+};
+
+template <typename InType, int lanes_count, int pack_size, int leftovers>
+class Stream<InType, lanes_count, pack_size, leftovers, RowMajorWithSum> {
+ public:
+  static void Pack(const InType* in, const RowMajorWithSum& params,
+                   InType* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "RowMajorWithSum(" << typeid(InType).name() << ")::Pack() -- "
+              << lanes_count << "x" << pack_size << " + " << leftovers
+              << std::endl;
+#endif
+#else
+    if (lanes_count != 0) {
+      std::cerr << "FATAL: RowMajorWithSum::Pack not implemented." << std::endl;
+      std::exit(1);
+    }
+#endif
+  }
+
+  static int UnpackedAdvance(const RowMajorWithSum& params) {
+    return sizeof(InType) * pack_size;
+  }
+
+  static int PackedAdvance(const RowMajorWithSum& params) {
+    return sizeof(InType) * pack_size * lanes_count;
+  }
+
+  static int UnpackedStride(const RowMajorWithSum& params) {
+    return sizeof(InType) * lanes_count * params.stride;
+  }
+
+  static int PackedStride(const RowMajorWithSum& params) {
+    return 32 + AlignTo<32>(sizeof(InType) * lanes_count *
+                            AlignTo<pack_size>(params.count));
+  }
+
+  static int Scratch(const RowMajorWithSum& params) {
+    return PackedStride(params);
+  }
+
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  static void Debug(const RowMajorWithSum& params) {
+    std::cout << "RowMajorWithSum(" << typeid(InType).name() << ")"
+              << std::endl;
+    std::cout << "  dims: " << lanes_count << "x" << pack_size << " + "
+              << leftovers << std::endl;
+    std::cout << "  scratch: " << Scratch(params) << std::endl;
+    std::cout << "  unpacked advance: " << UnpackedAdvance(params) << std::endl;
+    std::cout << "  packed advance: " << PackedAdvance(params) << std::endl;
+    std::cout << "  unpacked stride: " << UnpackedStride(params) << std::endl;
+    std::cout << "  packed stride: " << PackedStride(params) << std::endl;
+    std::cout << "  params:" << std::endl;
+    std::cout << "    count: " << params.count << std::endl;
+    std::cout << "    stride: " << params.stride << std::endl;
+    std::cout << "    multiplicative_sum_offset: "
+              << params.multiplicative_sum_offset << std::endl;
+    std::cout << "    additive_sum_offset: " << params.additive_sum_offset
+              << std::endl;
+  }
+#endif
+#endif
+};
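+
+// The *WithSum streams (row-major above, column-major below) interleave one
+// pack_size-wide slice from each of the lanes_count lanes per
+// PackedAdvance() step, and PackedStride() reserves 32 extra bytes per
+// packed block for the per-lane sums written by Pack().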
+
+template <typename InType, int lanes_count, int pack_size, int leftovers>
+class Stream<InType, lanes_count, pack_size, leftovers, ColumnMajorWithSum> {
+ public:
+  static void Pack(const InType* in, const ColumnMajorWithSum& params,
+                   InType* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "ColumnMajorWithSum(" << typeid(InType).name()
+              << ")::Pack() -- " << lanes_count << "x" << pack_size << " + "
+              << leftovers << std::endl;
+#endif
+#else
+    if (lanes_count != 0) {
+      std::cerr << "FATAL: ColumnMajorWithSum::Pack not implemented."
+                << std::endl;
+      std::exit(1);
+    }
+#endif
+  }
+
+  static int UnpackedAdvance(const ColumnMajorWithSum& params) {
+    return sizeof(InType) * pack_size * params.stride;
+  }
+
+  static int PackedAdvance(const ColumnMajorWithSum& params) {
+    return sizeof(InType) * pack_size * lanes_count;
+  }
+
+  static int UnpackedStride(const ColumnMajorWithSum& params) {
+    return sizeof(InType) * lanes_count;
+  }
+
+  static int PackedStride(const ColumnMajorWithSum& params) {
+    return 32 + AlignTo<32>(sizeof(InType) * lanes_count *
+                            AlignTo<pack_size>(params.count));
+  }
+
+  static int Scratch(const ColumnMajorWithSum& params) {
+    return PackedStride(params);
+  }
+
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  static void Debug(const ColumnMajorWithSum& params) {
+    std::cout << "ColumnMajorWithSum(" << typeid(InType).name() << ")"
+              << std::endl;
+    std::cout << "  dims: " << lanes_count << "x" << pack_size << " + "
+              << leftovers << std::endl;
+    std::cout << "  scratch: " << Scratch(params) << std::endl;
+    std::cout << "  unpacked advance: " << UnpackedAdvance(params) << std::endl;
+    std::cout << "  packed advance: " << PackedAdvance(params) << std::endl;
+    std::cout << "  unpacked stride: " << UnpackedStride(params) << std::endl;
+    std::cout << "  packed stride: " << PackedStride(params) << std::endl;
+    std::cout << "  params:" << std::endl;
+    std::cout << "    count: " << params.count << std::endl;
+    std::cout << "    stride: " << params.stride << std::endl;
+    std::cout << "    multiplicative_sum_offset: "
+              << params.multiplicative_sum_offset << std::endl;
+    std::cout << "    additive_sum_offset: " << params.additive_sum_offset
+              << std::endl;
+  }
+#endif
+#endif
+};
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#ifdef GEMMLOWP_NEON_32
+#include "streams_arm_32.h"
+#elif defined(GEMMLOWP_NEON_64)
+#include "streams_arm_64.h"
+#endif
+
+#endif  // GEMMLOWP_META_STREAMS_H_
diff --git a/meta/streams_arm_32.h b/meta/streams_arm_32.h
new file mode 100644
index 0000000..8ef4cdd
--- /dev/null
+++ b/meta/streams_arm_32.h
@@ -0,0 +1,12248 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_STREAMS_ARM_32_H_
+#define GEMMLOWP_META_STREAMS_ARM_32_H_
+
+#ifdef GEMMLOWP_NEON_32
+
+#include <cassert>
+#include <cstdint>
+
+namespace gemmlowp {
+namespace meta {
+
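+// The specializations below share one pattern: the main loop copies eight
+// elements per lane per iteration while accumulating widening per-lane sums
+// in NEON registers; a tail block handles the 'leftovers' elements with
+// element-wise loads into zeroed registers; finally the sums are reduced and
+// stored as sum * multiplicative_sum_offset + additive_sum_offset right
+// after the packed data.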
+template <>
+inline void Stream<uint8_t, 1, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x1.
+      "vmov.i8 d0, #0\n"
+      "vld1.8 {d0[0]}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x2.
+      "vmov.i8 d0, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x3.
+      "vmov.i8 d0, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[2]}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x4.
+      "vmov.i8 d0, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x5.
+      "vmov.i8 d0, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[4]}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x6.
+      "vmov.i8 d0, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x7.
+      "vmov.i8 d0, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.8 {d0[6]}, [%[in]]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x1.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.8 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d1[0]}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x2.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x3.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[2]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[2]}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x4.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x5.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[4]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[4]}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x6.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x7.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.8 {d0[6]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.8 {d1[6]}, [r0]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20",
+        "d21", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x1.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld1.8 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d1[0]}, [r0]!\n"
+      "vld1.8 {d2[0]}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20",
+        "d21", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x2.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20",
+        "d21", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x3.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[2]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[2]}, [r0]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[2]}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20",
+        "d21", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x4.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20",
+        "d21", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x5.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[4]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[4]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[4]}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20",
+        "d21", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x6.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.16 {d2[2]}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20",
+        "d21", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x7.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.8 {d0[6]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.8 {d1[6]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.16 {d2[2]}, [r1]!\n"
+      "vld1.8 {d2[6]}, [r1]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20",
+        "d21", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "cc", "memory");
+}
+
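+// Note on the specializations that follow: the fourth template argument is
+// the leftover column count, presumably params.count % 8, which is what the
+// prologue's count arithmetic assumes. The leftovers == 0 variants, like the
+// 4x8 one above, need no epilogue; the others branch to label "2:" and load
+// the remaining 1..7 bytes of each row into a zeroed d register as a
+// 4-byte / 2-byte / 1-byte lane-load decomposition, e.g. 7 leftovers =
+// vld1.32 lane 0 + vld1.16 lane 2 + vld1.8 lane 6, and 3 leftovers =
+// vld1.16 lane 0 + vld1.8 lane 2. The zero padding keeps every stored chunk
+// 8 bytes wide without changing the accumulated row sums.
+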
+template <>
+inline void Stream<uint8_t, 4, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x1.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.8 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d1[0]}, [r0]!\n"
+      "vld1.8 {d2[0]}, [r1]!\n"
+      "vld1.8 {d3[0]}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x2.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vld1.16 {d3[0]}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x3.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[2]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[2]}, [r0]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[2]}, [r1]!\n"
+      "vld1.16 {d3[0]}, [r2]!\n"
+      "vld1.8 {d3[2]}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x4.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x5.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[4]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[4]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[4]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.8 {d3[4]}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x6.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.16 {d2[2]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.16 {d3[2]}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x7.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.8 {d0[6]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.8 {d1[6]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.16 {d2[2]}, [r1]!\n"
+      "vld1.8 {d2[6]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.16 {d3[2]}, [r2]!\n"
+      "vld1.8 {d3[6]}, [r2]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
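+// From five rows up, the reduced per-row sums no longer fit in a single q
+// register: the reduction above narrows the q8..q12 accumulators into q8 and
+// q9, applies the multiply/add to both, and stores eight int32 values. Only
+// the first five lanes are meaningful for the 5-row variants; the remaining
+// lanes hold duplicated or stale padding that the consumers of the packed
+// block presumably ignore.
+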
+template <>
+inline void Stream<uint8_t, 5, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x1.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.8 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d1[0]}, [r0]!\n"
+      "vld1.8 {d2[0]}, [r1]!\n"
+      "vld1.8 {d3[0]}, [r2]!\n"
+      "vld1.8 {d4[0]}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x2.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vld1.16 {d3[0]}, [r2]!\n"
+      "vld1.16 {d4[0]}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x3.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[2]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[2]}, [r0]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[2]}, [r1]!\n"
+      "vld1.16 {d3[0]}, [r2]!\n"
+      "vld1.8 {d3[2]}, [r2]!\n"
+      "vld1.16 {d4[0]}, [r3]!\n"
+      "vld1.8 {d4[2]}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x4.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x5.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[4]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[4]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[4]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.8 {d3[4]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.8 {d4[4]}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x6.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.16 {d2[2]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.16 {d3[2]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.16 {d4[2]}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x7.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.8 {d0[6]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.8 {d1[6]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.16 {d2[2]}, [r1]!\n"
+      "vld1.8 {d2[6]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.16 {d3[2]}, [r2]!\n"
+      "vld1.8 {d3[6]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.16 {d4[2]}, [r3]!\n"
+      "vld1.8 {d4[6]}, [r3]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "d0", "d1", "d2", "d3", "d4", "d5", "d16",
+        "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26",
+        "d27", "cc", "memory");
+}
+
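+// The <..., 1..7> variants handle a column count of the form 8*k + leftovers:
+// the leftover count is subtracted up front, the 6x8 main loop runs k times,
+// and the final partial columns are then loaded with lane loads into
+// zero-initialized d registers, so the last packed block is still a full 6x8.
+//
+// A minimal scalar sketch (illustration only, not part of the generated code)
+// of what these Pack specializations compute; 'rows' and 'leftovers' mirror
+// the template arguments, while 'in', 'out', 'stride', 'count' and the two
+// offsets mirror the parameters and RowMajorWithSum fields used below:
+//
+//   int32_t sums[rows] = {0};
+//   for (int chunk = 0; chunk * 8 < count; ++chunk) {
+//     for (int r = 0; r < rows; ++r) {
+//       for (int c = 0; c < 8; ++c) {
+//         const int col = chunk * 8 + c;
+//         const uint8_t value = (col < count) ? in[r * stride + col] : 0;
+//         *out++ = value;
+//         sums[r] += value;
+//       }
+//     }
+//   }
+//   // Trailing block of eight int32 values; lanes past 'rows' are padding.
+//   int32_t* sums_out = reinterpret_cast<int32_t*>(out);
+//   for (int r = 0; r < rows; ++r) {
+//     sums_out[r] = sums[r] * multiplicative_sum_offset + additive_sum_offset;
+//   }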
+template <>
+inline void Stream<uint8_t, 6, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x1.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.8 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d1[0]}, [r0]!\n"
+      "vld1.8 {d2[0]}, [r1]!\n"
+      "vld1.8 {d3[0]}, [r2]!\n"
+      "vld1.8 {d4[0]}, [r3]!\n"
+      "vld1.8 {d5[0]}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "d0", "d1", "d2", "d3", "d4", "d5", "d16",
+        "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26",
+        "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x2.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vld1.16 {d3[0]}, [r2]!\n"
+      "vld1.16 {d4[0]}, [r3]!\n"
+      "vld1.16 {d5[0]}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "d0", "d1", "d2", "d3", "d4", "d5", "d16",
+        "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26",
+        "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x3.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[2]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[2]}, [r0]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[2]}, [r1]!\n"
+      "vld1.16 {d3[0]}, [r2]!\n"
+      "vld1.8 {d3[2]}, [r2]!\n"
+      "vld1.16 {d4[0]}, [r3]!\n"
+      "vld1.8 {d4[2]}, [r3]!\n"
+      "vld1.16 {d5[0]}, [r4]!\n"
+      "vld1.8 {d5[2]}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "d0", "d1", "d2", "d3", "d4", "d5", "d16",
+        "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26",
+        "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x4.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.32 {d5[0]}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "d0", "d1", "d2", "d3", "d4", "d5", "d16",
+        "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26",
+        "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x5.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[4]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[4]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[4]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.8 {d3[4]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.8 {d4[4]}, [r3]!\n"
+      "vld1.32 {d5[0]}, [r4]!\n"
+      "vld1.8 {d5[4]}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "d0", "d1", "d2", "d3", "d4", "d5", "d16",
+        "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26",
+        "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x6.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.16 {d2[2]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.16 {d3[2]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.16 {d4[2]}, [r3]!\n"
+      "vld1.32 {d5[0]}, [r4]!\n"
+      "vld1.16 {d5[2]}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "d0", "d1", "d2", "d3", "d4", "d5", "d16",
+        "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26",
+        "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x7.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.8 {d0[6]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.8 {d1[6]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.16 {d2[2]}, [r1]!\n"
+      "vld1.8 {d2[6]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.16 {d3[2]}, [r2]!\n"
+      "vld1.8 {d3[6]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.16 {d4[2]}, [r3]!\n"
+      "vld1.8 {d4[6]}, [r3]!\n"
+      "vld1.32 {d5[0]}, [r4]!\n"
+      "vld1.16 {d5[2]}, [r4]!\n"
+      "vld1.8 {d5[6]}, [r4]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "d0", "d1", "d2", "d3", "d4", "d5", "d16",
+        "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26",
+        "d27", "cc", "memory");
+}
+
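+// The seven-row variants follow the same scheme with accumulators q8..q14 and
+// row pointers r0..r5. Presumably because those low core registers are all
+// spoken for, multiplicative_sum_offset and additive_sum_offset are passed as
+// "m" memory operands and only loaded into r0/r1 once the main loop no longer
+// needs the row pointers.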
+template <>
+inline void Stream<uint8_t, 7, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "d0", "d1", "d2", "d3", "d4", "d5",
+        "d6", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24",
+        "d25", "d26", "d27", "d28", "d29", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x1.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.8 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d1[0]}, [r0]!\n"
+      "vld1.8 {d2[0]}, [r1]!\n"
+      "vld1.8 {d3[0]}, [r2]!\n"
+      "vld1.8 {d4[0]}, [r3]!\n"
+      "vld1.8 {d5[0]}, [r4]!\n"
+      "vld1.8 {d6[0]}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "d0", "d1", "d2", "d3", "d4", "d5",
+        "d6", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24",
+        "d25", "d26", "d27", "d28", "d29", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x2.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vld1.16 {d3[0]}, [r2]!\n"
+      "vld1.16 {d4[0]}, [r3]!\n"
+      "vld1.16 {d5[0]}, [r4]!\n"
+      "vld1.16 {d6[0]}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "d0", "d1", "d2", "d3", "d4", "d5",
+        "d6", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24",
+        "d25", "d26", "d27", "d28", "d29", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x3.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[2]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[2]}, [r0]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[2]}, [r1]!\n"
+      "vld1.16 {d3[0]}, [r2]!\n"
+      "vld1.8 {d3[2]}, [r2]!\n"
+      "vld1.16 {d4[0]}, [r3]!\n"
+      "vld1.8 {d4[2]}, [r3]!\n"
+      "vld1.16 {d5[0]}, [r4]!\n"
+      "vld1.8 {d5[2]}, [r4]!\n"
+      "vld1.16 {d6[0]}, [r5]!\n"
+      "vld1.8 {d6[2]}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "d0", "d1", "d2", "d3", "d4", "d5",
+        "d6", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24",
+        "d25", "d26", "d27", "d28", "d29", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x4.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.32 {d5[0]}, [r4]!\n"
+      "vld1.32 {d6[0]}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "d0", "d1", "d2", "d3", "d4", "d5",
+        "d6", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24",
+        "d25", "d26", "d27", "d28", "d29", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x5.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[4]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[4]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[4]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.8 {d3[4]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.8 {d4[4]}, [r3]!\n"
+      "vld1.32 {d5[0]}, [r4]!\n"
+      "vld1.8 {d5[4]}, [r4]!\n"
+      "vld1.32 {d6[0]}, [r5]!\n"
+      "vld1.8 {d6[4]}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "d0", "d1", "d2", "d3", "d4", "d5",
+        "d6", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24",
+        "d25", "d26", "d27", "d28", "d29", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x6.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.16 {d2[2]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.16 {d3[2]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.16 {d4[2]}, [r3]!\n"
+      "vld1.32 {d5[0]}, [r4]!\n"
+      "vld1.16 {d5[2]}, [r4]!\n"
+      "vld1.32 {d6[0]}, [r5]!\n"
+      "vld1.16 {d6[2]}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "d0", "d1", "d2", "d3", "d4", "d5",
+        "d6", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24",
+        "d25", "d26", "d27", "d28", "d29", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x7.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.8 {d0[6]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.8 {d1[6]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.16 {d2[2]}, [r1]!\n"
+      "vld1.8 {d2[6]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.16 {d3[2]}, [r2]!\n"
+      "vld1.8 {d3[6]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.16 {d4[2]}, [r3]!\n"
+      "vld1.8 {d4[6]}, [r3]!\n"
+      "vld1.32 {d5[0]}, [r4]!\n"
+      "vld1.16 {d5[2]}, [r4]!\n"
+      "vld1.8 {d5[6]}, [r4]!\n"
+      "vld1.32 {d6[0]}, [r5]!\n"
+      "vld1.16 {d6[2]}, [r5]!\n"
+      "vld1.8 {d6[6]}, [r5]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "d0", "d1", "d2", "d3", "d4", "d5",
+        "d6", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24",
+        "d25", "d26", "d27", "d28", "d29", "cc", "memory");
+}
+
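+// The eight-row variants use the full q8..q15 accumulator bank and r0..r6 as
+// row pointers. Each iteration writes a complete 64-byte 8x8 block, and the
+// packed-output stores carry the :256 (32-byte) alignment hint.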
+template <>
+inline void Stream<uint8_t, 8, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "add r6, r5, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vld1.32 {d7}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "d0", "d1", "d2", "d3", "d4",
+        "d5", "d6", "d7", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "add r6, r5, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
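+      // count is pre-decremented by the leftover width so the loop below only
+      // consumes full 8-column chunks; if nothing but the leftovers remains
+      // it branches straight to label 2, which handles the zero-padded tail.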
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vld1.32 {d7}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x1.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.8 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d1[0]}, [r0]!\n"
+      "vld1.8 {d2[0]}, [r1]!\n"
+      "vld1.8 {d3[0]}, [r2]!\n"
+      "vld1.8 {d4[0]}, [r3]!\n"
+      "vld1.8 {d5[0]}, [r4]!\n"
+      "vld1.8 {d6[0]}, [r5]!\n"
+      "vld1.8 {d7[0]}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "d0", "d1", "d2", "d3", "d4",
+        "d5", "d6", "d7", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "add r6, r5, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vld1.32 {d7}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x2.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vld1.16 {d3[0]}, [r2]!\n"
+      "vld1.16 {d4[0]}, [r3]!\n"
+      "vld1.16 {d5[0]}, [r4]!\n"
+      "vld1.16 {d6[0]}, [r5]!\n"
+      "vld1.16 {d7[0]}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "d0", "d1", "d2", "d3", "d4",
+        "d5", "d6", "d7", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "add r6, r5, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vld1.32 {d7}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x3.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.16 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[2]}, [%[in]]!\n"
+      "vld1.16 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[2]}, [r0]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[2]}, [r1]!\n"
+      "vld1.16 {d3[0]}, [r2]!\n"
+      "vld1.8 {d3[2]}, [r2]!\n"
+      "vld1.16 {d4[0]}, [r3]!\n"
+      "vld1.8 {d4[2]}, [r3]!\n"
+      "vld1.16 {d5[0]}, [r4]!\n"
+      "vld1.8 {d5[2]}, [r4]!\n"
+      "vld1.16 {d6[0]}, [r5]!\n"
+      "vld1.8 {d6[2]}, [r5]!\n"
+      "vld1.16 {d7[0]}, [r6]!\n"
+      "vld1.8 {d7[2]}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "d0", "d1", "d2", "d3", "d4",
+        "d5", "d6", "d7", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "add r6, r5, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vld1.32 {d7}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x4.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.32 {d5[0]}, [r4]!\n"
+      "vld1.32 {d6[0]}, [r5]!\n"
+      "vld1.32 {d7[0]}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "d0", "d1", "d2", "d3", "d4",
+        "d5", "d6", "d7", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "add r6, r5, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vld1.32 {d7}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x5.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d0[4]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.8 {d1[4]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[4]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.8 {d3[4]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.8 {d4[4]}, [r3]!\n"
+      "vld1.32 {d5[0]}, [r4]!\n"
+      "vld1.8 {d5[4]}, [r4]!\n"
+      "vld1.32 {d6[0]}, [r5]!\n"
+      "vld1.8 {d6[4]}, [r5]!\n"
+      "vld1.32 {d7[0]}, [r6]!\n"
+      "vld1.8 {d7[4]}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "d0", "d1", "d2", "d3", "d4",
+        "d5", "d6", "d7", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "add r6, r5, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vld1.32 {d7}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x6.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.16 {d2[2]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.16 {d3[2]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.16 {d4[2]}, [r3]!\n"
+      "vld1.32 {d5[0]}, [r4]!\n"
+      "vld1.16 {d5[2]}, [r4]!\n"
+      "vld1.32 {d6[0]}, [r5]!\n"
+      "vld1.16 {d6[2]}, [r5]!\n"
+      "vld1.32 {d7[0]}, [r6]!\n"
+      "vld1.16 {d7[2]}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "d0", "d1", "d2", "d3", "d4",
+        "d5", "d6", "d7", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add r0, %[in], %[stride]\n"
+      "add r1, r0, %[stride]\n"
+      "add r2, r1, %[stride]\n"
+      "add r3, r2, %[stride]\n"
+      "add r4, r3, %[stride]\n"
+      "add r5, r4, %[stride]\n"
+      "add r6, r5, %[stride]\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "vld1.32 {d0}, [%[in]]!\n"
+      "vld1.32 {d1}, [r0]!\n"
+      "vld1.32 {d2}, [r1]!\n"
+      "vld1.32 {d3}, [r2]!\n"
+      "vld1.32 {d4}, [r3]!\n"
+      "vld1.32 {d5}, [r4]!\n"
+      "vld1.32 {d6}, [r5]!\n"
+      "vld1.32 {d7}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x7.
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d0[2]}, [%[in]]!\n"
+      "vld1.8 {d0[6]}, [%[in]]!\n"
+      "vld1.32 {d1[0]}, [r0]!\n"
+      "vld1.16 {d1[2]}, [r0]!\n"
+      "vld1.8 {d1[6]}, [r0]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "vld1.16 {d2[2]}, [r1]!\n"
+      "vld1.8 {d2[6]}, [r1]!\n"
+      "vld1.32 {d3[0]}, [r2]!\n"
+      "vld1.16 {d3[2]}, [r2]!\n"
+      "vld1.8 {d3[6]}, [r2]!\n"
+      "vld1.32 {d4[0]}, [r3]!\n"
+      "vld1.16 {d4[2]}, [r3]!\n"
+      "vld1.8 {d4[6]}, [r3]!\n"
+      "vld1.32 {d5[0]}, [r4]!\n"
+      "vld1.16 {d5[2]}, [r4]!\n"
+      "vld1.8 {d5[6]}, [r4]!\n"
+      "vld1.32 {d6[0]}, [r5]!\n"
+      "vld1.16 {d6[2]}, [r5]!\n"
+      "vld1.8 {d6[6]}, [r5]!\n"
+      "vld1.32 {d7[0]}, [r6]!\n"
+      "vld1.16 {d7[2]}, [r6]!\n"
+      "vld1.8 {d7[6]}, [r6]!\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "ldr r0, %[multiplicative_sum_offset]\n"
+      "ldr r1, %[additive_sum_offset]\n"
+      "vmov.32 d0[0], r0\n"
+      "vdup.32 q1, r1\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "d0", "d1", "d2", "d3", "d4",
+        "d5", "d6", "d7", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31", "cc",
+        "memory");
+}
+
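+// The one-row ColumnMajorWithSum variants gather a column-major source one
+// byte at a time: each vld1.8 reads a single element and advances %[in] by
+// the column stride, so eight consecutive elements of the stream are
+// transposed into one contiguous 8-byte chunk of the packed output while q8
+// keeps the running sum for the aggregator reduction.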
+template <>
+inline void Stream<uint8_t, 1, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[4]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[5]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[6]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[4]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[5]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[6]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x1
+      "vmov.i8 d0, #0\n"
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[4]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[5]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[6]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x2
+      "vmov.i8 d0, #0\n"
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[4]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[5]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[6]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x3
+      "vmov.i8 d0, #0\n"
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[4]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[5]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[6]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x4
+      "vmov.i8 d0, #0\n"
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[4]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[5]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[6]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x5
+      "vmov.i8 d0, #0\n"
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[4]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[4]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[5]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[6]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x6
+      "vmov.i8 d0, #0\n"
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[4]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[5]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[4]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[5]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[6]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x7
+      "vmov.i8 d0, #0\n"
+      "vld1.8 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[4]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[5]}, [%[in]], %[stride]\n"
+      "vld1.8 {d0[6]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vst1.32 {d0}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d16, d16, d16\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d2", "d3", "d16", "d17", "cc", "memory");
+}
+
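+// The two-row ColumnMajorWithSum variants read a two-byte column slice per
+// step (vld1.16, one byte from each of the two rows) and then vuzp.8
+// de-interleaves the pair, leaving the first row's bytes in d0 and the second
+// row's in d1 so each packed row stays contiguous and keeps its own sum
+// accumulator (q8 and q9).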
+template <>
+inline void Stream<uint8_t, 2, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x1
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x2
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x3
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x4
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x5
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[0]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x6
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x7
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vld1.16 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[2]}, [%[in]], %[stride]\n"
+      "vld1.16 {d0[3]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.16 {d1[2]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vuzp.8 d0, d1\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vst1.32 {d0, d1}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "cc", "memory");
+}
+
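+// The 3-row specializations use interleaved vld3.8 lane loads, which split
+// each 3-byte column directly into one d register per row, so no vuzp/vtrn
+// shuffle is needed. Only three row sums exist, so the reduction pairs d20
+// with itself, duplicating the third row's sum into the padding lane of the
+// final store.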
+template <>
+inline void Stream<uint8_t, 3, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[3], d1[3], d2[3]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[4], d1[4], d2[4]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[5], d1[5], d2[5]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[6], d1[6], d2[6]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[7], d1[7], d2[7]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[3], d1[3], d2[3]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[4], d1[4], d2[4]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[5], d1[5], d2[5]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[6], d1[6], d2[6]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[7], d1[7], d2[7]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x1
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[3], d1[3], d2[3]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[4], d1[4], d2[4]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[5], d1[5], d2[5]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[6], d1[6], d2[6]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[7], d1[7], d2[7]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x2
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[3], d1[3], d2[3]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[4], d1[4], d2[4]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[5], d1[5], d2[5]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[6], d1[6], d2[6]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[7], d1[7], d2[7]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x3
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[3], d1[3], d2[3]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[4], d1[4], d2[4]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[5], d1[5], d2[5]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[6], d1[6], d2[6]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[7], d1[7], d2[7]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x4
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[3], d1[3], d2[3]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[3], d1[3], d2[3]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[4], d1[4], d2[4]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[5], d1[5], d2[5]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[6], d1[6], d2[6]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[7], d1[7], d2[7]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x5
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[3], d1[3], d2[3]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[4], d1[4], d2[4]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[3], d1[3], d2[3]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[4], d1[4], d2[4]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[5], d1[5], d2[5]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[6], d1[6], d2[6]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[7], d1[7], d2[7]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x6
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[3], d1[3], d2[3]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[4], d1[4], d2[4]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[5], d1[5], d2[5]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "cc",
+        "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[3], d1[3], d2[3]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[4], d1[4], d2[4]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[5], d1[5], d2[5]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[6], d1[6], d2[6]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[7], d1[7], d2[7]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x7
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vld3.8 {d0[0], d1[0], d2[0]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[1], d1[1], d2[1]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[2], d1[2], d2[2]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[3], d1[3], d2[3]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[4], d1[4], d2[4]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[5], d1[5], d2[5]}, [%[in]], %[stride]\n"
+      "vld3.8 {d0[6], d1[6], d2[6]}, [%[in]], %[stride]\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vst1.32 {d0, d1, d2}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d20\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "cc",
+        "memory");
+}
+
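+// The 4-row specializations load each 4-byte column with a single vld1.32
+// lane load, then transpose with vtrn.16/vtrn.8 so that each of d0-d3 holds
+// one row's packed bytes across the 8 columns before accumulation and store.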
+template <>
+inline void Stream<uint8_t, 4, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x1
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x2
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x3
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x4
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x5
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x6
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x7
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vld1.32 {d0[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vst1.32 {d16, d17}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d16", "d17", "d18", "d19", "d20", "d21", "d22",
+        "d23", "cc", "memory");
+}
+
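+// The five-row specializations that follow use the same scheme with one
+// twist: each column is read as a 4-byte vld1.32 with pointer writeback
+// (advancing %[in] by four) plus a vld1.8 for the fifth row, so the column
+// stride is pre-decremented by four ("sub %[stride], %[stride], #4"). Each
+// eight-column chunk stores 40 bytes (rows 0-3 transposed, then row 4), and
+// the final sums are written as eight int32 values of which only the first
+// five are meaningful; the remainder is padding.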
+template <>
+inline void Stream<uint8_t, 5, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.8 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.8 {d4[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.8 {d4[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.8 {d4[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.8 {d4[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d16", "d17", "d18", "d19", "d20", "d21",
+        "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.8 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.8 {d4[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.8 {d4[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.8 {d4[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.8 {d4[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x1
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d16", "d17", "d18", "d19", "d20", "d21",
+        "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.8 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.8 {d4[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.8 {d4[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.8 {d4[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.8 {d4[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x2
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d16", "d17", "d18", "d19", "d20", "d21",
+        "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.8 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.8 {d4[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.8 {d4[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.8 {d4[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.8 {d4[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x3
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d16", "d17", "d18", "d19", "d20", "d21",
+        "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.8 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.8 {d4[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.8 {d4[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.8 {d4[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.8 {d4[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x4
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.8 {d4[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d16", "d17", "d18", "d19", "d20", "d21",
+        "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.8 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.8 {d4[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.8 {d4[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.8 {d4[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.8 {d4[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x5
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.8 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.8 {d4[4]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d16", "d17", "d18", "d19", "d20", "d21",
+        "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.8 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.8 {d4[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.8 {d4[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.8 {d4[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.8 {d4[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x6
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.8 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.8 {d4[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.8 {d4[5]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d16", "d17", "d18", "d19", "d20", "d21",
+        "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.8 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.8 {d4[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.8 {d4[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.8 {d4[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.8 {d4[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x7
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.8 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.8 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.8 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.8 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.8 {d4[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.8 {d4[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.8 {d4[6]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d24\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d16", "d17", "d18", "d19", "d20", "d21",
+        "d22", "d23", "d24", "d25", "cc", "memory");
+}
+
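+// The six-row specializations that follow read rows 4 and 5 of each column
+// together with a single vld1.16, so d4/d5 hold interleaved (row 4, row 5)
+// byte pairs; "vuzp.8 d4, d5" then de-interleaves them so that d4 holds row
+// 4 and d5 holds row 5 across the eight columns. Each chunk stores 48 bytes
+// (rows 0-5, eight columns each), and the sums are again written as eight
+// int32 values with only the first six meaningful.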
+template <>
+inline void Stream<uint8_t, 6, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.16 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.16 {d5[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.16 {d5[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.16 {d5[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.16 {d5[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.16 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.16 {d5[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.16 {d5[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.16 {d5[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.16 {d5[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x1
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.16 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.16 {d5[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.16 {d5[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.16 {d5[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.16 {d5[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x2
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.16 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.16 {d5[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.16 {d5[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.16 {d5[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.16 {d5[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x3
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.16 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.16 {d5[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.16 {d5[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.16 {d5[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.16 {d5[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x4
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.16 {d4[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.16 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.16 {d5[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.16 {d5[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.16 {d5[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.16 {d5[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x5
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.16 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.16 {d5[0]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.16 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.16 {d5[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.16 {d5[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.16 {d5[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.16 {d5[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x6
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.16 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.16 {d5[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.16 {d5[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.16 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.16 {d5[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.16 {d5[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.16 {d5[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld1.16 {d5[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x7
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld1.16 {d4[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld1.16 {d4[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld1.16 {d4[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld1.16 {d4[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld1.16 {d5[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld1.16 {d5[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld1.16 {d5[2]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vuzp.8 d4, d5\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:128]!\n"
+      "vst1.32 {d4, d5}, [%[out]:128]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:128]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d16", "d17", "d18", "d19", "d20",
+        "d21", "d22", "d23", "d24", "d25", "d26", "d27", "cc", "memory");
+}
+
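+// Row-count handling: the 6-row variants above read each column as a 4-byte
+// vld1.32 (into d0..d3) plus a 2-byte vld1.16 (into d4/d5, un-zipped later by
+// vuzp.8), while the 7-row variants below pair the 4-byte vld1.32 with a
+// 3-byte vld3.8 that drops one byte each into d4, d5 and d6. In both cases
+// the stride is pre-decremented by 4 so that the second load's post-increment
+// lands on the start of the next column.
+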
+template <>
+inline void Stream<uint8_t, 7, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld3.8 {d4[3], d5[3], d6[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld3.8 {d4[4], d5[4], d6[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld3.8 {d4[5], d5[5], d6[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld3.8 {d4[6], d5[6], d6[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld3.8 {d4[7], d5[7], d6[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
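+// Leftover handling: specializations whose third template argument is
+// non-zero subtract that leftover from count up front and branch straight to
+// label 2 when nothing remains for the 8-column main loop. The leftover tile
+// zeroes its working registers first, so padded columns add nothing to the
+// row sums and emit deterministic zeros into the packed output. Roughly
+// (sketch only, `leftover` stands for the third template argument):
+//
+//   full_tile_columns = params.count - leftover;  // packed 8 at a time (1:)
+//   leftover_columns  = leftover;                 // one zero-padded tile (2:)
+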
+template <>
+inline void Stream<uint8_t, 7, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld3.8 {d4[3], d5[3], d6[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld3.8 {d4[4], d5[4], d6[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld3.8 {d4[5], d5[5], d6[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld3.8 {d4[6], d5[6], d6[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld3.8 {d4[7], d5[7], d6[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x1
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld3.8 {d4[3], d5[3], d6[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld3.8 {d4[4], d5[4], d6[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld3.8 {d4[5], d5[5], d6[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld3.8 {d4[6], d5[6], d6[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld3.8 {d4[7], d5[7], d6[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x2
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld3.8 {d4[3], d5[3], d6[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld3.8 {d4[4], d5[4], d6[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld3.8 {d4[5], d5[5], d6[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld3.8 {d4[6], d5[6], d6[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld3.8 {d4[7], d5[7], d6[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x3
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld3.8 {d4[3], d5[3], d6[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld3.8 {d4[4], d5[4], d6[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld3.8 {d4[5], d5[5], d6[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld3.8 {d4[6], d5[6], d6[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld3.8 {d4[7], d5[7], d6[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x4
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld3.8 {d4[3], d5[3], d6[3]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld3.8 {d4[3], d5[3], d6[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld3.8 {d4[4], d5[4], d6[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld3.8 {d4[5], d5[5], d6[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld3.8 {d4[6], d5[6], d6[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld3.8 {d4[7], d5[7], d6[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x5
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld3.8 {d4[3], d5[3], d6[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld3.8 {d4[4], d5[4], d6[4]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld3.8 {d4[3], d5[3], d6[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld3.8 {d4[4], d5[4], d6[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld3.8 {d4[5], d5[5], d6[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld3.8 {d4[6], d5[6], d6[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld3.8 {d4[7], d5[7], d6[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x6
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld3.8 {d4[3], d5[3], d6[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld3.8 {d4[4], d5[4], d6[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld3.8 {d4[5], d5[5], d6[5]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %[stride], %[stride], #4\n"
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld3.8 {d4[3], d5[3], d6[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld3.8 {d4[4], d5[4], d6[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld3.8 {d4[5], d5[5], d6[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld3.8 {d4[6], d5[6], d6[6]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[1]}, [%[in]]!\n"
+      "vld3.8 {d4[7], d5[7], d6[7]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x7
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vld1.32 {d0[0]}, [%[in]]!\n"
+      "vld3.8 {d4[0], d5[0], d6[0]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[0]}, [%[in]]!\n"
+      "vld3.8 {d4[1], d5[1], d6[1]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[0]}, [%[in]]!\n"
+      "vld3.8 {d4[2], d5[2], d6[2]}, [%[in]], %[stride]\n"
+      "vld1.32 {d3[0]}, [%[in]]!\n"
+      "vld3.8 {d4[3], d5[3], d6[3]}, [%[in]], %[stride]\n"
+      "vld1.32 {d0[1]}, [%[in]]!\n"
+      "vld3.8 {d4[4], d5[4], d6[4]}, [%[in]], %[stride]\n"
+      "vld1.32 {d1[1]}, [%[in]]!\n"
+      "vld3.8 {d4[5], d5[5], d6[5]}, [%[in]], %[stride]\n"
+      "vld1.32 {d2[1]}, [%[in]]!\n"
+      "vld3.8 {d4[6], d5[6], d6[6]}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:64]!\n"
+      "vst1.32 {d4, d5, d6}, [%[out]:64]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d28\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:64]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
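+// The 8-row variants below load whole 8-byte columns straight into d0..d7 (no
+// stride adjustment is needed since a single vld1.32 covers the column) and
+// transpose the 8x8 byte tile entirely in registers with the vtrn.8 /
+// vtrn.16 / vtrn.32 cascade before widen-accumulating into q8..q15 and
+// storing the 64 transposed bytes with a :256 alignment hint. Conceptually
+// (scalar sketch, names illustrative only):
+//
+//   for (int row = 0; row < 8; ++row)
+//     for (int col = 0; col < 8; ++col)
+//       packed_tile[row][col] = column_major_in[col * params.stride + row];
+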
+template <>
+inline void Stream<uint8_t, 8, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "vld1.32 {d3}, [%[in]], %[stride]\n"
+      "vld1.32 {d4}, [%[in]], %[stride]\n"
+      "vld1.32 {d5}, [%[in]], %[stride]\n"
+      "vld1.32 {d6}, [%[in]], %[stride]\n"
+      "vld1.32 {d7}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d16", "d17", "d18",
+        "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28",
+        "d29", "d30", "d31", "cc", "memory");
+}
+
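+// In each "Aggregator Reduction" the vpaddl.u16 step widens the 16-bit
+// per-row accumulators into pairwise 32-bit sums, the vpadd.u32 chain folds
+// every q register down to one 32-bit sum per row, and the sums are gathered
+// into d16..d19 before being scaled and offset. The 7-row variants above pair
+// the seventh sum with itself (vpadd.u32 d19, d28, d28), which duplicates it
+// into the otherwise unused eighth slot so the final four-register vst1.32
+// writes a defined value in every lane.
+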
+template <>
+inline void Stream<uint8_t, 8, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "vld1.32 {d3}, [%[in]], %[stride]\n"
+      "vld1.32 {d4}, [%[in]], %[stride]\n"
+      "vld1.32 {d5}, [%[in]], %[stride]\n"
+      "vld1.32 {d6}, [%[in]], %[stride]\n"
+      "vld1.32 {d7}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x1
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d16", "d17", "d18",
+        "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28",
+        "d29", "d30", "d31", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "vld1.32 {d3}, [%[in]], %[stride]\n"
+      "vld1.32 {d4}, [%[in]], %[stride]\n"
+      "vld1.32 {d5}, [%[in]], %[stride]\n"
+      "vld1.32 {d6}, [%[in]], %[stride]\n"
+      "vld1.32 {d7}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x2
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d16", "d17", "d18",
+        "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28",
+        "d29", "d30", "d31", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "vld1.32 {d3}, [%[in]], %[stride]\n"
+      "vld1.32 {d4}, [%[in]], %[stride]\n"
+      "vld1.32 {d5}, [%[in]], %[stride]\n"
+      "vld1.32 {d6}, [%[in]], %[stride]\n"
+      "vld1.32 {d7}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x3
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d16", "d17", "d18",
+        "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28",
+        "d29", "d30", "d31", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "vld1.32 {d3}, [%[in]], %[stride]\n"
+      "vld1.32 {d4}, [%[in]], %[stride]\n"
+      "vld1.32 {d5}, [%[in]], %[stride]\n"
+      "vld1.32 {d6}, [%[in]], %[stride]\n"
+      "vld1.32 {d7}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x4
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "vld1.32 {d3}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d16", "d17", "d18",
+        "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28",
+        "d29", "d30", "d31", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "vld1.32 {d3}, [%[in]], %[stride]\n"
+      "vld1.32 {d4}, [%[in]], %[stride]\n"
+      "vld1.32 {d5}, [%[in]], %[stride]\n"
+      "vld1.32 {d6}, [%[in]], %[stride]\n"
+      "vld1.32 {d7}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x5
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "vld1.32 {d3}, [%[in]], %[stride]\n"
+      "vld1.32 {d4}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d16", "d17", "d18",
+        "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28",
+        "d29", "d30", "d31", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "vld1.32 {d3}, [%[in]], %[stride]\n"
+      "vld1.32 {d4}, [%[in]], %[stride]\n"
+      "vld1.32 {d5}, [%[in]], %[stride]\n"
+      "vld1.32 {d6}, [%[in]], %[stride]\n"
+      "vld1.32 {d7}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x6
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "vld1.32 {d3}, [%[in]], %[stride]\n"
+      "vld1.32 {d4}, [%[in]], %[stride]\n"
+      "vld1.32 {d5}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d16", "d17", "d18",
+        "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28",
+        "d29", "d30", "d31", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "vmov.i16 q8, #0\n"
+      "vmov.i16 q9, #0\n"
+      "vmov.i16 q10, #0\n"
+      "vmov.i16 q11, #0\n"
+      "vmov.i16 q12, #0\n"
+      "vmov.i16 q13, #0\n"
+      "vmov.i16 q14, #0\n"
+      "vmov.i16 q15, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "vld1.32 {d3}, [%[in]], %[stride]\n"
+      "vld1.32 {d4}, [%[in]], %[stride]\n"
+      "vld1.32 {d5}, [%[in]], %[stride]\n"
+      "vld1.32 {d6}, [%[in]], %[stride]\n"
+      "vld1.32 {d7}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x7
+      "vmov.i8 d0, #0\n"
+      "vmov.i8 d1, #0\n"
+      "vmov.i8 d2, #0\n"
+      "vmov.i8 d3, #0\n"
+      "vmov.i8 d4, #0\n"
+      "vmov.i8 d5, #0\n"
+      "vmov.i8 d6, #0\n"
+      "vmov.i8 d7, #0\n"
+      "vld1.32 {d0}, [%[in]], %[stride]\n"
+      "vld1.32 {d1}, [%[in]], %[stride]\n"
+      "vld1.32 {d2}, [%[in]], %[stride]\n"
+      "vld1.32 {d3}, [%[in]], %[stride]\n"
+      "vld1.32 {d4}, [%[in]], %[stride]\n"
+      "vld1.32 {d5}, [%[in]], %[stride]\n"
+      "vld1.32 {d6}, [%[in]], %[stride]\n"
+      "pld [%[in]]\n"
+      "vtrn.8 d0, d1\n"
+      "vtrn.8 d2, d3\n"
+      "vtrn.8 d4, d5\n"
+      "vtrn.8 d6, d7\n"
+      "vtrn.16 d0, d2\n"
+      "vtrn.16 d1, d3\n"
+      "vtrn.16 d4, d6\n"
+      "vtrn.16 d5, d7\n"
+      "vtrn.32 d0, d4\n"
+      "vtrn.32 d1, d5\n"
+      "vtrn.32 d2, d6\n"
+      "vtrn.32 d3, d7\n"
+      "vaddw.u8 q8, q8, d0\n"
+      "vaddw.u8 q9, q9, d1\n"
+      "vaddw.u8 q10, q10, d2\n"
+      "vaddw.u8 q11, q11, d3\n"
+      "vaddw.u8 q12, q12, d4\n"
+      "vaddw.u8 q13, q13, d5\n"
+      "vaddw.u8 q14, q14, d6\n"
+      "vaddw.u8 q15, q15, d7\n"
+      "vst1.32 {d0, d1, d2, d3}, [%[out]:256]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[out]:256]!\n"
+
+      // Aggregator Reduction.
+      "vmov.32 d0[0], %[multiplicative_sum_offset]\n"
+      "vdup.32 q1, %[additive_sum_offset]\n"
+      "vpaddl.u16 q8, q8\n"
+      "vpaddl.u16 q9, q9\n"
+      "vpaddl.u16 q10, q10\n"
+      "vpaddl.u16 q11, q11\n"
+      "vpaddl.u16 q12, q12\n"
+      "vpaddl.u16 q13, q13\n"
+      "vpaddl.u16 q14, q14\n"
+      "vpaddl.u16 q15, q15\n"
+      "vpadd.u32 d16, d16, d17\n"
+      "vpadd.u32 d18, d18, d19\n"
+      "vpadd.u32 d20, d20, d21\n"
+      "vpadd.u32 d22, d22, d23\n"
+      "vpadd.u32 d24, d24, d25\n"
+      "vpadd.u32 d26, d26, d27\n"
+      "vpadd.u32 d28, d28, d29\n"
+      "vpadd.u32 d30, d30, d31\n"
+      "vpadd.u32 d16, d16, d18\n"
+      "vpadd.u32 d17, d20, d22\n"
+      "vpadd.u32 d18, d24, d26\n"
+      "vpadd.u32 d19, d28, d30\n"
+      "vmul.i32 q8, q8, d0[0]\n"
+      "vmul.i32 q9, q9, d0[0]\n"
+      "vadd.i32 q8, q8, q1\n"
+      "vadd.i32 q9, q9, q1\n"
+      "vst1.32 {d16, d17, d18, d19}, [%[out]:256]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d16", "d17", "d18",
+        "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28",
+        "d29", "d30", "d31", "cc", "memory");
+}
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#else
+#warning "Meta gemm for arm32 requires: GEMMLOWP_NEON_32!"
+#endif
+
+#endif  // GEMMLOWP_META_STREAMS_ARM_32_H_
diff --git a/meta/streams_arm_64.h b/meta/streams_arm_64.h
new file mode 100644
index 0000000..c17d336
--- /dev/null
+++ b/meta/streams_arm_64.h
@@ -0,0 +1,12273 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_STREAMS_ARM_64_H_
+#define GEMMLOWP_META_STREAMS_ARM_64_H_
+
+#ifdef GEMMLOWP_NEON_64
+
+#include <cassert>
+#include <cstdint>
+
+namespace gemmlowp {
+namespace meta {
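+
+// AArch64 counterparts of the arm32 packing specializations above: each
+// Stream<...>::Pack below copies a block of uint8 data with NEON inline
+// assembly while accumulating running sums of the packed values; the final
+// "Aggregator Reduction" multiplies those sums by multiplicative_sum_offset,
+// adds additive_sum_offset, and stores them after the packed block.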
+
+template <>
+inline void Stream<uint8_t, 1, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
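+      // Load 8 bytes, widen-accumulate them into the u16 lanes of v8, and
+      // copy them unchanged to the packed output.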
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
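+      // Collapse the eight 16-bit lane sums in v8 into one 32-bit total,
+      // scale it by multiplicative_sum_offset, add additive_sum_offset, and
+      // store it (replicated across four lanes) after the packed data.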
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x1.
+      "movi v0.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x2.
+      "movi v0.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x3.
+      "movi v0.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v0.b}[2], [%x[in]], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x4.
+      "movi v0.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x5.
+      "movi v0.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.b}[4], [%x[in]], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x6.
+      "movi v0.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 1, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 1x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 1x7.
+      "movi v0.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v0.b}[6], [%x[in]], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "v8", "v9", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x1.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], #1\n"
+      "ld1 {v1.b}[0], [x0], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "v8", "v9", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x2.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "v8", "v9", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x3.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v0.b}[2], [%x[in]], #1\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v1.b}[2], [x0], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "v8", "v9", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x4.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "v8", "v9", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x5.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.b}[4], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.b}[4], [x0], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "v8", "v9", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x6.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "v8", "v9", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 2, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 2x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 2x7.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v0.b}[6], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v1.b}[6], [x0], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "st1 {v0.2s, v1.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "v8", "v9", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x1.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], #1\n"
+      "ld1 {v1.b}[0], [x0], #1\n"
+      "ld1 {v2.b}[0], [x1], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x2.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v2.h}[0], [x1], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x3.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v0.b}[2], [%x[in]], #1\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v1.b}[2], [x0], #1\n"
+      "ld1 {v2.h}[0], [x1], #2\n"
+      "ld1 {v2.b}[2], [x1], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x4.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x5.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.b}[4], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.b}[4], [x0], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.b}[4], [x1], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x6.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 3, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 3x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 3x7.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v0.b}[6], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v1.b}[6], [x0], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "ld1 {v2.b}[6], [x1], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
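+      // Leftover-0 variant: the depth is already a multiple of 8, so there is
+      // no tail block and the four row sums feed the reduction directly after
+      // the main loop.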
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "v0", "v1", "v2", "v3", "v8", "v9", "v10", "v11",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x1.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], #1\n"
+      "ld1 {v1.b}[0], [x0], #1\n"
+      "ld1 {v2.b}[0], [x1], #1\n"
+      "ld1 {v3.b}[0], [x2], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "v0", "v1", "v2", "v3", "v8", "v9", "v10", "v11",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x2.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v2.h}[0], [x1], #2\n"
+      "ld1 {v3.h}[0], [x2], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "v0", "v1", "v2", "v3", "v8", "v9", "v10", "v11",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x3.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v0.b}[2], [%x[in]], #1\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v1.b}[2], [x0], #1\n"
+      "ld1 {v2.h}[0], [x1], #2\n"
+      "ld1 {v2.b}[2], [x1], #1\n"
+      "ld1 {v3.h}[0], [x2], #2\n"
+      "ld1 {v3.b}[2], [x2], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "v0", "v1", "v2", "v3", "v8", "v9", "v10", "v11",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x4.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "v0", "v1", "v2", "v3", "v8", "v9", "v10", "v11",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x5.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.b}[4], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.b}[4], [x0], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.b}[4], [x1], #1\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.b}[4], [x2], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "v0", "v1", "v2", "v3", "v8", "v9", "v10", "v11",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x6.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.h}[2], [x2], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "v0", "v1", "v2", "v3", "v8", "v9", "v10", "v11",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 4, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 4x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 4x7.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v0.b}[6], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v1.b}[6], [x0], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "ld1 {v2.b}[6], [x1], #1\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.h}[2], [x2], #2\n"
+      "ld1 {v3.b}[6], [x2], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "v0", "v1", "v2", "v3", "v8", "v9", "v10", "v11",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
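+      // With five rows a packed slice no longer fits in one st1 (at most four
+      // registers), so each iteration stores 32 + 8 bytes, and the reduction
+      // below emits two sum vectors: rows 0-3 in v8 and row 4 replicated in
+      // v9.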
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "v0", "v1", "v2", "v3", "v4", "v8", "v9", "v10",
+        "v11", "v12", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x1.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], #1\n"
+      "ld1 {v1.b}[0], [x0], #1\n"
+      "ld1 {v2.b}[0], [x1], #1\n"
+      "ld1 {v3.b}[0], [x2], #1\n"
+      "ld1 {v4.b}[0], [x3], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "v0", "v1", "v2", "v3", "v4", "v8", "v9", "v10",
+        "v11", "v12", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x2.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v2.h}[0], [x1], #2\n"
+      "ld1 {v3.h}[0], [x2], #2\n"
+      "ld1 {v4.h}[0], [x3], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "v0", "v1", "v2", "v3", "v4", "v8", "v9", "v10",
+        "v11", "v12", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x3.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v0.b}[2], [%x[in]], #1\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v1.b}[2], [x0], #1\n"
+      "ld1 {v2.h}[0], [x1], #2\n"
+      "ld1 {v2.b}[2], [x1], #1\n"
+      "ld1 {v3.h}[0], [x2], #2\n"
+      "ld1 {v3.b}[2], [x2], #1\n"
+      "ld1 {v4.h}[0], [x3], #2\n"
+      "ld1 {v4.b}[2], [x3], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "v0", "v1", "v2", "v3", "v4", "v8", "v9", "v10",
+        "v11", "v12", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x4.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "v0", "v1", "v2", "v3", "v4", "v8", "v9", "v10",
+        "v11", "v12", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x5.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.b}[4], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.b}[4], [x0], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.b}[4], [x1], #1\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.b}[4], [x2], #1\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v4.b}[4], [x3], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "v0", "v1", "v2", "v3", "v4", "v8", "v9", "v10",
+        "v11", "v12", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x6.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.h}[2], [x2], #2\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v4.h}[2], [x3], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "v0", "v1", "v2", "v3", "v4", "v8", "v9", "v10",
+        "v11", "v12", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 5, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 5x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 5x7.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v0.b}[6], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v1.b}[6], [x0], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "ld1 {v2.b}[6], [x1], #1\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.h}[2], [x2], #2\n"
+      "ld1 {v3.b}[6], [x2], #1\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v4.h}[2], [x3], #2\n"
+      "ld1 {v4.b}[6], [x3], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "v0", "v1", "v2", "v3", "v4", "v8", "v9", "v10",
+        "v11", "v12", "cc", "memory");
+}
+
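+// Six-row variants: rows are read through %[in] and x0-x4, and each row's
+// bytes are accumulated across the eight 16-bit lanes of v8-v13 (uaddw). The
+// "Aggregator Reduction" epilogue widens those lanes to 32 bits (uaddlp),
+// collapses each register to a single per-row sum (addp), multiplies by
+// multiplicative_sum_offset, adds additive_sum_offset, and stores the result
+// as eight int32 values immediately after the packed data; lanes beyond the
+// row count are padding.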
+template <>
+inline void Stream<uint8_t, 6, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "v0", "v1", "v2", "v3", "v4", "v5", "v8",
+        "v9", "v10", "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x1.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], #1\n"
+      "ld1 {v1.b}[0], [x0], #1\n"
+      "ld1 {v2.b}[0], [x1], #1\n"
+      "ld1 {v3.b}[0], [x2], #1\n"
+      "ld1 {v4.b}[0], [x3], #1\n"
+      "ld1 {v5.b}[0], [x4], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "v0", "v1", "v2", "v3", "v4", "v5", "v8",
+        "v9", "v10", "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x2.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v2.h}[0], [x1], #2\n"
+      "ld1 {v3.h}[0], [x2], #2\n"
+      "ld1 {v4.h}[0], [x3], #2\n"
+      "ld1 {v5.h}[0], [x4], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "v0", "v1", "v2", "v3", "v4", "v5", "v8",
+        "v9", "v10", "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x3.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v0.b}[2], [%x[in]], #1\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v1.b}[2], [x0], #1\n"
+      "ld1 {v2.h}[0], [x1], #2\n"
+      "ld1 {v2.b}[2], [x1], #1\n"
+      "ld1 {v3.h}[0], [x2], #2\n"
+      "ld1 {v3.b}[2], [x2], #1\n"
+      "ld1 {v4.h}[0], [x3], #2\n"
+      "ld1 {v4.b}[2], [x3], #1\n"
+      "ld1 {v5.h}[0], [x4], #2\n"
+      "ld1 {v5.b}[2], [x4], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "v0", "v1", "v2", "v3", "v4", "v5", "v8",
+        "v9", "v10", "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x4.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v5.s}[0], [x4], #4\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "v0", "v1", "v2", "v3", "v4", "v5", "v8",
+        "v9", "v10", "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x5.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.b}[4], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.b}[4], [x0], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.b}[4], [x1], #1\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.b}[4], [x2], #1\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v4.b}[4], [x3], #1\n"
+      "ld1 {v5.s}[0], [x4], #4\n"
+      "ld1 {v5.b}[4], [x4], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "v0", "v1", "v2", "v3", "v4", "v5", "v8",
+        "v9", "v10", "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x6.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.h}[2], [x2], #2\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v4.h}[2], [x3], #2\n"
+      "ld1 {v5.s}[0], [x4], #4\n"
+      "ld1 {v5.h}[2], [x4], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "v0", "v1", "v2", "v3", "v4", "v5", "v8",
+        "v9", "v10", "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 6, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 6x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 6x7.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v0.b}[6], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v1.b}[6], [x0], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "ld1 {v2.b}[6], [x1], #1\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.h}[2], [x2], #2\n"
+      "ld1 {v3.b}[6], [x2], #1\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v4.h}[2], [x3], #2\n"
+      "ld1 {v4.b}[6], [x3], #1\n"
+      "ld1 {v5.s}[0], [x4], #4\n"
+      "ld1 {v5.h}[2], [x4], #2\n"
+      "ld1 {v5.b}[6], [x4], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "r"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "v0", "v1", "v2", "v3", "v4", "v5", "v8",
+        "v9", "v10", "v11", "v12", "v13", "cc", "memory");
+}
+
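+// Seven-row variants: the row pointers occupy %[in] and x0-x5, so the two sum
+// offsets are passed as memory operands ("m" constraints) and loaded into
+// w0/w1 with ldr after the main loop, instead of being bound to registers as
+// in the narrower variants above.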
+template <>
+inline void Stream<uint8_t, 7, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "v0", "v1", "v2", "v3", "v4", "v5",
+        "v6", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x1.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], #1\n"
+      "ld1 {v1.b}[0], [x0], #1\n"
+      "ld1 {v2.b}[0], [x1], #1\n"
+      "ld1 {v3.b}[0], [x2], #1\n"
+      "ld1 {v4.b}[0], [x3], #1\n"
+      "ld1 {v5.b}[0], [x4], #1\n"
+      "ld1 {v6.b}[0], [x5], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "v0", "v1", "v2", "v3", "v4", "v5",
+        "v6", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x2.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v2.h}[0], [x1], #2\n"
+      "ld1 {v3.h}[0], [x2], #2\n"
+      "ld1 {v4.h}[0], [x3], #2\n"
+      "ld1 {v5.h}[0], [x4], #2\n"
+      "ld1 {v6.h}[0], [x5], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "v0", "v1", "v2", "v3", "v4", "v5",
+        "v6", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x3.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v0.b}[2], [%x[in]], #1\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v1.b}[2], [x0], #1\n"
+      "ld1 {v2.h}[0], [x1], #2\n"
+      "ld1 {v2.b}[2], [x1], #1\n"
+      "ld1 {v3.h}[0], [x2], #2\n"
+      "ld1 {v3.b}[2], [x2], #1\n"
+      "ld1 {v4.h}[0], [x3], #2\n"
+      "ld1 {v4.b}[2], [x3], #1\n"
+      "ld1 {v5.h}[0], [x4], #2\n"
+      "ld1 {v5.b}[2], [x4], #1\n"
+      "ld1 {v6.h}[0], [x5], #2\n"
+      "ld1 {v6.b}[2], [x5], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "v0", "v1", "v2", "v3", "v4", "v5",
+        "v6", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x4.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v5.s}[0], [x4], #4\n"
+      "ld1 {v6.s}[0], [x5], #4\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "v0", "v1", "v2", "v3", "v4", "v5",
+        "v6", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x5.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.b}[4], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.b}[4], [x0], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.b}[4], [x1], #1\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.b}[4], [x2], #1\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v4.b}[4], [x3], #1\n"
+      "ld1 {v5.s}[0], [x4], #4\n"
+      "ld1 {v5.b}[4], [x4], #1\n"
+      "ld1 {v6.s}[0], [x5], #4\n"
+      "ld1 {v6.b}[4], [x5], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "v0", "v1", "v2", "v3", "v4", "v5",
+        "v6", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x6.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.h}[2], [x2], #2\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v4.h}[2], [x3], #2\n"
+      "ld1 {v5.s}[0], [x4], #4\n"
+      "ld1 {v5.h}[2], [x4], #2\n"
+      "ld1 {v6.s}[0], [x5], #4\n"
+      "ld1 {v6.h}[2], [x5], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "v0", "v1", "v2", "v3", "v4", "v5",
+        "v6", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 7, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 7x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 7x7.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v0.b}[6], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v1.b}[6], [x0], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "ld1 {v2.b}[6], [x1], #1\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.h}[2], [x2], #2\n"
+      "ld1 {v3.b}[6], [x2], #1\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v4.h}[2], [x3], #2\n"
+      "ld1 {v4.b}[6], [x3], #1\n"
+      "ld1 {v5.s}[0], [x4], #4\n"
+      "ld1 {v5.h}[2], [x4], #2\n"
+      "ld1 {v5.b}[6], [x4], #1\n"
+      "ld1 {v6.s}[0], [x5], #4\n"
+      "ld1 {v6.h}[2], [x5], #2\n"
+      "ld1 {v6.b}[6], [x5], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "v0", "v1", "v2", "v3", "v4", "v5",
+        "v6", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
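+// The 8-row variants below follow the same pattern with eight accumulators
+// (v8-v15): each loop iteration copies an 8x8 block to the packed output while
+// widening-and-adding each row into its accumulator. After the copy, the eight
+// row sums are reduced, multiplied by multiplicative_sum_offset, offset by
+// additive_sum_offset, and stored as eight int32 values following the packed
+// data.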
+template <>
+inline void Stream<uint8_t, 8, 8, 0, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 0, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "add x6, x5, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "ld1 {v7.2s}, [x6], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "v0", "v1", "v2", "v3", "v4",
+        "v5", "v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 1, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 1, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "add x6, x5, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "ld1 {v7.2s}, [x6], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x1.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], #1\n"
+      "ld1 {v1.b}[0], [x0], #1\n"
+      "ld1 {v2.b}[0], [x1], #1\n"
+      "ld1 {v3.b}[0], [x2], #1\n"
+      "ld1 {v4.b}[0], [x3], #1\n"
+      "ld1 {v5.b}[0], [x4], #1\n"
+      "ld1 {v6.b}[0], [x5], #1\n"
+      "ld1 {v7.b}[0], [x6], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "v0", "v1", "v2", "v3", "v4",
+        "v5", "v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 2, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 2, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "add x6, x5, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "ld1 {v7.2s}, [x6], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x2.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v2.h}[0], [x1], #2\n"
+      "ld1 {v3.h}[0], [x2], #2\n"
+      "ld1 {v4.h}[0], [x3], #2\n"
+      "ld1 {v5.h}[0], [x4], #2\n"
+      "ld1 {v6.h}[0], [x5], #2\n"
+      "ld1 {v7.h}[0], [x6], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "v0", "v1", "v2", "v3", "v4",
+        "v5", "v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 3, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 3, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "add x6, x5, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "ld1 {v7.2s}, [x6], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x3.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], #2\n"
+      "ld1 {v0.b}[2], [%x[in]], #1\n"
+      "ld1 {v1.h}[0], [x0], #2\n"
+      "ld1 {v1.b}[2], [x0], #1\n"
+      "ld1 {v2.h}[0], [x1], #2\n"
+      "ld1 {v2.b}[2], [x1], #1\n"
+      "ld1 {v3.h}[0], [x2], #2\n"
+      "ld1 {v3.b}[2], [x2], #1\n"
+      "ld1 {v4.h}[0], [x3], #2\n"
+      "ld1 {v4.b}[2], [x3], #1\n"
+      "ld1 {v5.h}[0], [x4], #2\n"
+      "ld1 {v5.b}[2], [x4], #1\n"
+      "ld1 {v6.h}[0], [x5], #2\n"
+      "ld1 {v6.b}[2], [x5], #1\n"
+      "ld1 {v7.h}[0], [x6], #2\n"
+      "ld1 {v7.b}[2], [x6], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "v0", "v1", "v2", "v3", "v4",
+        "v5", "v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 4, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 4, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "add x6, x5, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "ld1 {v7.2s}, [x6], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x4.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v5.s}[0], [x4], #4\n"
+      "ld1 {v6.s}[0], [x5], #4\n"
+      "ld1 {v7.s}[0], [x6], #4\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "v0", "v1", "v2", "v3", "v4",
+        "v5", "v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 5, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 5, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "add x6, x5, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "ld1 {v7.2s}, [x6], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x5.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.b}[4], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.b}[4], [x0], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.b}[4], [x1], #1\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.b}[4], [x2], #1\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v4.b}[4], [x3], #1\n"
+      "ld1 {v5.s}[0], [x4], #4\n"
+      "ld1 {v5.b}[4], [x4], #1\n"
+      "ld1 {v6.s}[0], [x5], #4\n"
+      "ld1 {v6.b}[4], [x5], #1\n"
+      "ld1 {v7.s}[0], [x6], #4\n"
+      "ld1 {v7.b}[4], [x6], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "v0", "v1", "v2", "v3", "v4",
+        "v5", "v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 6, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 6, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "add x6, x5, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "ld1 {v7.2s}, [x6], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x6.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.h}[2], [x2], #2\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v4.h}[2], [x3], #2\n"
+      "ld1 {v5.s}[0], [x4], #4\n"
+      "ld1 {v5.h}[2], [x4], #2\n"
+      "ld1 {v6.s}[0], [x5], #4\n"
+      "ld1 {v6.h}[2], [x5], #2\n"
+      "ld1 {v7.s}[0], [x6], #4\n"
+      "ld1 {v7.h}[2], [x6], #2\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "v0", "v1", "v2", "v3", "v4",
+        "v5", "v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
+        "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 7, RowMajorWithSum>::Pack(
+    const uint8_t* in, const RowMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") RowMajorWithSum<uint8_t, 8, 8, 7, RowMajorWithSum>::Pack()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+      "add x0, %x[in], %x[stride]\n"
+      "add x1, x0, %x[stride]\n"
+      "add x2, x1, %x[stride]\n"
+      "add x3, x2, %x[stride]\n"
+      "add x4, x3, %x[stride]\n"
+      "add x5, x4, %x[stride]\n"
+      "add x6, x5, %x[stride]\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store: 8x8.
+      "ld1 {v0.2s}, [%x[in]], #8\n"
+      "ld1 {v1.2s}, [x0], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "ld1 {v3.2s}, [x2], #8\n"
+      "ld1 {v4.2s}, [x3], #8\n"
+      "ld1 {v5.2s}, [x4], #8\n"
+      "ld1 {v6.2s}, [x5], #8\n"
+      "ld1 {v7.2s}, [x6], #8\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store: 8x7.
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v0.h}[2], [%x[in]], #2\n"
+      "ld1 {v0.b}[6], [%x[in]], #1\n"
+      "ld1 {v1.s}[0], [x0], #4\n"
+      "ld1 {v1.h}[2], [x0], #2\n"
+      "ld1 {v1.b}[6], [x0], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "ld1 {v2.b}[6], [x1], #1\n"
+      "ld1 {v3.s}[0], [x2], #4\n"
+      "ld1 {v3.h}[2], [x2], #2\n"
+      "ld1 {v3.b}[6], [x2], #1\n"
+      "ld1 {v4.s}[0], [x3], #4\n"
+      "ld1 {v4.h}[2], [x3], #2\n"
+      "ld1 {v4.b}[6], [x3], #1\n"
+      "ld1 {v5.s}[0], [x4], #4\n"
+      "ld1 {v5.h}[2], [x4], #2\n"
+      "ld1 {v5.b}[6], [x4], #1\n"
+      "ld1 {v6.s}[0], [x5], #4\n"
+      "ld1 {v6.h}[2], [x5], #2\n"
+      "ld1 {v6.b}[6], [x5], #1\n"
+      "ld1 {v7.s}[0], [x6], #4\n"
+      "ld1 {v7.h}[2], [x6], #2\n"
+      "ld1 {v7.b}[6], [x6], #1\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "uaddw v15.8h, v15.8h, v7.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s, v7.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "ldr w0, %[multiplicative_sum_offset]\n"
+      "ldr w1, %[additive_sum_offset]\n"
+      "mov v0.s[0], w0\n"
+      "dup v1.4s, w1\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [in] "+r"(in), [out] "+r"(out)
+      : [stride] "r"(params.stride),
+        [multiplicative_sum_offset] "m"(params.multiplicative_sum_offset),
+        [additive_sum_offset] "m"(params.additive_sum_offset)
+      : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "v0", "v1", "v2", "v3", "v4",
+        "v5", "v6", "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
+        "cc", "memory");
+}
+
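+// The ColumnMajorWithSum variants gather a column-major operand into the same
+// row-major packed layout: bytes are loaded with post-indexed strided ld1 lane
+// loads so eight consecutive elements of the stream land in one register
+// before being stored contiguously. The running sum is kept in v8 and, as
+// above, scaled by multiplicative_sum_offset, offset by additive_sum_offset,
+// and stored after the packed data.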
+template <>
+inline void Stream<uint8_t, 1, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x1
+      "movi v0.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x2
+      "movi v0.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x3
+      "movi v0.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x4
+      "movi v0.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x5
+      "movi v0.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[4], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x6
+      "movi v0.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[5], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 1, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 1, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 1x8
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 1x7
+      "movi v0.8b, #0\n"
+      "ld1 {v0.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v0.b}[6], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "st1 {v0.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v8", "v0", "v1", "cc", "memory");
+}
+
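+// The two-row ColumnMajorWithSum variants load column pairs with strided ld1
+// .h lane loads and de-interleave them with uzp1/uzp2, so each store writes
+// eight packed bytes per row. Two accumulators (v8, v9) track the two row
+// sums, which are reduced and appended after the packed data in the same
+// multiplicative/additive-offset form.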
+template <>
+inline void Stream<uint8_t, 2, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v8", "v9", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x1
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v8", "v9", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x2
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v8", "v9", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x3
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v8", "v9", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x4
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v8", "v9", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x5
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[0], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v8", "v9", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x6
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v8", "v9", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 2, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 2, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 2x8
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 2x7
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "ld1 {v0.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v0.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.h}[2], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uzp1 v2.8b, v0.8b, v1.8b\n"
+      "uzp2 v3.8b, v0.8b, v1.8b\n"
+      "uaddw v8.8h, v8.8h, v2.8b\n"
+      "uaddw v9.8h, v9.8h, v3.8b\n"
+      "st1 {v2.2s, v3.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v8.4s, v8.4s, v8.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v8", "v9", "cc", "memory");
+}
+
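+// 3-row variants: ld3 loads three consecutive bytes per column and
+// de-interleaves them on the fly, one lane per row register (v0/v1/v2), so
+// no separate unzip step is needed; per-row sums accumulate in v8/v9/v10.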
+template <>
+inline void Stream<uint8_t, 3, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[3], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[4], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[5], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[6], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[3], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[4], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[5], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[6], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x1
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[3], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[4], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[5], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[6], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x2
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[3], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[4], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[5], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[6], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x3
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[3], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[4], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[5], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[6], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x4
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[3], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[4], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[5], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[6], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x5
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[3], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[4], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[3], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[4], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[5], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[6], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x6
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[3], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[4], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[5], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 3, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 3, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 3x8
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[3], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[4], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[5], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[6], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 3x7
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "ld3 {v0.b, v1.b, v2.b}[0], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[1], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[2], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[3], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[4], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[5], [%x[in]], %x[stride]\n"
+      "ld3 {v0.b, v1.b, v2.b}[6], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v10.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v8", "v9", "v10", "cc", "memory");
+}
+
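+// 4-row variants: each ld1 {..}.s load fetches one 4-byte column; two
+// trn1/trn2 stages (first on 16-bit pairs, then on single bytes) transpose
+// the eight loaded columns into four 8-byte row vectors before the
+// accumulate-and-store step, with per-row sums kept in v8-v11.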
+template <>
+inline void Stream<uint8_t, 4, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x1
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x2
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x3
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x4
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x5
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x6
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 4, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 4, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 4x8
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 4x7
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v4.4h, v0.4h, v2.4h\n"
+      "trn2 v6.4h, v0.4h, v2.4h\n"
+      "trn1 v5.4h, v1.4h, v3.4h\n"
+      "trn2 v7.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v4.8b, v5.8b\n"
+      "trn2 v1.8b, v4.8b, v5.8b\n"
+      "trn1 v2.8b, v6.8b, v7.8b\n"
+      "trn2 v3.8b, v6.8b, v7.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "st1 {v8.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "cc", "memory");
+}
+
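+// The Stream<uint8_t, 5, 8, leftover, ColumnMajorWithSum>::Pack()
+// specializations below handle five rows per column. The stride is
+// pre-decremented by 4 because each column is consumed as one 4-byte load of
+// rows 0-3 followed by a 1-byte load of row 4 that advances by the adjusted
+// stride. Row 4 is accumulated separately in v12, every packed group gains an
+// extra 8-byte store for it, and the reduction appends a second 4-lane sum
+// vector whose lanes all carry row 4's scaled total.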
+template <>
+inline void Stream<uint8_t, 5, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "cc", "memory");
+}
+
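+// The leftover-1 through leftover-7 variants of this five-row stream differ
+// from the leftover-0 case above only in their tail: the main loop is skipped
+// entirely when 'count' equals the leftover amount, the load registers are
+// zeroed so only the leftover columns contribute, and the full-width stores
+// keep the packed block at a fixed size.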
+template <>
+inline void Stream<uint8_t, 5, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x1
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x2
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x3
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x4
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x5
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[4], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x6
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[5], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 5, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 5, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 5x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 5x7
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v4.b}[6], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v5.4h, v0.4h, v2.4h\n"
+      "trn2 v7.4h, v0.4h, v2.4h\n"
+      "trn1 v6.4h, v1.4h, v3.4h\n"
+      "trn2 v13.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v5.8b, v6.8b\n"
+      "trn2 v1.8b, v5.8b, v6.8b\n"
+      "trn1 v2.8b, v7.8b, v13.8b\n"
+      "trn2 v3.8b, v7.8b, v13.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s}, [%x[out]], #8\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v12.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "cc", "memory");
+}
+
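+// The Stream<uint8_t, 6, 8, leftover, ColumnMajorWithSum>::Pack()
+// specializations below add a sixth row. Rows 4-5 of each column are loaded
+// as a 16-bit pair into v4/v5, deinterleaved into per-row vectors with
+// uzp1/uzp2 (v16 holds row 4, v17 holds row 5), and accumulated in v12/v13.
+// The reduction packs the row-4 and row-5 totals into the first two lanes of
+// the second sum vector (duplicated into its upper lanes).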
+template <>
+inline void Stream<uint8_t, 6, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x1
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x2
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x3
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
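+      // Widen the 16-bit per-row accumulators to 32 bits, then fold them with
+      // pairwise adds until v8 holds the totals for rows 0-3 and v9 holds
+      // rows 4-5 (its upper two lanes are duplicates). Each sum is scaled by
+      // multiplicative_sum_offset and offset by additive_sum_offset before
+      // being stored after the packed block.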
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x4
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x5
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[0], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x6
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 6, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 6, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 6x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 6x7
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld1 {v4.h}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld1 {v5.h}[2], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v6.4h, v0.4h, v2.4h\n"
+      "trn2 v14.4h, v0.4h, v2.4h\n"
+      "trn1 v7.4h, v1.4h, v3.4h\n"
+      "trn2 v15.4h, v1.4h, v3.4h\n"
+      "uzp1 v16.8b, v4.8b, v5.8b\n"
+      "uzp2 v17.8b, v4.8b, v5.8b\n"
+      "trn1 v0.8b, v6.8b, v7.8b\n"
+      "trn2 v1.8b, v6.8b, v7.8b\n"
+      "trn1 v2.8b, v14.8b, v15.8b\n"
+      "trn2 v3.8b, v14.8b, v15.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v16.8b\n"
+      "uaddw v13.8h, v13.8h, v17.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v16.2s, v17.2s}, [%x[out]], #16\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v12.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
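+// The seven-row specializations that follow use the same scheme with one more
+// row per column and one more per-row sum accumulator. As a rough scalar
+// sketch of what these ColumnMajorWithSum Pack() specializations compute --
+// illustrative only, not part of the build, and using hypothetical names for
+// the parameters -- the assembly appears to be equivalent to:
+//
+//   void PackColumnMajorWithSum(const uint8_t* in, int rows /* 6, 7 or 8 */,
+//                               int count, int stride, int32_t mul_offset,
+//                               int32_t add_offset, uint8_t* out) {
+//     int32_t sums[8] = {0};
+//     // Pack the columns in groups of 8; the last group is zero-padded.
+//     for (int col_block = 0; col_block < count; col_block += 8) {
+//       for (int row = 0; row < rows; ++row) {
+//         for (int i = 0; i < 8; ++i) {
+//           const int col = col_block + i;
+//           const uint8_t value = (col < count) ? in[col * stride + row] : 0;
+//           *out++ = value;      // row-major layout within each 8-wide block
+//           sums[row] += value;  // per-row running sum
+//         }
+//       }
+//     }
+//     // Trailing 8-lane sum block; lanes at and beyond 'rows' are padding
+//     // (the assembly leaves copies of earlier row sums there instead).
+//     int32_t* sums_out = reinterpret_cast<int32_t*>(out);
+//     for (int row = 0; row < 8; ++row) {
+//       sums_out[row] = sums[row] * mul_offset + add_offset;
+//     }
+//   }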
+template <>
+inline void Stream<uint8_t, 7, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
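+      // Each column contributes 7 bytes: rows 0-3 as one 32-bit lane of
+      // v0-v3, and rows 4-6 de-interleaved straight into byte lanes of
+      // v4/v5/v6 by the three-register ld3 loads.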
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x1
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x2
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x3
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x4
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[3], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x5
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[4], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x6
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[5], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 7, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 7, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "sub %x[stride], %x[stride], #4\n"
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 7x8
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[6], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[7], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 7x7
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "ld1 {v0.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[0], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[1], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[2], [%x[in]], %x[stride]\n"
+      "ld1 {v3.s}[0], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[3], [%x[in]], %x[stride]\n"
+      "ld1 {v0.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[4], [%x[in]], %x[stride]\n"
+      "ld1 {v1.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[5], [%x[in]], %x[stride]\n"
+      "ld1 {v2.s}[1], [%x[in]], #4\n"
+      "ld3 {v4.b, v5.b, v6.b}[6], [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v7.4h, v0.4h, v2.4h\n"
+      "trn2 v16.4h, v0.4h, v2.4h\n"
+      "trn1 v15.4h, v1.4h, v3.4h\n"
+      "trn2 v17.4h, v1.4h, v3.4h\n"
+      "trn1 v0.8b, v7.8b, v15.8b\n"
+      "trn2 v1.8b, v7.8b, v15.8b\n"
+      "trn1 v2.8b, v16.8b, v17.8b\n"
+      "trn2 v3.8b, v16.8b, v17.8b\n"
+      "uaddw v8.8h, v8.8h, v0.8b\n"
+      "uaddw v9.8h, v9.8h, v1.8b\n"
+      "uaddw v10.8h, v10.8h, v2.8b\n"
+      "uaddw v11.8h, v11.8h, v3.8b\n"
+      "uaddw v12.8h, v12.8h, v4.8b\n"
+      "uaddw v13.8h, v13.8h, v5.8b\n"
+      "uaddw v14.8h, v14.8h, v6.8b\n"
+      "st1 {v0.2s, v1.2s, v2.2s, v3.2s}, [%x[out]], #32\n"
+      "st1 {v4.2s, v5.2s, v6.2s}, [%x[out]], #24\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v14.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "cc", "memory");
+}
+
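+// The Stream<uint8_t, 8, 8, leftovers, ColumnMajorWithSum> specializations
+// below follow the same pattern, but with a full 8x8 byte transpose: eight
+// stride-separated 8-byte columns are loaded with ld1 {v.2s}, transposed with
+// three trn1/trn2 passes (8b, 4h, then 2s granularity), accumulated into eight
+// 16-bit row sums (v8-v15) and stored as 64 packed bytes per iteration, before
+// the leftover handling (for non-zero leftovers) and the aggregator reduction.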
+template <>
+inline void Stream<uint8_t, 8, 8, 0, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 0, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v3.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v4.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v5.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v6.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v7.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "v19", "v20",
+        "v21", "v22", "v23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 1, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 1, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v3.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v4.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v5.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v6.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v7.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x1
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "v19", "v20",
+        "v21", "v22", "v23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 2, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 2, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v3.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v4.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v5.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v6.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v7.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x2
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "v19", "v20",
+        "v21", "v22", "v23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 3, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 3, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v3.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v4.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v5.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v6.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v7.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x3
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "v19", "v20",
+        "v21", "v22", "v23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 4, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 4, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v3.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v4.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v5.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v6.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v7.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x4
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v3.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "v19", "v20",
+        "v21", "v22", "v23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 5, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 5, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v3.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v4.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v5.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v6.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v7.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x5
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v3.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v4.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "v19", "v20",
+        "v21", "v22", "v23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 6, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 6, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v3.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v4.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v5.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v6.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v7.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x6
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v3.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v4.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v5.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "v19", "v20",
+        "v21", "v22", "v23", "cc", "memory");
+}
+
+template <>
+inline void Stream<uint8_t, 8, 8, 7, ColumnMajorWithSum>::Pack(
+    const uint8_t* in, const ColumnMajorWithSum& params, uint8_t* out) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout
+      << __FILE__ << "(" << __LINE__
+      << ") ColumnMajorWithSum<uint8_t, 8, 8, 7, ColumnMajorWithSum>::Pack()"
+      << std::endl
+      << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  int params_stride_copy = params.stride;
+  asm volatile(
+      "movi v8.8h, #0\n"
+      "movi v9.8h, #0\n"
+      "movi v10.8h, #0\n"
+      "movi v11.8h, #0\n"
+      "movi v12.8h, #0\n"
+      "movi v13.8h, #0\n"
+      "movi v14.8h, #0\n"
+      "movi v15.8h, #0\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #8\n"
+
+      // Load Aggregate Store - column major 8x8
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v3.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v4.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v5.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v6.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v7.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      "bne 1b\n"
+
+      "2:"
+
+      // Load Aggregate Store - column major 8x7
+      "movi v0.8b, #0\n"
+      "movi v1.8b, #0\n"
+      "movi v2.8b, #0\n"
+      "movi v3.8b, #0\n"
+      "movi v4.8b, #0\n"
+      "movi v5.8b, #0\n"
+      "movi v6.8b, #0\n"
+      "movi v7.8b, #0\n"
+      "ld1 {v0.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v1.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v2.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v3.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v4.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v5.2s}, [%x[in]], %x[stride]\n"
+      "ld1 {v6.2s}, [%x[in]], %x[stride]\n"
+      "prfm pldl1keep, [%x[in]]\n"
+      "trn1 v16.8b, v0.8b, v1.8b\n"
+      "trn2 v17.8b, v0.8b, v1.8b\n"
+      "trn1 v18.8b, v2.8b, v3.8b\n"
+      "trn2 v19.8b, v2.8b, v3.8b\n"
+      "trn1 v20.8b, v4.8b, v5.8b\n"
+      "trn2 v21.8b, v4.8b, v5.8b\n"
+      "trn1 v22.8b, v6.8b, v7.8b\n"
+      "trn2 v23.8b, v6.8b, v7.8b\n"
+      "trn1 v0.4h, v16.4h, v18.4h\n"
+      "trn2 v2.4h, v16.4h, v18.4h\n"
+      "trn1 v1.4h, v17.4h, v19.4h\n"
+      "trn2 v3.4h, v17.4h, v19.4h\n"
+      "trn1 v4.4h, v20.4h, v22.4h\n"
+      "trn2 v6.4h, v20.4h, v22.4h\n"
+      "trn1 v5.4h, v21.4h, v23.4h\n"
+      "trn2 v7.4h, v21.4h, v23.4h\n"
+      "trn1 v16.2s, v0.2s, v4.2s\n"
+      "trn2 v20.2s, v0.2s, v4.2s\n"
+      "trn1 v17.2s, v1.2s, v5.2s\n"
+      "trn2 v21.2s, v1.2s, v5.2s\n"
+      "trn1 v18.2s, v2.2s, v6.2s\n"
+      "trn2 v22.2s, v2.2s, v6.2s\n"
+      "trn1 v19.2s, v3.2s, v7.2s\n"
+      "trn2 v23.2s, v3.2s, v7.2s\n"
+      "uaddw v8.8h, v8.8h, v16.8b\n"
+      "uaddw v9.8h, v9.8h, v17.8b\n"
+      "uaddw v10.8h, v10.8h, v18.8b\n"
+      "uaddw v11.8h, v11.8h, v19.8b\n"
+      "uaddw v12.8h, v12.8h, v20.8b\n"
+      "uaddw v13.8h, v13.8h, v21.8b\n"
+      "uaddw v14.8h, v14.8h, v22.8b\n"
+      "uaddw v15.8h, v15.8h, v23.8b\n"
+      "st1 {v16.2s, v17.2s, v18.2s, v19.2s}, [%x[out]], #32\n"
+      "st1 {v20.2s, v21.2s, v22.2s, v23.2s}, [%x[out]], #32\n"
+
+      // Aggregator Reduction.
+      "mov v0.s[0], %w[multiplicative_sum_offset]\n"
+      "dup v1.4s, %w[additive_sum_offset]\n"
+      "uaddlp v8.4s, v8.8h\n"
+      "uaddlp v9.4s, v9.8h\n"
+      "uaddlp v10.4s, v10.8h\n"
+      "uaddlp v11.4s, v11.8h\n"
+      "uaddlp v12.4s, v12.8h\n"
+      "uaddlp v13.4s, v13.8h\n"
+      "uaddlp v14.4s, v14.8h\n"
+      "uaddlp v15.4s, v15.8h\n"
+      "addp v8.4s, v8.4s, v9.4s\n"
+      "addp v10.4s, v10.4s, v11.4s\n"
+      "addp v12.4s, v12.4s, v13.4s\n"
+      "addp v14.4s, v14.4s, v15.4s\n"
+      "addp v8.4s, v8.4s, v10.4s\n"
+      "addp v9.4s, v12.4s, v14.4s\n"
+      "mul v8.4s, v8.4s, v0.s[0]\n"
+      "mul v9.4s, v9.4s, v0.s[0]\n"
+      "add v8.4s, v8.4s, v1.4s\n"
+      "add v9.4s, v9.4s, v1.4s\n"
+      "st1 {v8.4s, v9.4s}, [%x[out]]\n"
+      : [count] "+r"(params_count_copy), [stride] "+r"(params_stride_copy),
+        [out] "+r"(out), [in] "+r"(in)
+      : [additive_sum_offset] "r"(params.additive_sum_offset),
+        [multiplicative_sum_offset] "r"(params.multiplicative_sum_offset)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9", "v10",
+        "v11", "v12", "v13", "v14", "v15", "v16", "v17", "v18", "v19", "v20",
+        "v21", "v22", "v23", "cc", "memory");
+}
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#else
+#warning "Meta gemm for arm64 requires: GEMMLOWP_NEON_64!"
+#endif
+
+#endif  // GEMMLOWP_META_STREAMS_ARM_64_H_
diff --git a/meta/test_gemm_correctness.cc b/meta/test_gemm_correctness.cc
new file mode 100644
index 0000000..a2d704f
--- /dev/null
+++ b/meta/test_gemm_correctness.cc
@@ -0,0 +1,521 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include <unistd.h>
+#ifdef __APPLE__
+#include <sys/time.h>
+#endif
+
+#include <cstdint>
+#include <cstdlib>
+#include <ctime>
+#include <iomanip>
+#include <iostream>
+#include <map>
+#include <memory>
+#include <vector>
+
+#include "multi_thread_gemm.h"
+#include "quantized_mul_kernels.h"
+#include "single_thread_gemm.h"
+#include "streams.h"
+
+#define LHS_OFFSET (-127)
+#define RHS_OFFSET (-127)
+#define SUM_OFFSET (127)
+#define MUL_OFFSET (1)
+#define SHIFT (7)
+#define FLOAT_SCALE (0.333f)
+
+using namespace gemmlowp::meta;
+
+// Input, output & kernel setups.
+
+typedef GemmParams<std::uint8_t, std::uint8_t, RowMajorWithSum,
+                   ColumnMajorWithSum, QuantizedStaticPreprocessed, RowMajor>
+    ParamsColumnMajor;
+
+typedef GemmParams<std::uint8_t, std::uint8_t, RowMajorWithSum, RowMajorWithSum,
+                   QuantizedStaticPreprocessed, RowMajor>
+    ParamsRowMajor;
+
+typedef GemmParams<std::uint8_t, float, RowMajorWithSum, ColumnMajorWithSum,
+                   QuantizedStaticPreprocessedAsFloat, RowMajor>
+    ParamsColumnMajorAsFloat;
+
+typedef GemmParams<std::uint8_t, float, RowMajorWithSum, RowMajorWithSum,
+                   QuantizedStaticPreprocessedAsFloat, RowMajor>
+    ParamsRowMajorAsFloat;
+
+typedef GemmParams<std::uint8_t, std::int32_t, RowMajorWithSum,
+                   ColumnMajorWithSum, QuantizedStaticPreprocessedAsInt32,
+                   RowMajor>
+    ParamsColumnMajorAsInt32;
+
+typedef GemmParams<std::uint8_t, std::int32_t, RowMajorWithSum, RowMajorWithSum,
+                   QuantizedStaticPreprocessedAsInt32, RowMajor>
+    ParamsRowMajorAsInt32;
+
+typedef gemmlowp::WorkersPool Pool;
+typedef SimpleContext<gemmlowp::WorkersPool> Context;
+
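+// The executor selection below picks between GemmExecutorPackRHSCacheFriendly
+// (the default) and GemmExecutorPackLHSCacheFriendly (when LHS_PACK is defined
+// at build time).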
+#ifdef LHS_PACK
+typedef GemmExecutorPackLHSCacheFriendly<> Executor;
+#else
+typedef GemmExecutorPackRHSCacheFriendly<> Executor;
+#endif
+
+// Testing helper functions.
+
+void prepare_test_data(std::uint8_t* data, std::int32_t rows, std::int32_t cols,
+                       std::int32_t seed, std::int32_t seed_2) {
+  std::int32_t value = seed;
+  for (int i = 0; i < rows * cols; ++i) {
+    data[i] = static_cast<std::uint8_t>(value);
+    value = ((value * seed_2) + seed) % 256;
+  }
+}
+
+template <typename CLEAR_TYPE>
+void clear(int rows, int cols, CLEAR_TYPE* data) {
+  for (int i = 0; i < rows * cols; ++i) {
+    data[i] = 0;
+  }
+}
+
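+// Reference check for the uint8 output path. Each result entry is expected to
+// equal
+//   clamp_to_uint8(((sum_k (lhs[i][k] + LHS_OFFSET) * (rhs[j][k] + RHS_OFFSET)
+//                    + SUM_OFFSET * depth) * MUL_OFFSET + rounding) >> SHIFT)
+// with rounding = 1 << (SHIFT - 1), matching the kernel parameters configured
+// by the setup_* helpers further down. check_row_col differs only in indexing
+// the RHS as rhs[j + k * cols]; the _f and _i32 variants check the float and
+// int32 output paths, which apply FLOAT_SCALE or no scaling instead of the
+// SUM_OFFSET / MUL_OFFSET / SHIFT requantization.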
+bool check_row_row(std::uint8_t* lhs, std::uint8_t* rhs, std::uint8_t* results,
+                   int rows, int cols, int depth) {
+  int wrong = 0;
+  int rounding = (1 << (SHIFT - 1));
+  for (int i = 0; i < rows; ++i) {
+    for (int j = 0; j < cols; ++j) {
+      int expected = 0;
+      for (int k = 0; k < depth; ++k) {
+        expected += (static_cast<int>(lhs[depth * i + k]) + LHS_OFFSET) *
+                    (static_cast<int>(rhs[depth * j + k]) + RHS_OFFSET);
+      }
+      expected += SUM_OFFSET * depth;
+      expected *= MUL_OFFSET;
+      expected += rounding;
+      expected = (expected >> SHIFT);
+      if (expected < 0) {
+        expected = 0;
+      } else if (expected > 255) {
+        expected = 255;
+      }
+      expected = static_cast<int>(static_cast<std::uint8_t>(expected));
+      int actual = static_cast<int>(results[i * cols + j]);
+      if (actual != expected) {
+        std::cout << "Wrong @" << i << "x" << j << " : " << actual
+                  << " != " << expected << std::endl;
+        wrong++;
+      }
+    }
+  }
+  if (wrong != 0) {
+    std::cout << wrong << "/" << (rows * cols) << std::endl;
+  }
+  return wrong == 0;
+}
+
+bool check_row_col(std::uint8_t* lhs, std::uint8_t* rhs, std::uint8_t* results,
+                   int rows, int cols, int depth) {
+  int wrong = 0;
+  int rounding = (1 << (SHIFT - 1));
+  for (int i = 0; i < rows; ++i) {
+    for (int j = 0; j < cols; ++j) {
+      int expected = 0;
+      for (int k = 0; k < depth; ++k) {
+        expected += (static_cast<int>(lhs[depth * i + k]) + LHS_OFFSET) *
+                    (static_cast<int>(rhs[j + k * cols]) + RHS_OFFSET);
+      }
+      expected += SUM_OFFSET * depth;
+      expected *= MUL_OFFSET;
+      expected += rounding;
+      expected = (expected >> SHIFT);
+      if (expected < 0) {
+        expected = 0;
+      } else if (expected > 255) {
+        expected = 255;
+      }
+      expected = static_cast<int>(static_cast<std::uint8_t>(expected));
+      int actual = static_cast<int>(results[i * cols + j]);
+      if (actual != expected) {
+        wrong++;
+      }
+    }
+  }
+  return wrong == 0;
+}
+
+bool check_row_row_f(std::uint8_t* lhs, std::uint8_t* rhs, float* results,
+                     int rows, int cols, int depth) {
+  int wrong = 0;
+  for (int i = 0; i < rows; ++i) {
+    for (int j = 0; j < cols; ++j) {
+      int expected = 0;
+      for (int k = 0; k < depth; ++k) {
+        expected += (static_cast<int>(lhs[depth * i + k]) + LHS_OFFSET) *
+                    (static_cast<int>(rhs[depth * j + k]) + RHS_OFFSET);
+      }
+      float expected_float = static_cast<float>(expected) * FLOAT_SCALE;
+      float actual = results[i * cols + j];
+      if (actual != expected_float) {
+        wrong++;
+      }
+    }
+  }
+  return wrong == 0;
+}
+
+bool check_row_col_f(std::uint8_t* lhs, std::uint8_t* rhs, float* results,
+                     int rows, int cols, int depth) {
+  int wrong = 0;
+  for (int i = 0; i < rows; ++i) {
+    for (int j = 0; j < cols; ++j) {
+      int expected = 0;
+      for (int k = 0; k < depth; ++k) {
+        expected += (static_cast<int>(lhs[depth * i + k]) + LHS_OFFSET) *
+                    (static_cast<int>(rhs[j + k * cols]) + RHS_OFFSET);
+      }
+      float expected_float = static_cast<float>(expected) * FLOAT_SCALE;
+      float actual = results[i * cols + j];
+      if (actual != expected_float) {
+        wrong++;
+      }
+    }
+  }
+  return wrong == 0;
+}
+
+bool check_row_row_i32(std::uint8_t* lhs, std::uint8_t* rhs,
+                       std::int32_t* results, int rows, int cols, int depth) {
+  int wrong = 0;
+  for (int i = 0; i < rows; ++i) {
+    for (int j = 0; j < cols; ++j) {
+      int expected = 0;
+      for (int k = 0; k < depth; ++k) {
+        expected += (static_cast<int>(lhs[depth * i + k]) + LHS_OFFSET) *
+                    (static_cast<int>(rhs[depth * j + k]) + RHS_OFFSET);
+      }
+      int actual = results[i * cols + j];
+      if (actual != expected) {
+        wrong++;
+      }
+    }
+  }
+  return wrong == 0;
+}
+
+bool check_row_col_i32(std::uint8_t* lhs, std::uint8_t* rhs,
+                       std::int32_t* results, int rows, int cols, int depth) {
+  int wrong = 0;
+  for (int i = 0; i < rows; ++i) {
+    for (int j = 0; j < cols; ++j) {
+      int expected = 0;
+      for (int k = 0; k < depth; ++k) {
+        expected += (static_cast<int>(lhs[depth * i + k]) + LHS_OFFSET) *
+                    (static_cast<int>(rhs[j + k * cols]) + RHS_OFFSET);
+      }
+      int actual = results[i * cols + j];
+      if (actual != expected) {
+        wrong++;
+      }
+    }
+  }
+  return wrong == 0;
+}
+
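+// The packing streams emit per-row / per-column sums alongside the packed
+// data. Since
+//   sum_k (lhs + LHS_OFFSET) * (rhs + RHS_OFFSET)
+//     = sum_k lhs * rhs + RHS_OFFSET * sum_k lhs + LHS_OFFSET * sum_k rhs
+//       + depth * LHS_OFFSET * RHS_OFFSET,
+// each stream's sums are scaled by the other operand's offset (the
+// multiplicative_sum_offset values set here), while the constant
+// depth * LHS_OFFSET * RHS_OFFSET term (plus SUM_OFFSET * depth on the uint8
+// output path) is folded into the left stream's additive_sum_offset by the
+// setup_* functions below.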
+template <typename PARAMS, typename RESULT_TYPE>
+void setup_params(std::uint8_t* lhs, std::uint8_t* rhs, RESULT_TYPE* result,
+                  std::uint8_t* scratch, PARAMS* params) {
+  params->lhs = lhs;
+  params->rhs = rhs;
+  params->result = result;
+  params->scratch = scratch;
+
+  params->left_stream.multiplicative_sum_offset = RHS_OFFSET;
+  params->left_stream.additive_sum_offset = 0;
+
+  params->right_stream.multiplicative_sum_offset = LHS_OFFSET;
+  params->right_stream.additive_sum_offset = 0;
+}
+
+void setup_row_row(int m, int n, int k, ParamsRowMajor* params) {
+  params->m = m;
+  params->n = n;
+  params->k = k;
+  params->left_stream.count = k;
+  params->left_stream.stride = k;
+  params->left_stream.additive_sum_offset =
+      SUM_OFFSET * k + k * LHS_OFFSET * RHS_OFFSET;
+  params->right_stream.count = k;
+  params->right_stream.stride = k;
+  params->fused_kernel.kernel.count = k;
+  params->fused_kernel.kernel.multiplicative_offset = MUL_OFFSET;
+  params->fused_kernel.kernel.rounding_offset = (1 << (SHIFT - 1));
+  params->fused_kernel.kernel.shift = -SHIFT;
+  params->fused_kernel.output_stream.stride = n;
+}
+
+void setup_row_col(int m, int n, int k, ParamsColumnMajor* params) {
+  params->m = m;
+  params->n = n;
+  params->k = k;
+  params->left_stream.count = k;
+  params->left_stream.stride = k;
+  params->left_stream.additive_sum_offset =
+      SUM_OFFSET * k + k * LHS_OFFSET * RHS_OFFSET;
+  params->right_stream.count = k;
+  params->right_stream.stride = n;
+  params->fused_kernel.kernel.count = k;
+  params->fused_kernel.kernel.multiplicative_offset = MUL_OFFSET;
+  params->fused_kernel.kernel.rounding_offset = (1 << (SHIFT - 1));
+  params->fused_kernel.kernel.shift = -SHIFT;
+  params->fused_kernel.output_stream.stride = n;
+}
+
+void setup_row_row_f(int m, int n, int k, ParamsRowMajorAsFloat* params) {
+  params->m = m;
+  params->n = n;
+  params->k = k;
+  params->left_stream.count = k;
+  params->left_stream.stride = k;
+  params->left_stream.additive_sum_offset = k * LHS_OFFSET * RHS_OFFSET;
+  params->right_stream.count = k;
+  params->right_stream.stride = k;
+  params->fused_kernel.kernel.count = k;
+  params->fused_kernel.kernel.scale = FLOAT_SCALE;
+  params->fused_kernel.output_stream.stride = n * sizeof(float);
+}
+
+void setup_row_col_f(int m, int n, int k, ParamsColumnMajorAsFloat* params) {
+  params->m = m;
+  params->n = n;
+  params->k = k;
+  params->left_stream.count = k;
+  params->left_stream.stride = k;
+  params->left_stream.additive_sum_offset = k * LHS_OFFSET * RHS_OFFSET;
+  params->right_stream.count = k;
+  params->right_stream.stride = n;
+  params->fused_kernel.kernel.count = k;
+  params->fused_kernel.kernel.scale = FLOAT_SCALE;
+  params->fused_kernel.output_stream.stride = n * sizeof(float);
+}
+
+void setup_row_row_i32(int m, int n, int k, ParamsRowMajorAsInt32* params) {
+  params->m = m;
+  params->n = n;
+  params->k = k;
+  params->left_stream.count = k;
+  params->left_stream.stride = k;
+  params->left_stream.additive_sum_offset = k * LHS_OFFSET * RHS_OFFSET;
+  params->right_stream.count = k;
+  params->right_stream.stride = k;
+  params->fused_kernel.kernel.count = k;
+  params->fused_kernel.output_stream.stride = n * sizeof(std::int32_t);
+}
+
+void setup_row_col_i32(int m, int n, int k, ParamsColumnMajorAsInt32* params) {
+  params->m = m;
+  params->n = n;
+  params->k = k;
+  params->left_stream.count = k;
+  params->left_stream.stride = k;
+  params->left_stream.additive_sum_offset = k * LHS_OFFSET * RHS_OFFSET;
+  params->right_stream.count = k;
+  params->right_stream.stride = n;
+  params->fused_kernel.kernel.count = k;
+  params->fused_kernel.output_stream.stride = n * sizeof(std::int32_t);
+}
+
+int main() {
+  ParamsRowMajor params_row;
+  ParamsColumnMajor params_col;
+  ParamsRowMajorAsFloat params_row_f;
+  ParamsColumnMajorAsFloat params_col_f;
+  ParamsRowMajorAsInt32 params_row_i32;
+  ParamsColumnMajorAsInt32 params_col_i32;
+
+  std::unique_ptr<std::uint8_t[]> lhs(new std::uint8_t[1024 * 1024]);
+  std::unique_ptr<std::uint8_t[]> rhs(new std::uint8_t[1024 * 1024]);
+  std::unique_ptr<std::uint8_t[]> result(new std::uint8_t[1024 * 1024]);
+  std::unique_ptr<float[]> result_f(new float[1024 * 1024]);
+  std::unique_ptr<std::int32_t[]> result_i32(new std::int32_t[1024 * 1024]);
+  std::unique_ptr<std::uint8_t[]> scratch(new std::uint8_t[4048 * 1024]);
+
+  setup_params(lhs.get(), rhs.get(), result.get(), scratch.get(), &params_row);
+  setup_params(lhs.get(), rhs.get(), result.get(), scratch.get(), &params_col);
+  setup_params(lhs.get(), rhs.get(), result_f.get(), scratch.get(),
+               &params_row_f);
+  setup_params(lhs.get(), rhs.get(), result_f.get(), scratch.get(),
+               &params_col_f);
+  setup_params(lhs.get(), rhs.get(), result_i32.get(), scratch.get(),
+               &params_row_i32);
+  setup_params(lhs.get(), rhs.get(), result_i32.get(), scratch.get(),
+               &params_col_i32);
+
+  Pool pool;
+  Context context(4, &pool);
+
+  for (int i = 1; i < 16; ++i) {
+    for (int j = 1; j < 16; ++j) {
+      for (int k = 1; k < 24; ++k) {
+        prepare_test_data(lhs.get(), i, k, 11, 13);
+        prepare_test_data(rhs.get(), j, k, 13, 17);
+
+        clear(i, j, result.get());
+        setup_row_row(i, j, k, &params_row);
+        Gemm<Executor, ParamsRowMajor, 2, 4, 8>(params_row);
+        if (!check_row_row(lhs.get(), rhs.get(), result.get(), i, j, k)) {
+          std::cout << "Row: " << i << "x" << j << "x" << k << " : ERROR"
+                    << std::endl;
+          std::cout << "Exiting." << std::endl;
+          std::exit(1);
+        }
+
+        clear(i, j, result.get());
+        setup_row_col(i, j, k, &params_col);
+        Gemm<Executor, ParamsColumnMajor, 2, 4, 8>(params_col);
+        if (!check_row_col(lhs.get(), rhs.get(), result.get(), i, j, k)) {
+          std::cout << "Column: " << i << "x" << j << "x" << k << " : ERROR"
+                    << std::endl;
+          std::cout << "Exiting." << std::endl;
+          std::exit(1);
+        }
+
+        clear(i, j, result_f.get());
+        setup_row_row_f(i, j, k, &params_row_f);
+        Gemm<Executor, ParamsRowMajorAsFloat, 2, 4, 8>(params_row_f);
+        if (!check_row_row_f(lhs.get(), rhs.get(), result_f.get(), i, j, k)) {
+          std::cout << "RowAsFloat: " << i << "x" << j << "x" << k << " : ERROR"
+                    << std::endl;
+          std::cout << "Exiting." << std::endl;
+          std::exit(1);
+        }
+
+        clear(i, j, result_f.get());
+        setup_row_col_f(i, j, k, &params_col_f);
+        Gemm<Executor, ParamsColumnMajorAsFloat, 2, 4, 8>(params_col_f);
+        if (!check_row_col_f(lhs.get(), rhs.get(), result_f.get(), i, j, k)) {
+          std::cout << "ColumnAsFloat: " << i << "x" << j << "x" << k
+                    << " : ERROR" << std::endl;
+          std::cout << "Exiting." << std::endl;
+          std::exit(1);
+        }
+
+        clear(i, j, result_i32.get());
+        setup_row_row_i32(i, j, k, &params_row_i32);
+        Gemm<Executor, ParamsRowMajorAsInt32, 2, 4, 8>(params_row_i32);
+        if (!check_row_row_i32(lhs.get(), rhs.get(), result_i32.get(), i, j,
+                               k)) {
+          std::cout << "RowAsInt32: " << i << "x" << j << "x" << k << " : ERROR"
+                    << std::endl;
+          std::cout << "Exiting." << std::endl;
+          std::exit(1);
+        }
+
+        clear(i, j, result_i32.get());
+        setup_row_col_i32(i, j, k, &params_col_i32);
+        Gemm<Executor, ParamsColumnMajorAsInt32, 2, 4, 8>(params_col_i32);
+        if (!check_row_col_i32(lhs.get(), rhs.get(), result_i32.get(), i, j,
+                               k)) {
+          std::cout << "ColumnAsInt32: " << i << "x" << j << "x" << k
+                    << " : ERROR" << std::endl;
+          std::cout << "Exiting." << std::endl;
+          std::exit(1);
+        }
+      }
+    }
+  }
+
+  for (int i = 1; i < 1024; i += 211) {
+    for (int j = 1; j < 1024; j += 211) {
+      for (int k = 8; k < 1024; k += 111) {
+        prepare_test_data(lhs.get(), i, k, 11, 13);
+        prepare_test_data(rhs.get(), j, k, 13, 17);
+
+        clear(i, j, result.get());
+        setup_row_row(i, j, k, &params_row);
+        MultiThreadGemm<Context, Executor, ParamsRowMajor, 2, 4, 8>(&context,
+                                                                    params_row);
+        if (!check_row_row(lhs.get(), rhs.get(), result.get(), i, j, k)) {
+          std::cout << "Row(MT): " << i << "x" << j << "x" << k << " : ERROR"
+                    << std::endl;
+          std::cout << "Exiting." << std::endl;
+          std::exit(1);
+        }
+
+        clear(i, j, result.get());
+        setup_row_col(i, j, k, &params_col);
+        MultiThreadGemm<Context, Executor, ParamsColumnMajor, 2, 4, 8>(
+            &context, params_col);
+        if (!check_row_col(lhs.get(), rhs.get(), result.get(), i, j, k)) {
+          std::cout << "Column(MT): " << i << "x" << j << "x" << k << " : ERROR"
+                    << std::endl;
+          std::cout << "Exiting." << std::endl;
+          std::exit(1);
+        }
+
+        clear(i, j, result_f.get());
+        setup_row_row_f(i, j, k, &params_row_f);
+        MultiThreadGemm<Context, Executor, ParamsRowMajorAsFloat, 2, 4, 8>(
+            &context, params_row_f);
+        if (!check_row_row_f(lhs.get(), rhs.get(), result_f.get(), i, j, k)) {
+          std::cout << "RowAsFloat(MT): " << i << "x" << j << "x" << k
+                    << " : ERROR" << std::endl;
+          std::cout << "Exiting." << std::endl;
+          std::exit(1);
+        }
+
+        clear(i, j, result_f.get());
+        setup_row_col_f(i, j, k, &params_col_f);
+        MultiThreadGemm<Context, Executor, ParamsColumnMajorAsFloat, 2, 4, 8>(
+            &context, params_col_f);
+        if (!check_row_col_f(lhs.get(), rhs.get(), result_f.get(), i, j, k)) {
+          std::cout << "ColumnAsFloat(MT): " << i << "x" << j << "x" << k
+                    << " : ERROR" << std::endl;
+          std::cout << "Exiting." << std::endl;
+          std::exit(1);
+        }
+
+        clear(i, j, result_i32.get());
+        setup_row_row_i32(i, j, k, &params_row_i32);
+        MultiThreadGemm<Context, Executor, ParamsRowMajorAsInt32, 2, 4, 8>(
+            &context, params_row_i32);
+        if (!check_row_row_i32(lhs.get(), rhs.get(), result_i32.get(), i, j,
+                               k)) {
+          std::cout << "RowAsInt32(MT): " << i << "x" << j << "x" << k
+                    << " : ERROR" << std::endl;
+          std::cout << "Exiting." << std::endl;
+          std::exit(1);
+        }
+
+        clear(i, j, result_i32.get());
+        setup_row_col_i32(i, j, k, &params_col_i32);
+        MultiThreadGemm<Context, Executor, ParamsColumnMajorAsInt32, 2, 4, 8>(
+            &context, params_col_i32);
+        if (!check_row_col_i32(lhs.get(), rhs.get(), result_i32.get(), i, j,
+                               k)) {
+          std::cout << "ColumnAsInt32(MT): " << i << "x" << j << "x" << k
+                    << " : ERROR" << std::endl;
+          std::cout << "Exiting." << std::endl;
+          std::exit(1);
+        }
+      }
+    }
+  }
+
+  std::cout << "OK." << std::endl;
+  return 0;
+}
diff --git a/meta/test_streams_correctness.cc b/meta/test_streams_correctness.cc
new file mode 100644
index 0000000..7beb812
--- /dev/null
+++ b/meta/test_streams_correctness.cc
@@ -0,0 +1,182 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include <unistd.h>
+#ifdef __APPLE__
+#include <sys/time.h>
+#endif
+
+#include <cstdint>
+#include <cstdlib>
+#include <ctime>
+#include <iomanip>
+#include <iostream>
+#include <map>
+#include <memory>
+#include <vector>
+
+#include "streams.h"
+
+#define MUL_OFFSET (3)
+#define ADD_OFFSET (100)
+
+using namespace gemmlowp::meta;
+
+void prepare_row_major_data(int rows, int elements, int stride,
+                            std::uint8_t* data) {
+  for (int i = 0; i < rows * stride; ++i) {
+    data[i] = 255;
+  }
+  for (int i = 0; i < rows; ++i) {
+    for (int j = 0; j < elements; ++j) {
+      data[i * stride + j] = j % 256;
+    }
+  }
+}
+
+void prepare_column_major_data(int columns, int elements, int stride,
+                               std::uint8_t* data) {
+  for (int i = 0; i < elements * stride; ++i) {
+    data[i] = 255;
+  }
+  for (int i = 0; i < elements; ++i) {
+    for (int j = 0; j < columns; ++j) {
+      data[i * stride + j] = i % 256;
+    }
+  }
+}
+
+void print_out(std::uint8_t* result, int rows, int elements) {
+  int size = rows * ((elements + 7) / 8) * 8;
+  for (int i = 0; i < size; ++i) {
+    std::cout << static_cast<int>(result[i]) << " ";
+  }
+  std::cout << std::endl << std::flush;
+}
+
+bool check(std::uint8_t* result, int rows, int elements) {
+  int chunks = elements / 8;
+  int leftover = elements % 8;
+  for (int i = 0; i < chunks; ++i) {
+    int chunk_index = i * rows * 8;
+    int chunk_start_value = i * 8;
+    for (int j = 0; j < rows; ++j) {
+      for (int k = 0; k < 8; ++k) {
+        if (result[chunk_index + j * 8 + k] != chunk_start_value + k) {
+          return false;
+        }
+      }
+    }
+  }
+
+  int leftover_index = chunks * rows * 8;
+  int leftover_start_value = chunks * 8;
+  for (int i = 0; i < rows; ++i) {
+    for (int j = 0; j < leftover; ++j) {
+      if (result[leftover_index + i * 8 + j] != leftover_start_value + j) {
+        return false;
+      }
+    }
+  }
+
+  int expected_sum =
+      ((elements * (elements - 1)) / 2) * MUL_OFFSET + ADD_OFFSET;
+  int sums_offset = rows * ((elements + 7) / 8) * 8;
+  std::int32_t* sums = reinterpret_cast<std::int32_t*>(result + sums_offset);
+  for (int i = 0; i < rows; ++i) {
+    if (sums[i] != expected_sum) {
+      return false;
+    }
+  }
+
+  return true;
+}
+
+template <int lanes, int leftover>
+void test_2(std::uint8_t* in, std::uint8_t* out) {
+  for (int elements = 8; elements < 64; elements += 8) {
+    int all_elements = elements + leftover;
+    for (int stride = all_elements; stride < all_elements + 4; ++stride) {
+      RowMajorWithSum params;
+      params.count = all_elements;
+      params.stride = stride;
+      params.multiplicative_sum_offset = MUL_OFFSET;
+      params.additive_sum_offset = ADD_OFFSET;
+
+      prepare_row_major_data(lanes, all_elements, stride, in);
+      Stream<std::uint8_t, lanes, 8, leftover, RowMajorWithSum>::Pack(
+          in, params, out);
+      if (check(out, lanes, all_elements)) {
+        //        std::cout << "Row: " << lanes << "x8x" << leftover << " : "
+        //                  << all_elements << "@" << stride << " -- OK" <<
+        //                  std::endl;
+      } else {
+        std::cout << "Row: " << lanes << "x8x" << leftover << " : "
+                  << all_elements << "@" << stride << " -- ERROR" << std::endl;
+        std::cout << "Exiting." << std::endl;
+        std::exit(1);
+      }
+    }
+
+    for (int stride = lanes; stride < lanes + 4; ++stride) {
+      ColumnMajorWithSum params;
+      params.count = all_elements;
+      params.stride = stride;
+      params.multiplicative_sum_offset = MUL_OFFSET;
+      params.additive_sum_offset = ADD_OFFSET;
+
+      prepare_column_major_data(lanes, all_elements, stride, in);
+      Stream<std::uint8_t, lanes, 8, leftover, ColumnMajorWithSum>::Pack(
+          in, params, out);
+      if (check(out, lanes, all_elements)) {
+        //        std::cout << "Column: " << lanes << "x8x" << leftover << " : "
+        //                  << all_elements << "@" << stride << " -- OK" <<
+        //                  std::endl;
+      } else {
+        std::cout << "Column: " << lanes << "x8x" << leftover << " : "
+                  << all_elements << "@" << stride << " -- ERROR" << std::endl;
+        std::cout << "Exiting." << std::endl;
+        std::exit(1);
+      }
+    }
+  }
+}
+
+template <int lanes>
+void test(std::uint8_t* in, std::uint8_t* out) {
+  test_2<lanes, 0>(in, out);
+  test_2<lanes, 1>(in, out);
+  test_2<lanes, 2>(in, out);
+  test_2<lanes, 3>(in, out);
+  test_2<lanes, 4>(in, out);
+  test_2<lanes, 5>(in, out);
+  test_2<lanes, 6>(in, out);
+  test_2<lanes, 7>(in, out);
+}
+
+int main() {
+  std::unique_ptr<std::uint8_t[]> in(new std::uint8_t[128 * 1024]);
+  std::unique_ptr<std::uint8_t[]> out(new std::uint8_t[128 * 1024]);
+
+  test<1>(in.get(), out.get());
+  test<2>(in.get(), out.get());
+  test<3>(in.get(), out.get());
+  test<4>(in.get(), out.get());
+  test<5>(in.get(), out.get());
+  test<6>(in.get(), out.get());
+  test<7>(in.get(), out.get());
+  test<8>(in.get(), out.get());
+
+  std::cout << "Ok." << std::endl;
+  return 0;
+}
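
For readers of this test, the layout that `check()` verifies can be summarized by the following scalar sketch. It is written for illustration only (the name `reference_pack_row_major_with_sum` and its parameters are ours, not part of gemmlowp) and assumes the same convention as the test: values are packed in blocks of 8 per lane, the trailing partial block is padded to 8 bytes (the padding content itself is not checked), and the per-lane sums, scaled by the multiplicative offset and shifted by the additive offset, are appended as int32 values after the packed bytes.

```
#include <cstdint>

// Illustrative scalar reference (not the production Stream::Pack code) for
// the packed layout that check() verifies in this test.
void reference_pack_row_major_with_sum(const std::uint8_t* in, int lanes,
                                       int elements, int stride,
                                       int mul_sum_offset, int add_sum_offset,
                                       std::uint8_t* out) {
  const int padded = ((elements + 7) / 8) * 8;
  // Values: block after block of 8 bytes per lane.
  for (int block = 0; block * 8 < padded; ++block) {
    for (int lane = 0; lane < lanes; ++lane) {
      for (int k = 0; k < 8; ++k) {
        const int j = block * 8 + k;
        out[(block * lanes + lane) * 8 + k] =
            j < elements ? in[lane * stride + j] : 0;
      }
    }
  }
  // Per-lane sums, scaled and offset, appended after the packed values.
  std::int32_t* sums = reinterpret_cast<std::int32_t*>(out + lanes * padded);
  for (int lane = 0; lane < lanes; ++lane) {
    std::int32_t sum = 0;
    for (int j = 0; j < elements; ++j) sum += in[lane * stride + j];
    sums[lane] = sum * mul_sum_offset + add_sum_offset;
  }
}
```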
diff --git a/meta/test_transform_benchmark.cc b/meta/test_transform_benchmark.cc
new file mode 100644
index 0000000..db8ab7d
--- /dev/null
+++ b/meta/test_transform_benchmark.cc
@@ -0,0 +1,151 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include <unistd.h>
+#ifdef __APPLE__
+#include <sys/time.h>
+#endif
+
+#include <cstdint>
+#include <cstdlib>
+#include <ctime>
+#include <iomanip>
+#include <iostream>
+#include <map>
+#include <memory>
+#include <vector>
+
+#include "multi_thread_transform.h"
+#include "transform_kernels.h"
+
+using namespace gemmlowp::meta;
+
+double time() {
+#ifdef __APPLE__
+  timeval t;
+  gettimeofday(&t, nullptr);
+  return t.tv_sec + 1e-6 * t.tv_usec;
+#else
+  timespec t;
+  clock_gettime(CLOCK_REALTIME, &t);
+  return t.tv_sec + 1e-9 * t.tv_nsec;
+#endif
+}
+
+#define kernel_size (16)
+
+template <typename Context, typename Params>
+void run_benchmark(const std::string& name, int repetitions, int elements,
+                   Context* context, const Params& params) {
+  std::cout << "Benchmark: " << name << std::endl;
+  std::cout << "Warmup single." << std::endl;
+
+  for (int i = 0; i < 10; ++i) {
+    Transform1D<Params, kernel_size>(params);
+  }
+
+  std::cout << "Benchmark single." << std::endl;
+
+  double start = time();
+
+  for (int i = 0; i < repetitions; ++i) {
+    Transform1D<Params, kernel_size>(params);
+  }
+
+  double wall_time = time() - start;
+  double ops = static_cast<double>(elements) * repetitions;
+  std::cout << "Avg: " << (wall_time / repetitions) << std::endl;
+  std::cout << "Perf: " << static_cast<std::int64_t>(ops / wall_time) << "/s."
+            << std::endl;
+
+  std::cout << "Warmup single." << std::endl;
+
+  for (int i = 0; i < 10; ++i) {
+    MultiThreadTransform1D<Context, Params, kernel_size>(context, params);
+  }
+
+  std::cout << "Benchmark multi." << std::endl;
+
+  start = time();
+
+  for (int i = 0; i < repetitions; ++i) {
+    MultiThreadTransform1D<Context, Params, kernel_size>(context, params);
+  }
+
+  wall_time = time() - start;
+  ops = static_cast<double>(elements) * repetitions;
+  std::cout << "Avg: " << (wall_time / repetitions) << std::endl;
+  std::cout << "Perf: " << static_cast<std::int64_t>(ops / wall_time) << "/s."
+            << std::endl;
+}
+
+int main() {
+  const int repetitions = 500;
+  const int elements = 4 * 1024 * 1024;
+
+  std::unique_ptr<std::int32_t[]> int32_array(new std::int32_t[elements]);
+  std::unique_ptr<std::uint8_t[]> uint8_array(new std::uint8_t[elements]);
+  std::unique_ptr<float[]> float_array(new float[elements]);
+
+  typedef SimpleContext<gemmlowp::WorkersPool> Context;
+  Context context(4, new gemmlowp::WorkersPool());
+
+  typedef Transform1DParams<std::int32_t, std::uint8_t, Requantize>
+      RequantizeParams;
+  RequantizeParams requantize_params;
+  requantize_params.input = int32_array.get();
+  requantize_params.output = uint8_array.get();
+  requantize_params.kernel.count = elements;
+  requantize_params.kernel.input_range_min = -100.0f;
+  requantize_params.kernel.input_range_scale =
+      200.0f / ((static_cast<std::int64_t>(1) << 32) - 1);
+  requantize_params.kernel.input_range_offset =
+      static_cast<float>(std::numeric_limits<std::int32_t>::lowest());
+  requantize_params.kernel.output_range_min = -200.0f;
+  requantize_params.kernel.one_over_output_range_scale =
+      static_cast<float>((static_cast<std::int64_t>(1) << 8) - 1) / 500.0f;
+  requantize_params.kernel.output_range_offset =
+      static_cast<float>(std::numeric_limits<std::uint8_t>::lowest());
+
+  run_benchmark("Requantize", repetitions, elements, &context,
+                requantize_params);
+
+  typedef Transform1DParams<std::uint8_t, float, Dequantize> DequantizeParams;
+  DequantizeParams dequantize_params;
+  dequantize_params.input = uint8_array.get();
+  dequantize_params.output = float_array.get();
+  dequantize_params.kernel.count = elements;
+  dequantize_params.kernel.range_min = -100.0f;
+  dequantize_params.kernel.range_scale =
+      static_cast<float>((static_cast<std::int64_t>(1) << 8) - 1) / 200.0f;
+  dequantize_params.kernel.range_offset =
+      static_cast<float>(std::numeric_limits<std::uint8_t>::lowest());
+
+  run_benchmark("Dequantize", repetitions, elements, &context,
+                dequantize_params);
+
+  typedef Transform1DParams<float, std::uint8_t, Quantize> QuantizeParams;
+  QuantizeParams quantize_params;
+  quantize_params.input = float_array.get();
+  quantize_params.output = uint8_array.get();
+  quantize_params.kernel.count = elements;
+  quantize_params.kernel.range_min = -100.0f;
+  quantize_params.kernel.range_scale =
+      200.0f / ((static_cast<std::int64_t>(1) << 8) - 1);
+  quantize_params.kernel.range_offset =
+      static_cast<float>(std::numeric_limits<std::uint8_t>::lowest());
+
+  run_benchmark("Quantize", repetitions, elements, &context, quantize_params);
+
+  return 0;
+}
diff --git a/meta/test_transform_correctness.cc b/meta/test_transform_correctness.cc
new file mode 100644
index 0000000..e781ae3
--- /dev/null
+++ b/meta/test_transform_correctness.cc
@@ -0,0 +1,285 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include <unistd.h>
+#ifdef __APPLE__
+#include <sys/time.h>
+#endif
+
+#include <cstdint>
+#include <cstdlib>
+#include <ctime>
+#include <iomanip>
+#include <iostream>
+#include <map>
+#include <memory>
+#include <vector>
+
+#include "single_thread_transform.h"
+#include "transform_kernels.h"
+
+#define EPSILON (0.0001)
+
+using namespace gemmlowp::meta;
+
+typedef Transform1DParams<std::int32_t, std::uint8_t, Requantize>
+    RequantizeParams;
+typedef Transform1DParams<float, std::uint8_t, Quantize> QuantizeParams;
+typedef Transform1DParams<std::uint8_t, float, Dequantize> DequantizeParams;
+typedef Transform1DParams<std::uint8_t, std::uint8_t, MinMax<std::uint8_t>>
+    MinMaxParams;
+typedef Transform1DParams<std::uint8_t, std::int32_t, BiasAdd<std::uint8_t>>
+    BiasAddParams;
+
+void prepare_data_requantize(int count, std::int32_t* data) {
+  float scale = 4000000000.0f / static_cast<float>(count - 1);
+  for (int i = 0; i < count; ++i) {
+    float temp = -2000000000.0f + scale * i;
+    data[i] = static_cast<std::int32_t>(temp);
+  }
+}
+
+void prepare_data_quantize(int count, float* data) {
+  float scale = 200.0f / static_cast<float>(count - 1);
+  for (int i = 0; i < count; ++i) {
+    data[i] = -100 + scale * i;
+  }
+}
+
+void prepare_data_dequantize(int count, std::uint8_t* data) {
+  for (int i = 0; i < count; ++i) {
+    data[i] = static_cast<std::uint8_t>(i % 256);
+  }
+}
+
+void prepare_data_minmax(int count, std::uint8_t* data) {
+  for (int i = 0; i < count; ++i) {
+    data[i] = static_cast<std::uint8_t>(i % 256);
+  }
+}
+
+void prepare_data_biasadd(int count, std::uint8_t* data) {
+  for (int i = 0; i < count; ++i) {
+    data[i] = static_cast<std::uint8_t>(i % 256);
+  }
+}
+
+void verify_requantize(const RequantizeParams& params) {
+  for (int i = 0; i < params.kernel.count; ++i) {
+    std::uint8_t actual = params.output[i];
+    float expected = static_cast<float>(params.input[i]);
+    expected -= params.kernel.input_range_offset;
+    expected *= params.kernel.input_range_scale;
+    expected += params.kernel.input_range_min;
+    expected -= params.kernel.output_range_min;
+    expected *= params.kernel.one_over_output_range_scale;
+    expected += params.kernel.output_range_offset;
+    std::uint8_t expected_uint8 = static_cast<std::uint8_t>(expected);
+
+    if (actual != expected_uint8) {
+      std::cout << "Wrong: " << i << " : " << actual << " vs. "
+                << expected_uint8 << std::endl;
+      std::exit(1);
+    }
+  }
+  std::cout << "Requantize: OK" << std::endl;
+}
+
+void verify_quantize(const QuantizeParams& params) {
+  for (int i = 0; i < params.kernel.count; ++i) {
+    std::uint8_t actual = params.output[i];
+    float expected = params.input[i];
+    expected -= params.kernel.range_min;
+    expected *= params.kernel.range_scale;
+    expected += params.kernel.range_offset;
+    std::uint8_t expected_uint8 = static_cast<std::uint8_t>(expected);
+
+    if (actual != expected_uint8) {
+      std::cout << "Wrong: " << i << " : " << actual << " vs. "
+                << expected_uint8 << std::endl;
+      std::exit(1);
+    }
+  }
+  std::cout << "Quantize: OK" << std::endl;
+}
+
+void verify_dequantize(const DequantizeParams& params) {
+  for (int i = 0; i < params.kernel.count; ++i) {
+    float actual = params.output[i];
+    float expected = static_cast<float>(params.input[i]);
+    expected -= params.kernel.range_offset;
+    expected *= params.kernel.range_scale;
+    expected += params.kernel.range_min;
+    if (std::abs(actual - expected) > EPSILON) {
+      std::cout << std::setprecision(9) << "Wrong: " << i << " : " << actual
+                << " vs. " << expected << std::endl;
+      std::exit(1);
+    }
+  }
+  std::cout << "Dequantize: OK" << std::endl;
+}
+
+void verify_minmax(const MinMaxParams& params) {
+  for (int i = 0; i < params.kernel.count; ++i) {
+    std::uint8_t actual = params.output[i];
+    std::uint8_t expected = params.input[i];
+    expected = std::min(expected, params.kernel.max);
+    expected = std::max(expected, params.kernel.min);
+
+    if (actual != expected) {
+      std::cout << "Wrong: " << i << " : " << actual << " vs. " << expected
+                << std::endl;
+      std::exit(1);
+    }
+  }
+  std::cout << "MinMax: OK" << std::endl;
+}
+
+void verify_biasadd(const BiasAddParams& params) {
+  for (int i = 0; i < params.kernel.rows * params.kernel.count; ++i) {
+    std::int32_t actual = params.output[i];
+    std::uint8_t input = params.input[i];
+    std::uint8_t bias = params.kernel.bias[i % params.kernel.count];
+    float input_float = static_cast<float>(input);
+    input_float -= params.kernel.input_range_offset;
+    input_float *= params.kernel.input_range_scale;
+    input_float += params.kernel.input_range_min;
+    float bias_float = static_cast<float>(bias);
+    bias_float -= params.kernel.bias_range_offset;
+    bias_float *= params.kernel.bias_range_scale;
+    bias_float += params.kernel.bias_range_min;
+    float sum = input_float + bias_float;
+    sum -= params.kernel.output_range_min;
+    sum *= params.kernel.one_over_output_range_scale;
+    sum += params.kernel.output_range_offset;
+    std::int32_t expected = static_cast<std::int32_t>(sum);
+    if (std::abs(actual - expected) > 1024) {
+      std::cout << "Wrong: " << i << " : " << actual << " vs. " << expected
+                << std::endl;
+      std::exit(1);
+    }
+  }
+  std::cout << "BiasAdd: OK" << std::endl;
+}
+
+int main() {
+  std::unique_ptr<std::int32_t[]> array_int32(new std::int32_t[128 * 1024]);
+  std::unique_ptr<std::uint8_t[]> array_uint8(new std::uint8_t[128 * 1024]);
+  std::unique_ptr<std::uint8_t[]> array_uint8_2(new std::uint8_t[128 * 1024]);
+  std::unique_ptr<float[]> array_float(new float[128 * 1024]);
+
+  {
+    RequantizeParams requantize_params;
+    requantize_params.input = array_int32.get();
+    requantize_params.output = array_uint8.get();
+    requantize_params.kernel.count = 12345;
+    requantize_params.kernel.input_range_min = -100.0f;
+    requantize_params.kernel.input_range_scale =
+        200.0f / ((static_cast<std::int64_t>(1) << 32) - 1);
+    requantize_params.kernel.input_range_offset =
+        static_cast<float>(std::numeric_limits<std::int32_t>::lowest());
+    requantize_params.kernel.output_range_min = -100.f;
+    requantize_params.kernel.one_over_output_range_scale =
+        static_cast<float>((static_cast<std::int64_t>(1) << 8) - 1) / 200.0f;
+    requantize_params.kernel.output_range_offset =
+        static_cast<float>(std::numeric_limits<std::uint8_t>::lowest());
+
+    prepare_data_requantize(12345, array_int32.get());
+
+    Transform1D<RequantizeParams, 16>(requantize_params);
+
+    verify_requantize(requantize_params);
+  }
+
+  {
+    QuantizeParams quantize_params;
+    quantize_params.input = array_float.get();
+    quantize_params.output = array_uint8.get();
+    quantize_params.kernel.count = 12345;
+    quantize_params.kernel.range_min = -100.0f;
+    quantize_params.kernel.range_scale =
+        static_cast<float>((static_cast<std::int64_t>(1) << 8) - 1) / 200.0f;
+    quantize_params.kernel.range_offset =
+        static_cast<float>(std::numeric_limits<std::uint8_t>::lowest());
+
+    prepare_data_quantize(12345, array_float.get());
+
+    Transform1D<QuantizeParams, 16>(quantize_params);
+
+    verify_quantize(quantize_params);
+  }
+
+  {
+    DequantizeParams dequantize_params;
+    dequantize_params.input = array_uint8.get();
+    dequantize_params.output = array_float.get();
+    dequantize_params.kernel.count = 12345;
+    dequantize_params.kernel.range_min = -100.0f;
+    dequantize_params.kernel.range_scale =
+        200.0f / ((static_cast<std::int64_t>(1) << 8) - 1);
+    dequantize_params.kernel.range_offset =
+        static_cast<float>(std::numeric_limits<std::uint8_t>::lowest());
+
+    prepare_data_dequantize(12345, array_uint8.get());
+
+    Transform1D<DequantizeParams, 16>(dequantize_params);
+
+    verify_dequantize(dequantize_params);
+  }
+
+  {
+    MinMaxParams minmax_params;
+    minmax_params.input = array_uint8.get();
+    minmax_params.output = array_uint8_2.get();
+    minmax_params.kernel.count = 12345;
+    minmax_params.kernel.min = 64;
+    minmax_params.kernel.max = 192;
+
+    prepare_data_minmax(12345, array_uint8.get());
+
+    Transform1D<MinMaxParams, 16>(minmax_params);
+
+    verify_minmax(minmax_params);
+  }
+
+  {
+    BiasAddParams biasadd_params;
+    biasadd_params.input = array_uint8.get();
+    biasadd_params.output = array_int32.get();
+    biasadd_params.kernel.count = 1234;
+    biasadd_params.kernel.rows = 11;
+    biasadd_params.kernel.input_range_min = -100.0f;
+    biasadd_params.kernel.bias_range_min = -100.0f;
+    biasadd_params.kernel.output_range_min = -250.0f;
+    biasadd_params.kernel.input_range_offset =
+        static_cast<float>(std::numeric_limits<std::uint8_t>::lowest());
+    biasadd_params.kernel.bias_range_offset =
+        static_cast<float>(std::numeric_limits<std::uint8_t>::lowest());
+    biasadd_params.kernel.output_range_offset =
+        static_cast<float>(std::numeric_limits<std::int32_t>::lowest());
+    biasadd_params.kernel.input_range_scale =
+        200.0f / ((static_cast<std::int64_t>(1) << 8) - 1);
+    biasadd_params.kernel.bias_range_scale =
+        200.0f / ((static_cast<std::int64_t>(1) << 8) - 1);
+    biasadd_params.kernel.one_over_output_range_scale =
+        static_cast<float>((static_cast<std::int64_t>(1) << 32) - 1) / 500.0f;
+    biasadd_params.kernel.bias = array_uint8_2.get();
+
+    prepare_data_biasadd(1234 * 11, array_uint8.get());
+    prepare_data_biasadd(1234, array_uint8_2.get());
+
+    Transform1D<BiasAddParams, 16>(biasadd_params);
+
+    verify_biasadd(biasadd_params);
+  }
+
+  return 0;
+}
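
The parameter blocks built in this test all follow the same convention: a real-valued range [range_min, range_max] is spread linearly across the 255 usable uint8 codes. As a hedged sketch (the helper names MakeQuantize and MakeDequantize are ours, not part of the library), the constants used above can be derived from an arbitrary range as follows; the test is the special case [-100, 100].

```
#include <cstdint>
#include <limits>

#include "transform_kernels.h"

// Hypothetical helpers deriving kernel parameters from a real range
// [range_min, range_max].
gemmlowp::meta::Quantize MakeQuantize(float range_min, float range_max,
                                      int count) {
  gemmlowp::meta::Quantize params;
  params.range_min = range_min;
  params.range_scale = 255.0f / (range_max - range_min);  // real -> codes
  params.range_offset =
      static_cast<float>(std::numeric_limits<std::uint8_t>::lowest());
  params.count = count;
  return params;
}

gemmlowp::meta::Dequantize MakeDequantize(float range_min, float range_max,
                                          int count) {
  gemmlowp::meta::Dequantize params;
  params.range_min = range_min;
  params.range_scale = (range_max - range_min) / 255.0f;  // codes -> real
  params.range_offset =
      static_cast<float>(std::numeric_limits<std::uint8_t>::lowest());
  params.count = count;
  return params;
}
```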
diff --git a/meta/transform_kernels.h b/meta/transform_kernels.h
new file mode 100644
index 0000000..4489656
--- /dev/null
+++ b/meta/transform_kernels.h
@@ -0,0 +1,244 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_TRANSFORM_KERNELS_H_
+#define GEMMLOWP_META_TRANSFORM_KERNELS_H_
+
+#include "base.h"
+
+namespace gemmlowp {
+namespace meta {
+
+struct Quantize {
+  float range_min;
+  float range_offset;
+  float range_scale;
+  int count;
+};
+
+struct Dequantize {
+  float range_min;
+  float range_offset;
+  float range_scale;
+  int count;
+};
+
+struct Requantize {
+  float input_range_min;
+  float input_range_offset;
+  float input_range_scale;
+  float output_range_min;
+  float output_range_offset;
+  float one_over_output_range_scale;
+  int count;
+};
+
+template <typename Type>
+struct MinMax {
+  Type min;
+  Type max;
+  int count;
+};
+
+template <typename BiasType>
+struct BiasAdd {
+  float input_range_min;
+  float input_range_offset;
+  float input_range_scale;
+  float bias_range_min;
+  float bias_range_offset;
+  float bias_range_scale;
+  float output_range_min;
+  float output_range_offset;
+  float one_over_output_range_scale;
+  int count;
+  int rows;
+  const BiasType* bias;
+};
+
+template <typename InType, typename OutType, int kernel_size, int leftovers>
+class Transform1DKernel<InType, OutType, Quantize, kernel_size, leftovers> {
+ public:
+  static void Transform(const InType* in, const Quantize& params,
+                        OutType* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "Quantize::Transform(" << std::string(typeid(InType).name())
+              << ", " << std::string(typeid(OutType).name()) << ") -- "
+              << kernel_size << "x" << leftovers << std::endl;
+#endif
+#else
+    std::cerr << "FATAL: Quantize::Transform not implemented." << std::endl;
+    std::exit(1);
+#endif
+  }
+};
+
+template <typename InType, typename OutType, int kernel_size, int leftovers>
+class Transform1DKernel<InType, OutType, Dequantize, kernel_size, leftovers> {
+ public:
+  static void Transform(const InType* in, const Dequantize& params,
+                        OutType* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "Dequantize::Transform(" << std::string(typeid(InType).name())
+              << ", " << std::string(typeid(OutType).name()) << ") -- "
+              << kernel_size << "x" << leftovers << std::endl;
+#endif
+#else
+    std::cerr << "FATAL: Dequantize::Transform not implemented." << std::endl;
+    std::exit(1);
+#endif
+  }
+};
+
+template <typename InType, typename OutType, int kernel_size, int leftovers>
+class Transform1DKernel<InType, OutType, Requantize, kernel_size, leftovers> {
+ public:
+  static void Transform(const InType* in, const Requantize& params,
+                        OutType* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "Requantize::Transform(" << std::string(typeid(InType).name())
+              << ", " << std::string(typeid(OutType).name()) << ") -- "
+              << kernel_size << "x" << leftovers << std::endl;
+#endif
+#else
+    std::cerr << "FATAL: Requantize::Transform not implemented." << std::endl;
+    std::exit(1);
+#endif
+  }
+};
+
+template <typename InType, typename OutType, int kernel_size, int leftovers,
+          typename Type>
+class Transform1DKernel<InType, OutType, MinMax<Type>, kernel_size, leftovers> {
+ public:
+  static void Transform(const InType* in, const MinMax<Type>& params,
+                        OutType* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "MinMax::Transform(" << std::string(typeid(InType).name())
+              << ", " << std::string(typeid(OutType).name()) << ") -- "
+              << kernel_size << "x" << leftovers << std::endl;
+#endif
+#else
+    std::cerr << "FATAL: MinMax::Transform not implemented." << std::endl;
+    std::exit(1);
+#endif
+  }
+};
+
+template <typename InType, typename OutType, int kernel_size, int leftovers,
+          typename Type>
+class Transform1DKernel<InType, OutType, BiasAdd<Type>, kernel_size,
+                        leftovers> {
+ public:
+  static void Transform(const InType* in, const BiasAdd<Type>& params,
+                        OutType* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+    std::cout << "BiasAdd::Transform(" << std::string(typeid(InType).name())
+              << ", " << std::string(typeid(OutType).name()) << ") -- "
+              << kernel_size << "x" << leftovers << std::endl;
+#endif
+#else
+    std::cerr << "FATAL: BiasAdd::Transform not implemented." << std::endl;
+    std::exit(1);
+#endif
+  }
+};
+
+template <typename InType, typename OutType>
+class Transform1DUtil<InType, OutType, Quantize> {
+ public:
+  static int EstimateComputeCost(const Quantize& params) {
+    return params.count * 8;
+  }
+
+  static const InType* OffsetInput(const Quantize& params, const InType* input,
+                                   int offset) {
+    return input + offset;
+  }
+
+  static OutType* OffsetOutput(const Quantize& params, OutType* output,
+                               int offset) {
+    return output + offset;
+  }
+};
+
+template <typename InType, typename OutType>
+class Transform1DUtil<InType, OutType, Requantize> {
+ public:
+  static int EstimateComputeCost(const Requantize& params) {
+    return params.count * 12;
+  }
+
+  static const InType* OffsetInput(const Requantize& params,
+                                   const InType* input, int offset) {
+    return input + offset;
+  }
+
+  static OutType* OffsetOutput(const Requantize& params, OutType* output,
+                               int offset) {
+    return output + offset;
+  }
+};
+
+template <typename InType, typename OutType>
+class Transform1DUtil<InType, OutType, Dequantize> {
+ public:
+  static int EstimateComputeCost(const Dequantize& params) {
+    return params.count * 12;
+  }
+
+  static const InType* OffsetInput(const Dequantize& params,
+                                   const InType* input, int offset) {
+    return input + offset;
+  }
+
+  static OutType* OffsetOutput(const Dequantize& params, OutType* output,
+                               int offset) {
+    return output + offset;
+  }
+};
+
+template <typename InType, typename OutType, typename MinMaxType>
+class Transform1DUtil<InType, OutType, MinMax<MinMaxType>> {
+ public:
+  static int EstimateComputeCost(const MinMax<MinMaxType>& params) {
+    return params.count * 4;
+  }
+
+  static const InType* OffsetInput(const MinMax<MinMaxType>& params,
+                                   const InType* input, int offset) {
+    return input + offset;
+  }
+
+  static OutType* OffsetOutput(const MinMax<MinMaxType>& params,
+                               OutType* output, int offset) {
+    return output + offset;
+  }
+};
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#ifdef GEMMLOWP_NEON_32
+#include "transform_kernels_arm_32.h"
+#elif defined(GEMMLOWP_NEON_64)
+#include "transform_kernels_arm_64.h"
+#endif
+
+#endif  // GEMMLOWP_META_TRANSFORM_KERNELS_H_
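
Before reading the NEON specializations in the ARM headers below, it may help to see the per-element arithmetic they implement in scalar form. The following sketch mirrors the reference math used by verify_requantize in the correctness test above; it is an illustration only (RequantizeOne is our name), and the NEON kernels additionally saturate the final narrowing conversions (vqmovn.s32 / vqmovun.s16).

```
#include <cstdint>

#include "transform_kernels.h"

// Scalar illustration of the Requantize per-element math: map an int32 code
// into its real range, then re-quantize that real value into uint8 codes.
inline std::uint8_t RequantizeOne(std::int32_t x,
                                  const gemmlowp::meta::Requantize& params) {
  float value = static_cast<float>(x);
  value -= params.input_range_offset;  // int32 code relative to INT32_MIN
  value *= params.input_range_scale;   // codes -> real units
  value += params.input_range_min;     // real value represented by x
  value -= params.output_range_min;    // position inside the output range
  value *= params.one_over_output_range_scale;  // real units -> uint8 codes
  value += params.output_range_offset;
  return static_cast<std::uint8_t>(value);
}
```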
diff --git a/meta/transform_kernels_arm_32.h b/meta/transform_kernels_arm_32.h
new file mode 100644
index 0000000..64e744e
--- /dev/null
+++ b/meta/transform_kernels_arm_32.h
@@ -0,0 +1,8109 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_TRANSFORM_KERNELS_ARM_32_H_
+#define GEMMLOWP_META_TRANSFORM_KERNELS_ARM_32_H_
+
+#ifdef GEMMLOWP_NEON_32
+
+#include <cassert>
+#include <cstdint>
+
+namespace gemmlowp {
+namespace meta {
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 0>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 0>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 1>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 1>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.8 {d0[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 2>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 2>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.16 {d0[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 3>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 3>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.16 {d0[0]}, [%[output]]!\n"
+      "vst1.8 {d0[2]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 4>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 4>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 5>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 5>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d2[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovun.s16 d0, q0\n"
+
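+      // Store the 5 leftover bytes: four with a single 32-bit lane store,
+      // then the fifth byte individually.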
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "vst1.8 {d0[4]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 6>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 6>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "vst1.16 {d0[2]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 7>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 7>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2}, [%[input]]!\n"
+      "vld1.32 {d3[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "vst1.16 {d0[2]}, [%[output]]!\n"
+      "vst1.8 {d0[6]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 8>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 8>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #8\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 9>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 9>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #9\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.8 {d1[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 10>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 10>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #10\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.16 {d1[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 11>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 11>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #11\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4}, [%[input]]!\n"
+      "vld1.32 {d5[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.16 {d1[0]}, [%[output]]!\n"
+      "vst1.8 {d1[2]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 12>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 12>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #12\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 13>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 13>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #13\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5}, [%[input]]!\n"
+      "vld1.32 {d6[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "vst1.8 {d1[4]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 14>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 14>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #14\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "vst1.16 {d1[2]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 15>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 15>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "vdup.32 q4, %[input_range_min]\n"
+      "vdup.32 q5, %[output_range_min]\n"
+      "vdup.32 q6, %[input_range_offset]\n"
+      "vdup.32 q7, %[input_range_scale]\n"
+      "vdup.32 q8, %[one_over_output_range_scale]\n"
+      "vsub.f32 q4, q4, q5\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #15\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6}, [%[input]]!\n"
+      "vld1.32 {d7[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q6\n"
+      "vsub.f32 q1, q1, q6\n"
+      "vsub.f32 q2, q2, q6\n"
+      "vsub.f32 q3, q3, q6\n"
+      "vmul.f32 q0, q0, q7\n"
+      "vmul.f32 q1, q1, q7\n"
+      "vmul.f32 q2, q2, q7\n"
+      "vmul.f32 q3, q3, q7\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q8\n"
+      "vmul.f32 q1, q1, q8\n"
+      "vmul.f32 q2, q2, q8\n"
+      "vmul.f32 q3, q3, q8\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "vst1.16 {d1[2]}, [%[output]]!\n"
+      "vst1.8 {d1[6]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "d14", "d15", "d16", "d17", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 0>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 0>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 1>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 1>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.8 {d0[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 2>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 2>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.16 {d0[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 3>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 3>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.16 {d0[0]}, [%[output]]!\n"
+      "vst1.8 {d0[2]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 4>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 4>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 5>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 5>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d2[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "vst1.8 {d0[4]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 6>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 6>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "vst1.16 {d0[2]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 7>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 7>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2}, [%[input]]!\n"
+      "vld1.32 {d3[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "vst1.16 {d0[2]}, [%[output]]!\n"
+      "vst1.8 {d0[6]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 8>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 8>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #8\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovun.s16 d0, q0\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 9>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 9>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #9\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.8 {d1[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 10>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 10>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #10\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.16 {d1[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 11>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 11>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #11\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4}, [%[input]]!\n"
+      "vld1.32 {d5[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.16 {d1[0]}, [%[output]]!\n"
+      "vst1.8 {d1[2]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 12>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 12>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #12\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 13>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 13>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #13\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5}, [%[input]]!\n"
+      "vld1.32 {d6[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "vst1.8 {d1[4]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 14>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 14>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #14\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "vst1.16 {d1[2]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 15>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 15>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #15\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6, d7}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "vld1.32 {d0, d1, d2, d3}, [%[input]]!\n"
+      "vld1.32 {d4, d5, d6}, [%[input]]!\n"
+      "vld1.32 {d7[0]}, [%[input]]!\n"
+      "pld [%[input], #64]\n"
+      "vsub.f32 q0, q0, q4\n"
+      "vsub.f32 q1, q1, q4\n"
+      "vsub.f32 q2, q2, q4\n"
+      "vsub.f32 q3, q3, q4\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q5\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vadd.f32 q3, q3, q5\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+      "vqmovn.s32 d0, q0\n"
+      "vqmovn.s32 d1, q1\n"
+      "vqmovn.s32 d4, q2\n"
+      "vqmovn.s32 d5, q3\n"
+      "vqmovun.s16 d0, q0\n"
+      "vqmovun.s16 d1, q2\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "vst1.16 {d1[2]}, [%[output]]!\n"
+      "vst1.8 {d1[6]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
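+// The Dequantize kernels below mirror the Quantize kernels above: the last
+// template parameter is again the leftover count (count % 16), handled after
+// the 16-element main loop with partial loads/stores.  Each uint8 input is
+// widened u8 -> u16 -> s32 (vmovl), converted to float (vcvt.f32.s32), and
+// then transformed as
+//   (float(in) - range_offset) * range_scale + range_min
+// i.e. the inverse of the quantization above, assuming the same params
+// layout (range_min, range_offset, range_scale).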
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 0>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 0>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 1>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 1>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.8 {d0[0]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 2>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 2>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.16 {d0[0]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 3>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 3>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.16 {d0[0]}, [%[input]]!\n"
+      "vld1.8 {d0[2]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 4>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 4>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 5>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 5>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "vld1.8 {d0[4]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "vst1.32 {d2[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 6>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 6>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "vld1.16 {d0[2]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+
+      "vst1.32 {d0, d1, d2}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 7>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 7>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "vld1.16 {d0[2]}, [%[input]]!\n"
+      "vld1.8 {d0[6]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+
+      "vst1.32 {d0, d1, d2}, [%[output]]!\n"
+      "vst1.32 {d3[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 8>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 8>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #8\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 9>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 9>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #9\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.8 {d1[0]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 10>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 10>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #10\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.16 {d1[0]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 11>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 11>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #11\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.16 {d1[0]}, [%[input]]!\n"
+      "vld1.8 {d1[2]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4}, [%[output]]!\n"
+      "vst1.32 {d5[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 12>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 12>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #12\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 13>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 13>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #13\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "vld1.8 {d1[4]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5}, [%[output]]!\n"
+      "vst1.32 {d6[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 14>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 14>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #14\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "vld1.16 {d1[2]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 15>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 15>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "vdup.32 q4, %[range_min]\n"
+      "vdup.32 q5, %[range_offset]\n"
+      "vdup.32 q6, %[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #15\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // Dequantize::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "vld1.16 {d1[2]}, [%[input]]!\n"
+      "vld1.8 {d1[6]}, [%[input]]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vsub.f32 q0, q0, q5\n"
+      "vsub.f32 q1, q1, q5\n"
+      "vsub.f32 q2, q2, q5\n"
+      "vsub.f32 q3, q3, q5\n"
+      "vmul.f32 q0, q0, q6\n"
+      "vmul.f32 q1, q1, q6\n"
+      "vmul.f32 q2, q2, q6\n"
+      "vmul.f32 q3, q3, q6\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q4\n"
+      "vadd.f32 q3, q3, q4\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6}, [%[output]]!\n"
+      "vst1.32 {d7[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10",
+        "d11", "d12", "d13", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              0>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "0>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
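+      // Per element: out = min(max(in, params.min), params.max).
+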
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              1>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "1>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.8 {d0[0]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.8 {d0[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              2>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "2>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.16 {d0[0]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.16 {d0[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              3>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "3>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.16 {d0[0]}, [%[input]]!\n"
+      "vld1.8 {d0[2]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.16 {d0[0]}, [%[output]]!\n"
+      "vst1.8 {d0[2]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              4>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "4>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              5>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "5>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "vld1.8 {d0[4]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "vst1.8 {d0[4]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              6>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "6>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "vld1.16 {d0[2]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "vst1.16 {d0[2]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              7>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "7>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "vld1.16 {d0[2]}, [%[input]]!\n"
+      "vld1.8 {d0[6]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "vst1.16 {d0[2]}, [%[output]]!\n"
+      "vst1.8 {d0[6]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              8>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "8>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #8\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              9>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "9>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #9\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.8 {d1[0]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.8 {d1[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              10>::Transform(const uint8_t* input,
+                                             const MinMax<uint8_t>& params,
+                                             uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "10>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #10\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.16 {d1[0]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.16 {d1[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              11>::Transform(const uint8_t* input,
+                                             const MinMax<uint8_t>& params,
+                                             uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "11>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #11\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.16 {d1[0]}, [%[input]]!\n"
+      "vld1.8 {d1[2]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.16 {d1[0]}, [%[output]]!\n"
+      "vst1.8 {d1[2]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              12>::Transform(const uint8_t* input,
+                                             const MinMax<uint8_t>& params,
+                                             uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "12>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #12\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              13>::Transform(const uint8_t* input,
+                                             const MinMax<uint8_t>& params,
+                                             uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "13>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #13\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "vld1.8 {d1[4]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "vst1.8 {d1[4]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              14>::Transform(const uint8_t* input,
+                                             const MinMax<uint8_t>& params,
+                                             uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "14>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #14\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "vld1.16 {d1[2]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "vst1.16 {d1[2]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              15>::Transform(const uint8_t* input,
+                                             const MinMax<uint8_t>& params,
+                                             uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "15>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "vdup.8 q4, %[min]\n"
+      "vdup.8 q5, %[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %[count], %[count], #15\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %[count], %[count], #16\n"
+
+      // MinMax::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "vld1.16 {d1[2]}, [%[input]]!\n"
+      "vld1.8 {d1[6]}, [%[input]]!\n"
+      "pld [%[input], #16]\n"
+      "vmax.u8 q0, q0, q4\n"
+      "vmin.u8 q0, q0, q5\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "vst1.16 {d1[2]}, [%[output]]!\n"
+      "vst1.8 {d1[6]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "d0", "d1", "d8", "d9", "d10", "d11", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              0>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "0>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
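+      // Broadcast the quantization parameters into NEON registers:
+      //   q8  = input_range_min     q9  = input_range_scale
+      //   q10 = bias_range_min      q11 = bias_range_scale
+      //   q12 = output_range_min    q13 = one_over_output_range_scale
+      //   q14 = output_range_offset
+      // Per element:
+      //   out = int32(((in * input_range_scale + input_range_min)
+      //                + (bias * bias_range_scale + bias_range_min)
+      //                - output_range_min) * one_over_output_range_scale
+      //               + output_range_offset)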
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              1>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "1>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #1\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "vld1.8 {d0[0]}, [%[input]]!\n"
+      "vld1.8 {d2[0]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q1, d2\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q1, d2\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q10\n"
+      "vadd.f32 q0, q0, q1\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+
+      "vst1.32 {d0[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              2>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "2>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #2\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "vld1.16 {d0[0]}, [%[input]]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q1, d2\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q1, d2\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q10\n"
+      "vadd.f32 q0, q0, q1\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              3>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "3>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #3\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "vld1.16 {d0[0]}, [%[input]]!\n"
+      "vld1.8 {d0[2]}, [%[input]]!\n"
+      "vld1.16 {d2[0]}, [r1]!\n"
+      "vld1.8 {d2[2]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q1, d2\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q1, d2\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q10\n"
+      "vadd.f32 q0, q0, q1\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+
+      "vst1.32 {d0}, [%[output]]!\n"
+      "vst1.32 {d1[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              4>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "4>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #4\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
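+      // Main loop: widen 16 input and 16 bias bytes to float, apply each
+      // operand's range scale and minimum, add the pairs, then map the
+      // sums back to int32 using the output range parameters.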
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
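+      // Leftover path: the remaining 4 elements of the row.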
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "vld1.32 {d2[0]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q1, d2\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q1, d2\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q10\n"
+      "vadd.f32 q0, q0, q1\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              5>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "5>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #5\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "vld1.8 {d0[4]}, [%[input]]!\n"
+      "vld1.32 {d4[0]}, [r1]!\n"
+      "vld1.8 {d4[4]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q2, d4\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q3, d5\n"
+      "vmovl.s16 q2, d4\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q11\n"
+      "vmul.f32 q3, q3, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q10\n"
+      "vadd.f32 q3, q3, q10\n"
+      "vadd.f32 q0, q0, q2\n"
+      "vadd.f32 q1, q1, q3\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+
+      "vst1.32 {d0, d1}, [%[output]]!\n"
+      "vst1.32 {d2[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              6>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "6>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #6\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "vld1.16 {d0[2]}, [%[input]]!\n"
+      "vld1.32 {d4[0]}, [r1]!\n"
+      "vld1.16 {d4[2]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q2, d4\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q3, d5\n"
+      "vmovl.s16 q2, d4\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q11\n"
+      "vmul.f32 q3, q3, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q10\n"
+      "vadd.f32 q3, q3, q10\n"
+      "vadd.f32 q0, q0, q2\n"
+      "vadd.f32 q1, q1, q3\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+
+      "vst1.32 {d0, d1, d2}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              7>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "7>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #7\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0[0]}, [%[input]]!\n"
+      "vld1.16 {d0[2]}, [%[input]]!\n"
+      "vld1.8 {d0[6]}, [%[input]]!\n"
+      "vld1.32 {d4[0]}, [r1]!\n"
+      "vld1.16 {d4[2]}, [r1]!\n"
+      "vld1.8 {d4[6]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q2, d4\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q3, d5\n"
+      "vmovl.s16 q2, d4\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q11\n"
+      "vmul.f32 q3, q3, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q10\n"
+      "vadd.f32 q3, q3, q10\n"
+      "vadd.f32 q0, q0, q2\n"
+      "vadd.f32 q1, q1, q3\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+
+      "vst1.32 {d0, d1, d2}, [%[output]]!\n"
+      "vst1.32 {d3[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              8>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "8>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #8\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d4}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q2, d4\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q3, d5\n"
+      "vmovl.s16 q2, d4\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q11\n"
+      "vmul.f32 q3, q3, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q10\n"
+      "vadd.f32 q3, q3, q10\n"
+      "vadd.f32 q0, q0, q2\n"
+      "vadd.f32 q1, q1, q3\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              9>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "9>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #9\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
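+      // Leftover path: the remaining 9 elements, loaded as 8 bytes plus
+      // one trailing byte.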
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.8 {d1[0]}, [%[input]]!\n"
+      "vld1.32 {d6}, [r1]!\n"
+      "vld1.8 {d7[0]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q4, d7\n"
+      "vmovl.u8 q3, d6\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q5, d8\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q4, d7\n"
+      "vmovl.s16 q3, d6\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q11\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q10\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q0, q0, q3\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              10>::Transform(const uint8_t* input,
+                                             const BiasAdd<uint8_t>& params,
+                                             int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "10>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #10\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.16 {d1[0]}, [%[input]]!\n"
+      "vld1.32 {d6}, [r1]!\n"
+      "vld1.16 {d7[0]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q4, d7\n"
+      "vmovl.u8 q3, d6\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q5, d8\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q4, d7\n"
+      "vmovl.s16 q3, d6\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q11\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q10\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q0, q0, q3\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              11>::Transform(const uint8_t* input,
+                                             const BiasAdd<uint8_t>& params,
+                                             int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "11>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #11\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.16 {d1[0]}, [%[input]]!\n"
+      "vld1.8 {d1[2]}, [%[input]]!\n"
+      "vld1.32 {d6}, [r1]!\n"
+      "vld1.16 {d7[0]}, [r1]!\n"
+      "vld1.8 {d7[2]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q4, d7\n"
+      "vmovl.u8 q3, d6\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q5, d8\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q4, d7\n"
+      "vmovl.s16 q3, d6\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q11\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q10\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q0, q0, q3\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4}, [%[output]]!\n"
+      "vst1.32 {d5[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              12>::Transform(const uint8_t* input,
+                                             const BiasAdd<uint8_t>& params,
+                                             int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "12>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #12\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "vld1.32 {d6}, [r1]!\n"
+      "vld1.32 {d7[0]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q4, d7\n"
+      "vmovl.u8 q3, d6\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q5, d8\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q4, d7\n"
+      "vmovl.s16 q3, d6\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q11\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q10\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q0, q0, q3\n"
+      "vadd.f32 q1, q1, q4\n"
+      "vadd.f32 q2, q2, q5\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              13>::Transform(const uint8_t* input,
+                                             const BiasAdd<uint8_t>& params,
+                                             int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "13>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #13\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "vld1.8 {d1[4]}, [%[input]]!\n"
+      "vld1.32 {d8}, [r1]!\n"
+      "vld1.32 {d9[0]}, [r1]!\n"
+      "vld1.8 {d9[4]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5}, [%[output]]!\n"
+      "vst1.32 {d6[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              14>::Transform(const uint8_t* input,
+                                             const BiasAdd<uint8_t>& params,
+                                             int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "14>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #14\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
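+      // Handle leftovers.
+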
+      // BiasAdd::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "vld1.16 {d1[2]}, [%[input]]!\n"
+      "vld1.32 {d8}, [r1]!\n"
+      "vld1.32 {d9[0]}, [r1]!\n"
+      "vld1.16 {d9[2]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              15>::Transform(const uint8_t* input,
+                                             const BiasAdd<uint8_t>& params,
+                                             int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "15>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr r0, %[input_range_min]\n"
+      "vdup.32 q8, r0\n"
+      "ldr r0, %[input_range_scale]\n"
+      "vdup.32 q9, r0\n"
+      "ldr r0, %[bias_range_min]\n"
+      "vdup.32 q10, r0\n"
+      "ldr r0, %[bias_range_scale]\n"
+      "vdup.32 q11, r0\n"
+      "ldr r0, %[output_range_min]\n"
+      "vdup.32 q12, r0\n"
+      "ldr r0, %[one_over_output_range_scale]\n"
+      "vdup.32 q13, r0\n"
+      "ldr r0, %[output_range_offset]\n"
+      "vdup.32 q14, r0\n"
+      "1:"
+      "mov r0, %[count]\n"
+      "mov r1, %[bias]\n"
+      "subs r0, r0, #15\n"
+      "beq 3f\n"
+      "2:"
+      "subs r0, r0, #16\n"
+
+      // BiasAdd::Transform
+      "vld1.32 {d0, d1}, [%[input]]!\n"
+      "vld1.32 {d8, d9}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6, d7}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "bne 2b\n"
+      "3:"
+
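+      // Handle leftovers.
+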
+      // BiasAdd::Transform
+      "vld1.32 {d0}, [%[input]]!\n"
+      "vld1.32 {d1[0]}, [%[input]]!\n"
+      "vld1.16 {d1[2]}, [%[input]]!\n"
+      "vld1.8 {d1[6]}, [%[input]]!\n"
+      "vld1.32 {d8}, [r1]!\n"
+      "vld1.32 {d9[0]}, [r1]!\n"
+      "vld1.16 {d9[2]}, [r1]!\n"
+      "vld1.8 {d9[6]}, [r1]!\n"
+      "pld [%[input], #32]\n"
+      "vmovl.u8 q1, d1\n"
+      "vmovl.u8 q0, d0\n"
+      "vmovl.u8 q5, d9\n"
+      "vmovl.u8 q4, d8\n"
+      "vmovl.s16 q3, d3\n"
+      "vmovl.s16 q2, d2\n"
+      "vmovl.s16 q7, d11\n"
+      "vmovl.s16 q6, d10\n"
+      "vmovl.s16 q1, d1\n"
+      "vmovl.s16 q0, d0\n"
+      "vmovl.s16 q5, d9\n"
+      "vmovl.s16 q4, d8\n"
+      "vcvt.f32.s32 q0, q0\n"
+      "vcvt.f32.s32 q1, q1\n"
+      "vcvt.f32.s32 q2, q2\n"
+      "vcvt.f32.s32 q3, q3\n"
+      "vcvt.f32.s32 q4, q4\n"
+      "vcvt.f32.s32 q5, q5\n"
+      "vcvt.f32.s32 q6, q6\n"
+      "vcvt.f32.s32 q7, q7\n"
+      "vmul.f32 q0, q0, q9\n"
+      "vmul.f32 q1, q1, q9\n"
+      "vmul.f32 q2, q2, q9\n"
+      "vmul.f32 q3, q3, q9\n"
+      "vmul.f32 q4, q4, q11\n"
+      "vmul.f32 q5, q5, q11\n"
+      "vmul.f32 q6, q6, q11\n"
+      "vmul.f32 q7, q7, q11\n"
+      "vadd.f32 q0, q0, q8\n"
+      "vadd.f32 q1, q1, q8\n"
+      "vadd.f32 q2, q2, q8\n"
+      "vadd.f32 q3, q3, q8\n"
+      "vadd.f32 q4, q4, q10\n"
+      "vadd.f32 q5, q5, q10\n"
+      "vadd.f32 q6, q6, q10\n"
+      "vadd.f32 q7, q7, q10\n"
+      "vadd.f32 q0, q0, q4\n"
+      "vadd.f32 q1, q1, q5\n"
+      "vadd.f32 q2, q2, q6\n"
+      "vadd.f32 q3, q3, q7\n"
+      "vsub.f32 q0, q0, q12\n"
+      "vsub.f32 q1, q1, q12\n"
+      "vsub.f32 q2, q2, q12\n"
+      "vsub.f32 q3, q3, q12\n"
+      "vmul.f32 q0, q0, q13\n"
+      "vmul.f32 q1, q1, q13\n"
+      "vmul.f32 q2, q2, q13\n"
+      "vmul.f32 q3, q3, q13\n"
+      "vadd.f32 q0, q0, q14\n"
+      "vadd.f32 q1, q1, q14\n"
+      "vadd.f32 q2, q2, q14\n"
+      "vadd.f32 q3, q3, q14\n"
+      "vcvt.s32.f32 q0, q0\n"
+      "vcvt.s32.f32 q1, q1\n"
+      "vcvt.s32.f32 q2, q2\n"
+      "vcvt.s32.f32 q3, q3\n"
+
+      "vst1.32 {d0, d1, d2, d3}, [%[output]]!\n"
+      "vst1.32 {d4, d5, d6}, [%[output]]!\n"
+      "vst1.32 {d7[0]}, [%[output]]!\n"
+      "pld [%[output]]\n"
+      "subs %[rows], %[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "r0", "r1", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
+        "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
+        "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27", "d28", "d29",
+        "cc", "memory");
+}
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#else
+#warning "Meta gemm for arm32 requires: GEMMLOWP_NEON_32!"
+#endif
+
+#endif  // GEMMLOWP_META_TRANSFORM_KERNELS_ARM_32_H_
diff --git a/meta/transform_kernels_arm_64.h b/meta/transform_kernels_arm_64.h
new file mode 100644
index 0000000..c4d43ff
--- /dev/null
+++ b/meta/transform_kernels_arm_64.h
@@ -0,0 +1,7965 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#ifndef GEMMLOWP_META_TRANSFORM_KERNELS_ARM_64_H_
+#define GEMMLOWP_META_TRANSFORM_KERNELS_ARM_64_H_
+
+#ifdef GEMMLOWP_NEON_64
+
+#include <cassert>
+#include <cstdint>
+
+namespace gemmlowp {
+namespace meta {
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 0>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 0>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 1>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 1>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.b}[0], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 2>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 2>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.h}[0], [%x[output]], #2\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 3>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 3>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.h}[0], [%x[output]], #2\n"
+      "st1 {v0.b}[2], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 4>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 4>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 5>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 5>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v1.s}[0], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "st1 {v0.b}[4], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 6>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 6>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v1.2s}, [%x[input]], #8\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "st1 {v0.h}[2], [%x[output]], #2\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 7>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 7>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v1.2s}, [%x[input]], #8\n"
+      "ld1 {v1.s}[2], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "st1 {v0.h}[2], [%x[output]], #2\n"
+      "st1 {v0.b}[6], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 8>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 8>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #8\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s}, [%x[input]], #32\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 9>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 9>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #9\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s}, [%x[input]], #32\n"
+      "ld1 {v2.s}[0], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.b}[8], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 10>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 10>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #10\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s}, [%x[input]], #32\n"
+      "ld1 {v2.2s}, [%x[input]], #8\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.h}[4], [%x[output]], #2\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 11>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 11>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #11\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s}, [%x[input]], #32\n"
+      "ld1 {v2.2s}, [%x[input]], #8\n"
+      "ld1 {v2.s}[2], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.h}[4], [%x[output]], #2\n"
+      "st1 {v0.b}[10], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 12>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 12>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #12\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s}, [%x[input]], #48\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 13>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 13>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #13\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s}, [%x[input]], #48\n"
+      "ld1 {v3.s}[0], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "st1 {v0.b}[12], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 14>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 14>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #14\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s}, [%x[input]], #48\n"
+      "ld1 {v3.2s}, [%x[input]], #8\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "st1 {v0.h}[6], [%x[output]], #2\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<int32_t, uint8_t, Requantize, 16, 15>::Transform(
+    const int32_t* input, const Requantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Requantize<int32_t, uint8_t, Requantize, 16, 15>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Requantize::Prepare
+      "dup v4.4s, %w[input_range_min]\n"
+      "dup v5.4s, %w[output_range_min]\n"
+      "dup v6.4s, %w[input_range_offset]\n"
+      "dup v7.4s, %w[input_range_scale]\n"
+      "dup v8.4s, %w[one_over_output_range_scale]\n"
+      "fsub v4.4s, v4.4s, v5.4s\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #15\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Requantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s}, [%x[input]], #48\n"
+      "ld1 {v3.2s}, [%x[input]], #8\n"
+      "ld1 {v3.s}[2], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v6.4s\n"
+      "fsub v1.4s, v1.4s, v6.4s\n"
+      "fsub v2.4s, v2.4s, v6.4s\n"
+      "fsub v3.4s, v3.4s, v6.4s\n"
+      "fmul v0.4s, v0.4s, v7.4s\n"
+      "fmul v1.4s, v1.4s, v7.4s\n"
+      "fmul v2.4s, v2.4s, v7.4s\n"
+      "fmul v3.4s, v3.4s, v7.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v8.4s\n"
+      "fmul v1.4s, v1.4s, v8.4s\n"
+      "fmul v2.4s, v2.4s, v8.4s\n"
+      "fmul v3.4s, v3.4s, v8.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "st1 {v0.h}[6], [%x[output]], #2\n"
+      "st1 {v0.b}[14], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [input_range_min] "r"(params.input_range_min),
+        [output_range_min] "r"(params.output_range_min),
+        [input_range_offset] "r"(params.input_range_offset),
+        [one_over_output_range_scale] "r"(params.one_over_output_range_scale),
+        [input_range_scale] "r"(params.input_range_scale)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "cc", "memory");
+}
+
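+// Quantize kernels: float -> uint8_t. Same structure as the Requantize
+// kernels above: 16 elements per main-loop iteration, with the final
+// template parameter naming the leftover count handled after label "2:".
+// A rough scalar reading of the per-element assembly, for reference only:
+//
+//   out = clamp_to_uint8(trunc((in - range_min) * range_scale
+//                              + range_offset));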
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 0>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 0>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 1>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 1>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.b}[0], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 2>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 2>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.h}[0], [%x[output]], #2\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 3>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 3>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.h}[0], [%x[output]], #2\n"
+      "st1 {v0.b}[2], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 4>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 4>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 5>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 5>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v1.s}[0], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "st1 {v0.b}[4], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 6>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 6>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v1.2s}, [%x[input]], #8\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "st1 {v0.h}[2], [%x[output]], #2\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 7>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 7>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v1.2s}, [%x[input]], #8\n"
+      "ld1 {v1.s}[2], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "st1 {v0.h}[2], [%x[output]], #2\n"
+      "st1 {v0.b}[6], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 8>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 8>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #8\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s}, [%x[input]], #32\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 9>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 9>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #9\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s}, [%x[input]], #32\n"
+      "ld1 {v2.s}[0], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.b}[8], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 10>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 10>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #10\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s}, [%x[input]], #32\n"
+      "ld1 {v2.2s}, [%x[input]], #8\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.h}[4], [%x[output]], #2\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 11>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 11>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #11\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s}, [%x[input]], #32\n"
+      "ld1 {v2.2s}, [%x[input]], #8\n"
+      "ld1 {v2.s}[2], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.h}[4], [%x[output]], #2\n"
+      "st1 {v0.b}[10], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 12>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 12>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #12\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s}, [%x[input]], #48\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 13>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 13>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #13\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s}, [%x[input]], #48\n"
+      "ld1 {v3.s}[0], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "st1 {v0.b}[12], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 14>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 14>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #14\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s}, [%x[input]], #48\n"
+      "ld1 {v3.2s}, [%x[input]], #8\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "st1 {v0.h}[6], [%x[output]], #2\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<float, uint8_t, Quantize, 16, 15>::Transform(
+    const float* input, const Quantize& params, uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Quantize<float, uint8_t, Quantize, 16, 15>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Quantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #15\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[input]], #64\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Quantize::Transform
+      "ld1 {v0.4s, v1.4s, v2.4s}, [%x[input]], #48\n"
+      "ld1 {v3.2s}, [%x[input]], #8\n"
+      "ld1 {v3.s}[2], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #64]\n"
+      "fsub v0.4s, v0.4s, v4.4s\n"
+      "fsub v1.4s, v1.4s, v4.4s\n"
+      "fsub v2.4s, v2.4s, v4.4s\n"
+      "fsub v3.4s, v3.4s, v4.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v5.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fadd v3.4s, v3.4s, v5.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+      "sqxtn v0.4h, v0.4s\n"
+      "sqxtn2 v0.8h, v1.4s\n"
+      "sqxtn v2.4h, v2.4s\n"
+      "sqxtn2 v2.8h, v3.4s\n"
+      "sqxtun v0.8b, v0.8h\n"
+      "sqxtun2 v0.16b, v2.8h\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "st1 {v0.h}[6], [%x[output]], #2\n"
+      "st1 {v0.b}[14], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
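+// Dequantize kernels: uint8_t -> float, the inverse direction of the
+// Quantize kernels above, with the same 16-element main loop and leftover
+// tail. A rough scalar reading of the per-element assembly, for reference
+// only:
+//
+//   out = (float(in) - range_offset) * range_scale + range_min;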
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 0>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 0>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
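+// Leftover handling, used by this and every other <..., 16, N> specialization
+// with N > 0: `count` is reduced by N up front, so the main loop's
+// `subs #16` reaches zero exactly when N trailing elements remain; those are
+// then processed after label 2 with narrow ld1/st1 lane accesses instead of
+// full 16-byte loads and stores.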
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 1>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 1>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.b}[0], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 2>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 2>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.h}[0], [%x[input]], #2\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 3>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 3>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.h}[0], [%x[input]], #2\n"
+      "ld1 {v0.b}[2], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 4>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 4>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 5>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 5>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "ld1 {v0.b}[4], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "st1 {v1.s}[0], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 6>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 6>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "ld1 {v0.h}[2], [%x[input]], #2\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "st1 {v1.2s}, [%x[output]], #8\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 7>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 7>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "ld1 {v0.h}[2], [%x[input]], #2\n"
+      "ld1 {v0.b}[6], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "st1 {v1.2s}, [%x[output]], #8\n"
+      "st1 {v1.s}[2], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 8>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 8>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #8\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s}, [%x[output]], #32\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 9>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 9>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #9\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.b}[8], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s}, [%x[output]], #32\n"
+      "st1 {v2.s}[0], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 10>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 10>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #10\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.h}[4], [%x[input]], #2\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s}, [%x[output]], #32\n"
+      "st1 {v2.2s}, [%x[output]], #8\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 11>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 11>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #11\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.h}[4], [%x[input]], #2\n"
+      "ld1 {v0.b}[10], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s}, [%x[output]], #32\n"
+      "st1 {v2.2s}, [%x[output]], #8\n"
+      "st1 {v2.s}[2], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 12>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 12>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #12\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s}, [%x[output]], #48\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 13>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 13>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #13\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "ld1 {v0.b}[12], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s}, [%x[output]], #48\n"
+      "st1 {v3.s}[0], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 14>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 14>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #14\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "ld1 {v0.h}[6], [%x[input]], #2\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s}, [%x[output]], #48\n"
+      "st1 {v3.2s}, [%x[output]], #8\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, float, Dequantize, 16, 15>::Transform(
+    const uint8_t* input, const Dequantize& params, float* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") Dequantize<uint8_t, float, Dequantize, 16, 15>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // Dequantize::Prepare
+      "dup v4.4s, %w[range_min]\n"
+      "dup v5.4s, %w[range_offset]\n"
+      "dup v6.4s, %w[range_scale]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #15\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // Dequantize::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // Dequantize::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "ld1 {v0.h}[6], [%x[input]], #2\n"
+      "ld1 {v0.b}[14], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v5.4s\n"
+      "fsub v1.4s, v1.4s, v5.4s\n"
+      "fsub v2.4s, v2.4s, v5.4s\n"
+      "fsub v3.4s, v3.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v6.4s\n"
+      "fmul v1.4s, v1.4s, v6.4s\n"
+      "fmul v2.4s, v2.4s, v6.4s\n"
+      "fmul v3.4s, v3.4s, v6.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v4.4s\n"
+      "fadd v3.4s, v3.4s, v4.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s}, [%x[output]], #48\n"
+      "st1 {v3.2s}, [%x[output]], #8\n"
+      "st1 {v3.s}[2], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [range_offset] "r"(params.range_offset),
+        [range_scale] "r"(params.range_scale), [range_min] "r"(params.range_min)
+      : "v0", "v1", "v2", "v3", "v4", "v5", "v6", "cc", "memory");
+}
+
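+// The MinMax kernels below simply clamp each byte to the [params.min,
+// params.max] range (umax against the broadcast minimum in v4, then umin
+// against the broadcast maximum in v5); per element, in scalar form:
+//
+//   output[i] = std::min(std::max(input[i], params.min), params.max);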
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              0>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "0>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              1>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "1>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #1\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.b}[0], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.b}[0], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              2>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "2>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #2\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.h}[0], [%x[input]], #2\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.h}[0], [%x[output]], #2\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              3>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "3>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #3\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.h}[0], [%x[input]], #2\n"
+      "ld1 {v0.b}[2], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.h}[0], [%x[output]], #2\n"
+      "st1 {v0.b}[2], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              4>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "4>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #4\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              5>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "5>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #5\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "ld1 {v0.b}[4], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "st1 {v0.b}[4], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              6>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "6>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #6\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "ld1 {v0.h}[2], [%x[input]], #2\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "st1 {v0.h}[2], [%x[output]], #2\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              7>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "7>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #7\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "ld1 {v0.h}[2], [%x[input]], #2\n"
+      "ld1 {v0.b}[6], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "st1 {v0.h}[2], [%x[output]], #2\n"
+      "st1 {v0.b}[6], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              8>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "8>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #8\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              9>::Transform(const uint8_t* input,
+                                            const MinMax<uint8_t>& params,
+                                            uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "9>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #9\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.b}[8], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.b}[8], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              10>::Transform(const uint8_t* input,
+                                             const MinMax<uint8_t>& params,
+                                             uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "10>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #10\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.h}[4], [%x[input]], #2\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.h}[4], [%x[output]], #2\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              11>::Transform(const uint8_t* input,
+                                             const MinMax<uint8_t>& params,
+                                             uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "11>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #11\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.h}[4], [%x[input]], #2\n"
+      "ld1 {v0.b}[10], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.h}[4], [%x[output]], #2\n"
+      "st1 {v0.b}[10], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              12>::Transform(const uint8_t* input,
+                                             const MinMax<uint8_t>& params,
+                                             uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "12>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #12\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              13>::Transform(const uint8_t* input,
+                                             const MinMax<uint8_t>& params,
+                                             uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "13>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #13\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "ld1 {v0.b}[12], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "st1 {v0.b}[12], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              14>::Transform(const uint8_t* input,
+                                             const MinMax<uint8_t>& params,
+                                             uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "14>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #14\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "ld1 {v0.h}[6], [%x[input]], #2\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "st1 {v0.h}[6], [%x[output]], #2\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, uint8_t, MinMax<uint8_t>, 16,
+                              15>::Transform(const uint8_t* input,
+                                             const MinMax<uint8_t>& params,
+                                             uint8_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") MinMax<uint8_t><uint8_t, uint8_t, MinMax<uint8_t>, 16, "
+               "15>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_count_copy = params.count;
+  asm volatile(
+
+      // MinMax::Prepare
+      "dup v4.16b, %w[min]\n"
+      "dup v5.16b, %w[max]\n"
+
+      // Reduce count by leftovers.
+      "subs %x[count], %x[count], #15\n"
+      "beq 2f\n"
+
+      "1:"
+      "subs %x[count], %x[count], #16\n"
+
+      // MinMax::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+
+      "bne 1b\n"
+      "2:"
+
+      // Handle leftovers.
+
+      // MinMax::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "ld1 {v0.h}[6], [%x[input]], #2\n"
+      "ld1 {v0.b}[14], [%x[input]], #1\n"
+      "prfm pldl1keep, [%x[input], #16]\n"
+      "umax v0.16b, v0.16b, v4.16b\n"
+      "umin v0.16b, v0.16b, v5.16b\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "st1 {v0.h}[6], [%x[output]], #2\n"
+      "st1 {v0.b}[14], [%x[output]], #1\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      : [count] "+r"(params_count_copy), [input] "+r"(input),
+        [output] "+r"(output)
+      : [max] "r"(params.max), [min] "r"(params.min)
+      : "v0", "v4", "v5", "cc", "memory");
+}
+
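+// The BiasAdd<uint8_t> specializations below add a quantized bias row to each
+// of params.rows rows of a quantized input. The seven range constants are
+// broadcast once into v8..v14 (input/bias range min and scale, output range
+// min, its reciprocal scale, and the output offset); x0 counts down the
+// remaining elements of the current row and x1 walks the bias row, both reset
+// at label "1:" for every input row. An illustrative scalar equivalent of the
+// per-element work in the vectorized body (params fields are real;
+// row_input/row_output are just names for the current row's data) is:
+//
+//   // for element j of the current row:
+//   float in = row_input[j] * params.input_range_scale + params.input_range_min;
+//   float bi = bias[j] * params.bias_range_scale + params.bias_range_min;
+//   row_output[j] = static_cast<int32_t>(
+//       (in + bi - params.output_range_min) *
+//           params.one_over_output_range_scale +
+//       params.output_range_offset);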
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              0>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "0>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
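+// Non-zero leftover counts (the specializations from here on) append a tail
+// after label "3:" that handles the remaining 1..15 elements of each row with
+// partial loads of the input and bias and a store of the matching number of
+// int32 results, before looping back to "1:" for the next row.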
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              1>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "1>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #1\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.b}[0], [%x[input]], #1\n"
+      "ld1 {v1.b}[0], [x1], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl v1.8h, v1.8b\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl v1.4s, v1.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v1.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+
+      "st1 {v0.s}[0], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              2>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "2>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #2\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.h}[0], [%x[input]], #2\n"
+      "ld1 {v1.h}[0], [x1], #2\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl v1.8h, v1.8b\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl v1.4s, v1.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v1.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              3>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "3>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #3\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.h}[0], [%x[input]], #2\n"
+      "ld1 {v0.b}[2], [%x[input]], #1\n"
+      "ld1 {v1.h}[0], [x1], #2\n"
+      "ld1 {v1.b}[2], [x1], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl v1.8h, v1.8b\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl v1.4s, v1.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v1.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+
+      "st1 {v0.2s}, [%x[output]], #8\n"
+      "st1 {v0.s}[2], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              4>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "4>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #4\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "ld1 {v1.s}[0], [x1], #4\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl v1.8h, v1.8b\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl v1.4s, v1.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v1.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              5>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "5>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #5\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "ld1 {v0.b}[4], [%x[input]], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.b}[4], [x1], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl v2.8h, v2.8b\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v3.4s, v2.8h\n"
+      "sxtl v2.4s, v2.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v11.4s\n"
+      "fmul v3.4s, v3.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v10.4s\n"
+      "fadd v3.4s, v3.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v2.4s\n"
+      "fadd v1.4s, v1.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "st1 {v1.s}[0], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              6>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "6>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #6\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "ld1 {v0.h}[2], [%x[input]], #2\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl v2.8h, v2.8b\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v3.4s, v2.8h\n"
+      "sxtl v2.4s, v2.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v11.4s\n"
+      "fmul v3.4s, v3.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v10.4s\n"
+      "fadd v3.4s, v3.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v2.4s\n"
+      "fadd v1.4s, v1.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "st1 {v1.2s}, [%x[output]], #8\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              7>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "7>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #7\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.s}[0], [%x[input]], #4\n"
+      "ld1 {v0.h}[2], [%x[input]], #2\n"
+      "ld1 {v0.b}[6], [%x[input]], #1\n"
+      "ld1 {v2.s}[0], [x1], #4\n"
+      "ld1 {v2.h}[2], [x1], #2\n"
+      "ld1 {v2.b}[6], [x1], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl v2.8h, v2.8b\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v3.4s, v2.8h\n"
+      "sxtl v2.4s, v2.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v11.4s\n"
+      "fmul v3.4s, v3.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v10.4s\n"
+      "fadd v3.4s, v3.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v2.4s\n"
+      "fadd v1.4s, v1.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+
+      "st1 {v0.4s}, [%x[output]], #16\n"
+      "st1 {v1.2s}, [%x[output]], #8\n"
+      "st1 {v1.s}[2], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              8>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "8>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #8\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v2.2s}, [x1], #8\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl v2.8h, v2.8b\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v3.4s, v2.8h\n"
+      "sxtl v2.4s, v2.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v11.4s\n"
+      "fmul v3.4s, v3.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v10.4s\n"
+      "fadd v3.4s, v3.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v2.4s\n"
+      "fadd v1.4s, v1.4s, v3.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+
+      "st1 {v0.4s, v1.4s}, [%x[output]], #32\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              9>::Transform(const uint8_t* input,
+                                            const BiasAdd<uint8_t>& params,
+                                            int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "9>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #9\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.b}[8], [%x[input]], #1\n"
+      "ld1 {v3.2s}, [x1], #8\n"
+      "ld1 {v3.b}[8], [x1], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v4.8h, v3.16b\n"
+      "uxtl v3.8h, v3.8b\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl v5.4s, v4.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v4.4s, v3.8h\n"
+      "sxtl v3.4s, v3.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v11.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v10.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v3.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+
+      "st1 {v0.4s, v1.4s}, [%x[output]], #32\n"
+      "st1 {v2.s}[0], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              10>::Transform(const uint8_t* input,
+                                             const BiasAdd<uint8_t>& params,
+                                             int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "10>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #10\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.h}[4], [%x[input]], #2\n"
+      "ld1 {v3.2s}, [x1], #8\n"
+      "ld1 {v3.h}[4], [x1], #2\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v4.8h, v3.16b\n"
+      "uxtl v3.8h, v3.8b\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl v5.4s, v4.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v4.4s, v3.8h\n"
+      "sxtl v3.4s, v3.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v11.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v10.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v3.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+
+      "st1 {v0.4s, v1.4s}, [%x[output]], #32\n"
+      "st1 {v2.2s}, [%x[output]], #8\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              11>::Transform(const uint8_t* input,
+                                             const BiasAdd<uint8_t>& params,
+                                             int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "11>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #11\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.h}[4], [%x[input]], #2\n"
+      "ld1 {v0.b}[10], [%x[input]], #1\n"
+      "ld1 {v3.2s}, [x1], #8\n"
+      "ld1 {v3.h}[4], [x1], #2\n"
+      "ld1 {v3.b}[10], [x1], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v4.8h, v3.16b\n"
+      "uxtl v3.8h, v3.8b\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl v5.4s, v4.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v4.4s, v3.8h\n"
+      "sxtl v3.4s, v3.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v11.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v10.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v3.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+
+      "st1 {v0.4s, v1.4s}, [%x[output]], #32\n"
+      "st1 {v2.2s}, [%x[output]], #8\n"
+      "st1 {v2.s}[2], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              12>::Transform(const uint8_t* input,
+                                             const BiasAdd<uint8_t>& params,
+                                             int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "12>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #12\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "ld1 {v3.2s}, [x1], #8\n"
+      "ld1 {v3.s}[2], [x1], #4\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v4.8h, v3.16b\n"
+      "uxtl v3.8h, v3.8b\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl v5.4s, v4.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v4.4s, v3.8h\n"
+      "sxtl v3.4s, v3.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v11.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v10.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v3.4s\n"
+      "fadd v1.4s, v1.4s, v4.4s\n"
+      "fadd v2.4s, v2.4s, v5.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s}, [%x[output]], #48\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              13>::Transform(const uint8_t* input,
+                                             const BiasAdd<uint8_t>& params,
+                                             int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "13>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #13\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "ld1 {v0.b}[12], [%x[input]], #1\n"
+      "ld1 {v4.2s}, [x1], #8\n"
+      "ld1 {v4.s}[2], [x1], #4\n"
+      "ld1 {v4.b}[12], [x1], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s}, [%x[output]], #48\n"
+      "st1 {v3.s}[0], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              14>::Transform(const uint8_t* input,
+                                             const BiasAdd<uint8_t>& params,
+                                             int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "14>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #14\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "ld1 {v0.h}[6], [%x[input]], #2\n"
+      "ld1 {v4.2s}, [x1], #8\n"
+      "ld1 {v4.s}[2], [x1], #4\n"
+      "ld1 {v4.h}[6], [x1], #2\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s}, [%x[output]], #48\n"
+      "st1 {v3.2s}, [%x[output]], #8\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+template <>
+inline void Transform1DKernel<uint8_t, int32_t, BiasAdd<uint8_t>, 16,
+                              15>::Transform(const uint8_t* input,
+                                             const BiasAdd<uint8_t>& params,
+                                             int32_t* output) {
+#ifdef DEBUG
+#ifdef DEBUG_METAGEMM_VERBOSE
+  std::cout << __FILE__ << "(" << __LINE__
+            << ") BiasAdd<uint8_t><uint8_t, int32_t, BiasAdd<uint8_t>, 16, "
+               "15>::Transform()"
+            << std::endl
+            << std::flush;
+#endif
+#endif
+  int params_rows_copy = params.rows;
+  asm volatile(
+      "ldr w0, %[input_range_min]\n"
+      "dup v8.4s, w0\n"
+      "ldr w0, %[input_range_scale]\n"
+      "dup v9.4s, w0\n"
+      "ldr w0, %[bias_range_min]\n"
+      "dup v10.4s, w0\n"
+      "ldr w0, %[bias_range_scale]\n"
+      "dup v11.4s, w0\n"
+      "ldr w0, %[output_range_min]\n"
+      "dup v12.4s, w0\n"
+      "ldr w0, %[one_over_output_range_scale]\n"
+      "dup v13.4s, w0\n"
+      "ldr w0, %[output_range_offset]\n"
+      "dup v14.4s, w0\n"
+      "1:"
+      "mov x0, %x[count]\n"
+      "mov x1, %x[bias]\n"
+      "subs x0, x0, #15\n"
+      "beq 3f\n"
+      "2:"
+      "subs x0, x0, #16\n"
+
+      // BiasAdd::Transform
+      "ld1 {v0.4s}, [%x[input]], #16\n"
+      "ld1 {v4.4s}, [x1], #16\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s, v3.4s}, [%x[output]], #64\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "bne 2b\n"
+      "3:"
+
+      // BiasAdd::Transform
+      "ld1 {v0.2s}, [%x[input]], #8\n"
+      "ld1 {v0.s}[2], [%x[input]], #4\n"
+      "ld1 {v0.h}[6], [%x[input]], #2\n"
+      "ld1 {v0.b}[14], [%x[input]], #1\n"
+      "ld1 {v4.2s}, [x1], #8\n"
+      "ld1 {v4.s}[2], [x1], #4\n"
+      "ld1 {v4.h}[6], [x1], #2\n"
+      "ld1 {v4.b}[14], [x1], #1\n"
+      "prfm pldl1keep, [%x[input], #32]\n"
+      "uxtl2 v1.8h, v0.16b\n"
+      "uxtl v0.8h, v0.8b\n"
+      "uxtl2 v5.8h, v4.16b\n"
+      "uxtl v4.8h, v4.8b\n"
+      "sxtl2 v3.4s, v1.8h\n"
+      "sxtl v2.4s, v1.4h\n"
+      "sxtl2 v7.4s, v5.8h\n"
+      "sxtl v6.4s, v5.4h\n"
+      "sxtl2 v1.4s, v0.8h\n"
+      "sxtl v0.4s, v0.4h\n"
+      "sxtl2 v5.4s, v4.8h\n"
+      "sxtl v4.4s, v4.4h\n"
+      "scvtf v0.4s, v0.4s\n"
+      "scvtf v1.4s, v1.4s\n"
+      "scvtf v2.4s, v2.4s\n"
+      "scvtf v3.4s, v3.4s\n"
+      "scvtf v4.4s, v4.4s\n"
+      "scvtf v5.4s, v5.4s\n"
+      "scvtf v6.4s, v6.4s\n"
+      "scvtf v7.4s, v7.4s\n"
+      "fmul v0.4s, v0.4s, v9.4s\n"
+      "fmul v1.4s, v1.4s, v9.4s\n"
+      "fmul v2.4s, v2.4s, v9.4s\n"
+      "fmul v3.4s, v3.4s, v9.4s\n"
+      "fmul v4.4s, v4.4s, v11.4s\n"
+      "fmul v5.4s, v5.4s, v11.4s\n"
+      "fmul v6.4s, v6.4s, v11.4s\n"
+      "fmul v7.4s, v7.4s, v11.4s\n"
+      "fadd v0.4s, v0.4s, v8.4s\n"
+      "fadd v1.4s, v1.4s, v8.4s\n"
+      "fadd v2.4s, v2.4s, v8.4s\n"
+      "fadd v3.4s, v3.4s, v8.4s\n"
+      "fadd v4.4s, v4.4s, v10.4s\n"
+      "fadd v5.4s, v5.4s, v10.4s\n"
+      "fadd v6.4s, v6.4s, v10.4s\n"
+      "fadd v7.4s, v7.4s, v10.4s\n"
+      "fadd v0.4s, v0.4s, v4.4s\n"
+      "fadd v1.4s, v1.4s, v5.4s\n"
+      "fadd v2.4s, v2.4s, v6.4s\n"
+      "fadd v3.4s, v3.4s, v7.4s\n"
+      "fsub v0.4s, v0.4s, v12.4s\n"
+      "fsub v1.4s, v1.4s, v12.4s\n"
+      "fsub v2.4s, v2.4s, v12.4s\n"
+      "fsub v3.4s, v3.4s, v12.4s\n"
+      "fmul v0.4s, v0.4s, v13.4s\n"
+      "fmul v1.4s, v1.4s, v13.4s\n"
+      "fmul v2.4s, v2.4s, v13.4s\n"
+      "fmul v3.4s, v3.4s, v13.4s\n"
+      "fadd v0.4s, v0.4s, v14.4s\n"
+      "fadd v1.4s, v1.4s, v14.4s\n"
+      "fadd v2.4s, v2.4s, v14.4s\n"
+      "fadd v3.4s, v3.4s, v14.4s\n"
+      "fcvtzs v0.4s, v0.4s\n"
+      "fcvtzs v1.4s, v1.4s\n"
+      "fcvtzs v2.4s, v2.4s\n"
+      "fcvtzs v3.4s, v3.4s\n"
+
+      "st1 {v0.4s, v1.4s, v2.4s}, [%x[output]], #48\n"
+      "st1 {v3.2s}, [%x[output]], #8\n"
+      "st1 {v3.s}[2], [%x[output]], #4\n"
+      "prfm pldl1keep, [%x[output]]\n"
+      "subs %x[rows], %x[rows], #1\n"
+      "bne 1b\n"
+      : [input] "+r"(input), [output] "+r"(output)
+      : [count] "r"(params.count), [rows] "r"(params_rows_copy),
+        [output_range_offset] "m"(params.output_range_offset),
+        [input_range_scale] "m"(params.input_range_scale),
+        [one_over_output_range_scale] "m"(params.one_over_output_range_scale),
+        [bias_range_min] "m"(params.bias_range_min),
+        [output_range_min] "m"(params.output_range_min),
+        [bias_range_scale] "m"(params.bias_range_scale),
+        [bias] "r"(params.bias), [input_range_min] "m"(params.input_range_min)
+      : "x0", "x1", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8", "v9",
+        "v10", "v11", "v12", "v13", "v14", "cc", "memory");
+}
+
+}  // namespace meta
+}  // namespace gemmlowp
+
+#else
+#warning "Meta gemm for arm64 requires: GEMMLOWP_NEON_64!"
+#endif
+
+#endif  // GEMMLOWP_META_TRANSFORM_KERNELS_ARM_64_H_
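The trailing template parameter of each `Transform1DKernel` specialization above is the per-row tail handled after the main 16-element vector loop; the specializations differ only in how that tail is loaded and stored. A scalar sketch of the loop structure they unroll, under the same float-parameter assumption as the lane sketch earlier (the bias pointer is rewound at the start of every row, mirroring `mov x1, %x[bias]` at label 1):

```cpp
#include <cstdint>

// Illustrative scalar equivalent of the generated BiasAdd kernels (a sketch,
// not the shipped implementation).
inline void BiasAddSketch(const std::uint8_t* input, const std::uint8_t* bias,
                          int rows, int count, float input_range_min,
                          float input_range_scale, float bias_range_min,
                          float bias_range_scale, float output_range_min,
                          float one_over_output_range_scale,
                          float output_range_offset, std::int32_t* output) {
  for (int r = 0; r < rows; ++r) {
    // 'count' elements per row; the bias is re-read from its start each row.
    for (int c = 0; c < count; ++c) {
      const float in =
          static_cast<float>(*input++) * input_range_scale + input_range_min;
      const float bi =
          static_cast<float>(bias[c]) * bias_range_scale + bias_range_min;
      // Same per-element formula as in the lane sketch above.
      *output++ = static_cast<std::int32_t>(
          (in + bi - output_range_min) * one_over_output_range_scale +
          output_range_offset);
    }
  }
}
```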
diff --git a/public/bit_depth.h b/public/bit_depth.h
index fcda3e1..6cb4ecf 100644
--- a/public/bit_depth.h
+++ b/public/bit_depth.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -12,113 +12,50 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 
-// bit_depth.h: defines the BitDepthSetting enum
+// bit_depth.h: defines the settings controlling LHS/RHS bit depth
 
 #ifndef GEMMLOWP_PUBLIC_BIT_DEPTH_H_
 #define GEMMLOWP_PUBLIC_BIT_DEPTH_H_
 
 namespace gemmlowp {
 
-// A specific bit depth to requantize an operand (Lhs or Rhs) to.
-// The case tBits==8 means no requantization, since at the moment
-// we only accept 8-bit input data.
-template <int tBits>
-struct BitDepth {
-  static const int kBits = tBits;
-  static_assert(kBits >= 1 && kBits <= 8, "bad bit depth");
+// The range of allowed values for an operand.
+template <int tMinValue, int tMaxValue>
+struct OperandRange {
+  static const int kMinValue = tMinValue;
+  static const int kMaxValue = tMaxValue;
+  static_assert(0 <= kMinValue, "");
+  static_assert(kMinValue < kMaxValue, "");
+  static_assert(kMaxValue <= 255, "");
 };
 
-// A rounding mode to use when requantizing an operand.
-// The requantizing operation is:
-//   dst = (src * maxval + rounding_offset) / 255;
-// Where dst and src are uint8, maxval is 2^(dstbits)-1,
-// and the intermediate values are computed as uint16s
-// so no overflow occurs.
-// The rounding_offset in the above formula is a value
-// in [0..254] determined by the RoundingMode as follows:
-enum class RoundingMode {
-  Exact,                  // No rounding, do nothing. Use with bit_depth == 8.
-  Nearest,                // rounding_offset = 127
-  ProbabilisticXorshift,  // rounding_offset given by 8-bit Xorshift PRNG
-  ProbabilisticAddmod     // rounding_offset given by 8-bit add/mod LDSG
+using Uint8Range = OperandRange<0, 255>;
+using Uint8RangeExcludingZero = OperandRange<1, 255>;
+
+template <typename tLhsRange, typename tRhsRange>
+struct BitDepthParams {
+  using LhsRange = tLhsRange;
+  using RhsRange = tRhsRange;
 };
 
-// A rounding strategy is a heuristic for choosing a rounding mode.
-// When the bit depth is 8 bit like the source, there is no
-// quantization to be done, so this is moot. In this case, we use
-// the following "no-op" "strategy",
-struct ExactRoundingStrategyFor8Bit {
-  static const RoundingMode kRoundingModeForSmallSizes = RoundingMode::Exact;
-  static const RoundingMode kRoundingModeForLargeSizes = RoundingMode::Exact;
-  static const int kRoundingModeSizeThreshold = 0;
-};
+// Default: LHS and RHS are 8bit.
+using DefaultL8R8BitDepthParams = BitDepthParams<Uint8Range, Uint8Range>;
 
-// Default rounding strategy when actually requantizing to less than 8 bit.
-// Round-to-nearest tends to give the best results for small enough
-// accumulation sizes (i.e. accumulation depth, but we refrain from using
-// the word "depth" here as it gets confusing with "bit depth").
-// Some flavor of probabilistic tends to perform better for larger sizes.
-// See doc/less-than-8-bit.txt for details.
-struct DefaultRoundingStrategyForLessThan8Bit {
-  static const RoundingMode kRoundingModeForSmallSizes = RoundingMode::Nearest;
-  static const RoundingMode kRoundingModeForLargeSizes =
-      RoundingMode::ProbabilisticAddmod;
+// Variant: LHS may not take the value 0. This allows using
+// faster kernels based on signed arithmetic; see
+// NEON_64bit_GEMM_Int8Operands_Int32Accumulators_AccumTwoWithin16Bits
+using L8R8WithLhsNonzeroBitDepthParams =
+    BitDepthParams<Uint8RangeExcludingZero, Uint8Range>;
 
-  // The threshold on the depth dimension at which we switch to
-  // probabilistic rounding instead of rounding-to-nearest when
-  // requantizing input data. Indeed, both statistical theory and
-  // empirical measurements show that for given input data and bit depth,
-  // probabilistic rounding gives more accurate results for large enough
-  // depth, while rounding-to-nearest does for smaller depth. This threshold
-  // is naively determined from some experiments with Inception at 7bit/5bit
-  // on a set of 10,000 images with 8-bit Xorshift probabilistic rounding:
-  //
-  //   7 bit weights, 5 bit activations, switch at 64:   59.82% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 128:  59.58% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 192:  63.37% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 256:  63.47% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 320:  63.71% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 384:  63.71% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 448:  63.58% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 512:  64.10% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 640:  62.49% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 768:  62.49% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 1024: 58.96% top-1 accuracy
-  //
-  // So here, 384 looks comfortably in the middle of a plateau of good values,
-  // and it's a roundish number (3/2 * 256) so let's stick with that for now.
-  // It would be nice to work out the theory of this, and understand how this
-  // should depend on the distribution of inputs and the bit depth.
-  //
-  // Repeating the same evaluation with AddMod:
-  //   7 bit weights, 5 bit activations, switch at 64:   62.65% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 128:  62.65% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 192:  63.81% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 256:  64.23% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 320:  64.16% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 384:  64.16% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 448:  64.16% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 512:  64.52% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 640:  62.74% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 768:  62.74% top-1 accuracy
-  //   7 bit weights, 5 bit activations, switch at 1024: 59.74% top-1 accuracy
-  //
-  // The behavior is similar, so 384 remains a good choice.
-
-  static const int kRoundingModeSizeThreshold = 384;
-};
-
-struct DefaultL8R8BitDepthParams {
-  typedef BitDepth<8> LhsBitDepth;
-  typedef BitDepth<8> RhsBitDepth;
-  typedef ExactRoundingStrategyFor8Bit RoundingStrategy;
-};
-
-struct DefaultL7R5BitDepthParams {
-  typedef BitDepth<7> LhsBitDepth;
-  typedef BitDepth<5> RhsBitDepth;
-  typedef DefaultRoundingStrategyForLessThan8Bit RoundingStrategy;
-};
+// Deprecated: when gemmlowp used to allow requantizing 8bit
+// inputs to less-than-8-bit depths, the public setting allowing
+// that was DefaultL7R5BitDepthParams. That requantization
+// feature has been removed, but as the whole point of that
+// requantization was to make less-than-8-bit an internal
+// optimization without any impact on the API (other than lowering
+// accuracy), we can temporarily support users who were using it
+// by mapping it to the default 8bit behavior.
+using DefaultL7R5BitDepthParams = DefaultL8R8BitDepthParams;
 
 }  // namespace gemmlowp
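A hedged usage sketch of how these settings are consumed: the chosen alias is passed as the `BitDepthParams` template argument of the public entry points in `public/gemmlowp.h` below. Matrix sizes and contents here are purely illustrative; with an empty output pipeline the raw int32 accumulators are written to the result map.

```cpp
#include <cstdint>
#include <tuple>
#include <vector>

#include "public/gemmlowp.h"  // path relative to the gemmlowp checkout

// Illustrative only: selecting L8R8WithLhsNonzeroBitDepthParams when the
// caller can guarantee that no LHS value is 0.
void ExampleGemm() {
  const int rows = 2, depth = 3, cols = 4;
  std::vector<std::uint8_t> lhs_data(rows * depth, 1);  // all nonzero
  std::vector<std::uint8_t> rhs_data(depth * cols, 1);
  std::vector<std::int32_t> result_data(rows * cols, 0);

  using gemmlowp::MapOrder;
  gemmlowp::MatrixMap<const std::uint8_t, MapOrder::RowMajor> lhs(
      lhs_data.data(), rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, MapOrder::ColMajor> rhs(
      rhs_data.data(), depth, cols);
  gemmlowp::MatrixMap<std::int32_t, MapOrder::ColMajor> result(
      result_data.data(), rows, cols);

  gemmlowp::GemmContext context;
  // Empty output pipeline: the raw int32 accumulators are stored in 'result'.
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::int32_t,
                                   gemmlowp::L8R8WithLhsNonzeroBitDepthParams>(
      &context, lhs, rhs, &result, /*lhs_offset=*/0, /*rhs_offset=*/0,
      std::make_tuple());
}
```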
 
diff --git a/public/gemmlowp.h b/public/gemmlowp.h
index c0724ea..05b0f47 100644
--- a/public/gemmlowp.h
+++ b/public/gemmlowp.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -16,77 +16,30 @@
 
 #ifndef GEMMLOWP_PUBLIC_GEMMLOWP_H_
 #define GEMMLOWP_PUBLIC_GEMMLOWP_H_
-#include "../internal/kernel_default.h"
-#include "../internal/multi_thread_gemm.h"
-#include "../internal/unpack.h"
+#include "../internal/dispatch_gemm_shape.h"
 #include "bit_depth.h"
 #include "map.h"
 #include "output_stages.h"
 
 namespace gemmlowp {
 
-inline bool IsRequantizationWorthIt(int rows, int cols) {
-  // We pack depth*(rows+cols) and compute depth*rows*cols.
-  // Thus the ratio of compute/packing cost is rows*cols/(rows+cols)
-  // In the square case rows==cols==N, it becomes N/2.
-  return 2 * rows * cols >= (rows + cols) * kMinimumWidthForRequantization;
-}
-
 class GemmContext : public MultiThreadGemmContext {};
 
 // Computes a general matrix product ("GEMM").
 // This is a version that supports per channel quantization.
 template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
           MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
-          typename LhsOffset, typename RhsOffset, typename OutputPipelineType>
-void GemmWithOutputPipelinePC(GemmContext* context,
+          typename LhsOffset, typename RhsOffset, typename OutputPipelineType,
+          typename GemmContextType>
+void GemmWithOutputPipelinePC(GemmContextType* context,
                               const MatrixMap<const InputScalar, LhsOrder>& lhs,
                               const MatrixMap<const InputScalar, RhsOrder>& rhs,
                               MatrixMap<OutputScalar, ResultOrder>* result,
                               const LhsOffset& lhs_offset,
                               const RhsOffset& rhs_offset,
                               const OutputPipelineType& output_pipeline) {
-  assert(lhs.cols() == rhs.rows());
-
-  int rows = result->rows();
-  int cols = result->cols();
-  int depth = lhs.cols();
-
-  if (rows == 0 || cols == 0 || depth == 0) {
-    // Vacuous GEMM, return early to avoid having to deal with
-    // zero sizes below.
-    return;
-  }
-
-  if (cols == 1) {
-    if (IsRequantizationWorthIt(rows, cols)) {
-      typedef DefaultKernel<KernelFamily::Gemv, BitDepthParams> Kernel;
-      MultiThreadGemm<typename Kernel::Format, InputScalar, OutputScalar,
-                      BitDepthParams>(context, Kernel(), lhs, rhs, result,
-                                      lhs_offset, rhs_offset, output_pipeline);
-    } else {
-      typedef DefaultKernel<KernelFamily::Gemv, DefaultL8R8BitDepthParams>
-          Kernel;
-      MultiThreadGemm<typename Kernel::Format, InputScalar, OutputScalar,
-                      DefaultL8R8BitDepthParams>(context, Kernel(), lhs, rhs,
-                                                 result, lhs_offset, rhs_offset,
-                                                 output_pipeline);
-    }
-  } else {
-    if (IsRequantizationWorthIt(rows, cols)) {
-      typedef DefaultKernel<KernelFamily::Gemm, BitDepthParams> Kernel;
-      MultiThreadGemm<typename Kernel::Format, InputScalar, OutputScalar,
-                      BitDepthParams>(context, Kernel(), lhs, rhs, result,
-                                      lhs_offset, rhs_offset, output_pipeline);
-    } else {
-      typedef DefaultKernel<KernelFamily::Gemm, DefaultL8R8BitDepthParams>
-          Kernel;
-      MultiThreadGemm<typename Kernel::Format, InputScalar, OutputScalar,
-                      DefaultL8R8BitDepthParams>(context, Kernel(), lhs, rhs,
-                                                 result, lhs_offset, rhs_offset,
-                                                 output_pipeline);
-    }
-  }
+  DispatchGemmShape<InputScalar, OutputScalar, BitDepthParams>(
+      context, lhs, rhs, result, lhs_offset, rhs_offset, output_pipeline);
 }
 
 // Computes a general matrix product ("GEMM").
@@ -96,16 +49,18 @@
 // (which is also implemented in the eight_bit_int_gemm directory).
 template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
           MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
-          typename OutputPipelineType>
-void GemmWithOutputPipeline(GemmContext* context,
+          typename OutputPipelineType, typename GemmContextType>
+void GemmWithOutputPipeline(GemmContextType* context,
                             const MatrixMap<const InputScalar, LhsOrder>& lhs,
                             const MatrixMap<const InputScalar, RhsOrder>& rhs,
                             MatrixMap<OutputScalar, ResultOrder>* result,
                             int lhs_offset, int rhs_offset,
                             const OutputPipelineType& output_pipeline) {
+  typedef VectorDup<const std::int32_t, VectorShape::Col> OffsetColDup;
+  typedef VectorDup<const std::int32_t, VectorShape::Row> OffsetRowDup;
   const OffsetColDup lhs_offset_vector(lhs_offset, lhs.rows());
   const OffsetRowDup rhs_offset_vector(rhs_offset, rhs.cols());
-  GemmWithOutputPipelinePC<InputScalar, OutputScalar, BitDepthParams>(
+  DispatchGemmShape<InputScalar, OutputScalar, BitDepthParams>(
       context, lhs, rhs, result, lhs_offset_vector, rhs_offset_vector,
       output_pipeline);
 }
@@ -115,8 +70,9 @@
 // parameters is the same as in the standard EightBitIntGemm interface
 // (which is also implemented in the eight_bit_int_gemm directory).
 template <typename Scalar, typename BitDepthParams, MapOrder LhsOrder,
-          MapOrder RhsOrder, MapOrder ResultOrder>
-void Gemm(GemmContext* context, const MatrixMap<const Scalar, LhsOrder>& lhs,
+          MapOrder RhsOrder, MapOrder ResultOrder, typename GemmContextType>
+void Gemm(GemmContextType* context,
+          const MatrixMap<const Scalar, LhsOrder>& lhs,
           const MatrixMap<const Scalar, RhsOrder>& rhs,
           MatrixMap<Scalar, ResultOrder>* result, int lhs_offset,
           int rhs_offset, int result_offset, int result_mult_int,
diff --git a/public/map.h b/public/map.h
index ce6428e..3073e05 100644
--- a/public/map.h
+++ b/public/map.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -19,7 +19,6 @@
 #define GEMMLOWP_PUBLIC_MAP_H_
 
 #include "../internal/common.h"
-#include "../internal/iterator.h"
 
 namespace gemmlowp {
 
@@ -41,6 +40,11 @@
 
  public:
   MatrixMap() : data_(nullptr), rows_(0), cols_(0), stride_(0) {}
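+  // Convenience constructor for tightly packed matrices: the stride is
+  // inferred from the storage order (rows for ColMajor, cols for RowMajor),
+  // so e.g. a column-major 3x4 map gets stride == 3.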
+  MatrixMap(Scalar* data, int rows, int cols)
+      : data_(data),
+        rows_(rows),
+        cols_(cols),
+        stride_(kOrder == MapOrder::ColMajor ? rows : cols) {}
   MatrixMap(Scalar* data, int rows, int cols, int stride)
       : data_(data), rows_(rows), cols_(cols), stride_(stride) {}
   MatrixMap(const MatrixMap& other)
@@ -95,6 +99,13 @@
   Scalar* data() const { return data_; }
   Scalar* data(int index) const { return data_ + index; }
   Scalar& operator()(int index) const { return *data(index); }
+
+  VectorMap block(int start, int len) const {
+    assert(start >= 0);
+    assert(start + len <= size_);
+
+    return VectorMap(data(start), len);
+  }
 };
 
 // A VectorDup is a (duplicated value) vector where all components are the same.
@@ -114,7 +125,14 @@
   VectorDup(const VectorDup& other) : data_(other.data_), size_(other.size_) {}
 
   int size() const { return size_; }
-  Scalar& operator()(int index) const { return data_; }
+  Scalar& operator()(int) const { return data_; }
+
+  VectorDup block(int start, int len) const {
+    assert(start >= 0);
+    assert(start + len <= size_);
+
+    return VectorDup(data_, len);
+  }
 };
 
 }  // namespace gemmlowp
diff --git a/public/output_stages.h b/public/output_stages.h
index c646e7f..23bcdc0 100644
--- a/public/output_stages.h
+++ b/public/output_stages.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -65,6 +65,58 @@
   std::int32_t result_shift;
 };
 
+// This output stage takes int32 values and returns still int32 values,
+// but "quantized down" to the uint8 scale; in other words, its output
+// is typically what one would then clamp to [0..255] and cast to uint8
+// (see OutputStageSaturatingCastToUint8).
+//
+// This "quantization down" process depends on 3 parameters,
+//   result_offset_after_shift, result_fixedpoint_multiplier, result_shift,
+// and the result is:
+//   ((FixedPointMul(input, result_fixedpoint_multiplier) +
+//   rounding) >> result_shift) + result_offset_after_shift
+// where
+//   rounding = (result_shift < 1) ? 0 : (1 << (result_shift - 1));
+// and where FixedPointMul(x, y) is the nearest integer to the following
+// mathematical expression, evaluated without overflow or intermediate
+// rounding:
+//   (x * y) / 2^31
+// In practice, it is expected that FixedPointMul will be implemented
+// using hardware "rounding doubling int32 multiply high" instructions,
+// such as VQRDMULH on ARM. See in fixedpoint.h the generic function,
+// SaturatingRoundingDoublingHighMul.
+//
+// Notice that the other difference from
+// OutputStageQuantizeDownInt32ToUint8Scale is that the result offset
+// is applied after the multiplier and shift, not before. This ensures
+// that no matter what the multiplier and shift are, the result offset
+// is effectively integral: offsetting the final result by an integer.
+// The motivation for this is to faithfully support quantization schemes
+// where the formula linking quantized values to the real mathematical
+// values that they represent, is of the form
+//
+//   real_value = scale * (quantized_value - zero_point)
+//
+// where scale is a real number (represented in quantized form by
+// result_fixedpoint_multiplier and result_shift) and zero_point
+// is an integer telling which quantized value corresponds to the
+// real value 0, and is represented here by (the opposite of)
+// result_offset_after_shift.
+// The motivation for such a quantization scheme, designed to
+// ensure that 0 is always a representable value, is that in
+// many applications, we need to 0-pad arrays and that can only be
+// done for quantized arrays if 0 is a representable value in
+// quantized form. In particular, convolution-like operations
+// are often implemented using 0-padding, or "im2col"-like
+// expansions that implicitly rely on 0-padding. If 0 were not
+// a representable value, such operations would have to pad
+// using a nonzero value, introducing bias in the computation.
+struct OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint {
+  std::int32_t result_fixedpoint_multiplier;
+  std::int32_t result_shift;
+  std::int32_t result_offset_after_shift;
+};
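+
+// Illustrative scalar reference for the formula above (not part of the
+// library; the FixedPointMul step is done here in int64 arithmetic rather
+// than with the saturating "multiply high" instructions mentioned above):
+//
+//   std::int64_t product =
+//       static_cast<std::int64_t>(input) * result_fixedpoint_multiplier;
+//   // Nearest integer to (input * multiplier) / 2^31.
+//   std::int32_t fixedpoint_mul =
+//       static_cast<std::int32_t>((product + (1ll << 30)) >> 31);
+//   std::int32_t rounding =
+//       (result_shift < 1) ? 0 : (1 << (result_shift - 1));
+//   std::int32_t output = ((fixedpoint_mul + rounding) >> result_shift) +
+//                         result_offset_after_shift;
+//
+// For example (made-up values): with result_fixedpoint_multiplier =
+// 1288490189 (about 0.6 * 2^31), result_shift = 5 and
+// result_offset_after_shift = 128, an input of 1000000 gives 600000 after
+// FixedPointMul, then (600000 + 16) >> 5 = 18750, and finally 18878, i.e.
+// an effective real scale of about 0.6 / 32 = 0.01875.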
+
 // This output stage takes int32 values that are expected to be already
 // on the final uint8 scale, but not necessarily in the [0..255] range.
 // It clamps them to the [0..255] range and returns them casted to uint8.
@@ -116,11 +168,10 @@
 template <VectorShape tShape>
 inline std::tuple<OutputStageQuantizeDownInt32ToUint8ScalePC<tShape>,
                   OutputStageSaturatingCastToUint8>
-MakeStandardOutputPipeline(const VectorMap<const std::int32_t, tShape>&
-                               result_offset,
-                           const VectorMap<const std::int32_t, tShape>&
-                               result_mult_int,
-                           std::int32_t result_shift) {
+MakeStandardOutputPipeline(
+    const VectorMap<const std::int32_t, tShape>& result_offset,
+    const VectorMap<const std::int32_t, tShape>& result_mult_int,
+    std::int32_t result_shift) {
   OutputStageQuantizeDownInt32ToUint8ScalePC<tShape> quantize_down_stage;
   quantize_down_stage.result_offset = result_offset;
   quantize_down_stage.result_mult_int = result_mult_int;
diff --git a/scripts/ci-before.sh b/scripts/ci-before.sh
new file mode 100755
index 0000000..ae2a16a
--- /dev/null
+++ b/scripts/ci-before.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
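+# Expects TEST and NDK_VERSION to be provided by the CI environment; the
+# NDK download and emulator setup below only run when TEST == "arm".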
+if [ "$TEST" == "arm" ]; then
+  curl -L https://dl.google.com/android/repository/android-ndk-${NDK_VERSION}-linux-x86_64.zip -O
+  unzip android-ndk-${NDK_VERSION}-linux-x86_64.zip 2> /dev/null > /dev/null
+  echo no | android create avd --force -n test -t android-22 --abi armeabi-v7a
+  emulator -avd test -no-audio -no-window &
+fi
diff --git a/scripts/ci-test.sh b/scripts/ci-test.sh
new file mode 100755
index 0000000..de6e344
--- /dev/null
+++ b/scripts/ci-test.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+if [ "$TEST" == "arm" ]; then
+  ./android-ndk-${NDK_VERSION}/ndk-build
+  android-wait-for-emulator
+  # adb shell input keyevent 82 &
+  adb push ./libs/* /data/local/tmp
+  adb shell /data/local/tmp/benchmark
+  adb shell /data/local/tmp/correctness_meta_gemm
+  # too slow
+  # adb shell /data/local/tmp/benchmark_meta_gemm
+fi
+if [ "$TEST" == "x86" ]; then
+  make -f Makefile.travis unittest
+fi
diff --git a/scripts/test-android.sh b/scripts/test-android.sh
index 444c4b6..66873a2 100755
--- a/scripts/test-android.sh
+++ b/scripts/test-android.sh
@@ -1,4 +1,5 @@
-# Copyright 2015 Google Inc. All Rights Reserved.
+#!/bin/bash
+# Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,8 +13,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-#!/bin/bash
-
 if [ -z "$CXX" ]
 then
   echo "please set the CXX environment variable to point to your native Android toolchain C++ compiler"
diff --git a/standalone/neon-gemm-kernel-benchmark.cc b/standalone/neon-gemm-kernel-benchmark.cc
new file mode 100644
index 0000000..2a936c1
--- /dev/null
+++ b/standalone/neon-gemm-kernel-benchmark.cc
@@ -0,0 +1,3746 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// This is a standalone testbed and benchmark for gemmlowp-style GEMM kernels,
+// doing either integer or float arithmetic.
+// It verifies that a kernel produces correct results, then benchmarks it.
+//
+// Some benchmark results are recorded in this spreadsheet:
+//
+// https://docs.google.com/spreadsheets/d/1UPbzbp9rdsD6RXxOr5q6AZ0n1omgEknLYO2ogiw6Kqk/edit?usp=sharing
+//
+// This program is entirely self-contained, and can be compiled manually
+// such as suggested in the command lines below.
+// It currently supports only Android/ARM but would trivially generalize to
+// other OSes (it's mostly standard POSIX) or architectures (each kernel
+// targets a specific architecture; one may simply add more).
+
+/*
+ Build and run this benchmark on Android/ARM/32bit:
+ ~/android/toolchains/arm-linux-androideabi/bin/arm-linux-androideabi-clang++ \
+ -fPIE -pie -O3 --std=c++11 standalone/neon-gemm-kernel-benchmark.cc -o \
+ /tmp/benchmark -mfloat-abi=softfp -mfpu=neon-vfpv4 && adb push /tmp/benchmark \
+ /data/local/tmp && adb shell /data/local/tmp/benchmark
+ Build and run this benchmark on Android/ARM/64bit:
+ ~/android/toolchains/aarch64-linux-android/bin/aarch64-linux-android-clang++ \
+ -fPIE -static -O3 --std=c++11 standalone/neon-gemm-kernel-benchmark.cc -o \
+ /tmp/benchmark && adb push /tmp/benchmark /data/local/tmp && adb shell \
+ /data/local/tmp/benchmark
+ */
+
+// For big.LITTLE devices, use 'taskset' to select which cores to benchmark.
+//
+// The syntax is: taskset <mask> <commandline>
+// where mask is a bitmask (written in hex) in which each bit corresponds
+// to a core, and the low bits are the little cores.
+//
+// Examples:
+// Nexus 5X big cores: taskset 30
+// Nexus 5X little cores: taskset 0f
+// Pixel XL big cores: taskset 0c
+// Pixel XL little cores: taskset 03
+//
+// Full example:
+// adb shell taskset 0c /data/local/tmp/benchmark
+
+#include <sched.h>
+#include <unistd.h>
+
+#include <algorithm>
+#include <cassert>
+#include <cstdint>
+#include <cstdlib>
+#include <iostream>
+#include <random>
+#include <type_traits>
+
+#if !defined __arm__ && !defined __aarch64__
+#error This benchmark assumes ARM (for inline assembly sections).
+#endif
+
+#include <arm_neon.h>
+
+// Typically one wants to fit in L1 cache, and GEMM implementations
+// are carefully optimized to tune their access patterns to that effect.
+// Most devices have at least 16k of L1 cache. The Kraits have exactly 16k.
+const int kDefaultCacheSizeK = 16;
+
+const int kCacheLineSize = 64;
+
+// These definitions are used for labels within assembly code. Required for
+// iOS toolchain compatibility.
+#define GEMMLOWP_LABEL_AFTER_LOOP "1"
+#define GEMMLOWP_LABEL_LOOP "2"
+#define GEMMLOWP_LABEL_ACCUMULATE_EXISTING_DST_VALUES "3"
+#define GEMMLOWP_LABEL_STORE "4"
+
+// BEGIN code copied from gemmlowp/internal/kernel.h
+
+// Explanation of general gemmlowp terminology
+// ===========================================
+//
+// We use the following abbreviations:
+// LHS = "left-hand side"
+// RHS = "right-hand side"
+// Sometimes when referring to either LHS or RHS, we just say a "Side".
+//
+// In a matrix product of a MxK matrix times a KxN matrix,
+// we call K the 'depth'. Note that M is the number of rows
+// of the result (and of the LHS), and N is the number of columns
+// of the result (and of the RHS).
+//
+// In each of the LHS and RHS matrices, we call 'width' the
+// other dimension, besides the depth. So in the LHS, 'width'
+// is the number of rows, while in the RHS, 'width' is the number
+// of columns.
+//
+// So in the LHS MxK matrix, the depth is K and the width is M.
+// And in the RHS KxN matrix, the depth is K and the width is N.
+//
+// This is illustrated in this picture:
+//
+//                             RHS width
+//                        <----------------->
+//                        +-----------------+ ^
+//                        |       RHS       | | Depth
+//                        +-----------------+ v
+//                 ^ +--+ +-----------------+
+//                 | |L | |                 |
+//       LHS width | |H | |      Result     |
+//                 | |S | |                 |
+//                 v +--+ +-----------------+
+//                   <-->
+//                   Depth
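+//
+// For example, multiplying a 100x50 LHS by a 50x30 RHS: the depth is 50,
+// the LHS width is 100, the RHS width is 30, and the result is 100x30.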
+
+// Explanation of gemmlowp kernel formats and "cells"
+// ==================================================
+//
+// Kernels operate on small LHS and RHS blocks that fit in registers.
+// These blocks are stored contiguously in memory, but not always
+// in a traditional column-major or row-major order; instead,
+// they consist of a number of sub-blocks, which we call "cells",
+// that are stored in column-major or row-major order. However,
+// what really matters to us is not so much rows vs columns, but
+// rather width vs depth. So we refer to "width-major" and "depth-major"
+// storage orders. In the LHS, width-major means row-major,
+// while in the RHS, width-major means column-major.
+// There is also a third possibility, "diagonal order",
+// which is unused at the moment.
+//
+// We aim to treat both sides, LHS and RHS, on an equal footing,
+// so we call them both 'sides'. A KernelFormat thus is just a pair
+// of KernelSideFormat's, one for LHS and one for RHS; each KernelSideFormat
+// contains a CellFormat and a number of cells; cells are only ever
+// stacked in the width dimension, which means stacked vertically in the
+// LHS and stacked horizontally in the RHS.
+//
+// Example
+// =======
+//
+// Let's work out the data layout expected by a kernel having the
+// following format (the struct names here are defined below in this file):
+//
+// KernelFormat<
+//   KernelSideFormat<CellFormat<3, 4>, 3>,
+//   KernelSideFormat<CellFormat<5, 4>, 2>
+// >
+//
+// The LHS format, KernelSideFormat<CellFormat<3, 4>, 3>, means:
+// 3 cells, each cell having dimensions (width=3, depth=4), laid out in
+// DepthMajor order (the default value, see CellFormat). In the LHS,
+// DepthMajor means column-major, so the LHS cells are of size 3x4 in
+// column-major order, so the LHS layout is:
+//
+// 0  3  6  9
+// 1  4  7  10
+// 2  5  8  11
+// 12 15 18 21
+// 13 16 19 22
+// 14 17 20 23
+// 24 27 30 33
+// 25 28 31 34
+// 26 29 32 35
+//
+// The RHS format, KernelSideFormat<CellFormat<5, 4>, 2>, means:
+// 2 cells each having dimensions (width=5, depth=4), laid out in
+// DepthMajor order (the default value, see CellFormat). In the RHS,
+// DepthMajor means row-major, so the RHS cells are of size 4x5 in
+// row-major order, so the RHS layout is:
+//
+// 0  1  2  3  4  20 21 22 23 24
+// 5  6  7  8  9  25 26 27 28 29
+// 10 11 12 13 14 30 31 32 33 34
+// 15 16 17 18 19 35 36 37 38 39
+
+// CellOrder enumerates the possible storage orders (=layouts) for
+// a cell (see explanation above).
+enum class CellOrder { DepthMajor, WidthMajor, Diagonal };
+
+// CellFormat describes how data is laid
+// out in a cell. That is, a CellOrder together with actual dimensions.
+template <int tWidth, int tDepth, CellOrder tOrder>
+struct CellFormat {
+  static const int kWidth = tWidth;
+  static const int kDepth = tDepth;
+  static const CellOrder kOrder = tOrder;
+
+  static const int kSize = kWidth * kDepth;
+};
+
+// KernelSideFormat describes how data is laid out in a kernel side
+// (i.e. LHS or RHS). That is, a CellFormat together with a number of
+// cells. These cells are always stacked in the Width dimension.
+// For example, in the LHS case, the Width dimension is the rows dimension,
+// so we're saying that in the LHS, cells are stacked vertically.
+// We never stack cells in the Depth dimension.
+template <typename tCellFormat, int tCells>
+struct KernelSideFormat {
+  typedef tCellFormat Cell;
+  static const int kCells = tCells;
+  static const int kWidth = kCells * Cell::kWidth;
+  static const int kDepth = Cell::kDepth;
+};
+
+// KernelFormat describes fully the input data layout that a kernel expects.
+// It consists of two KernelSideFormat's, one for LHS and one for RHS.
+template <typename tLhs, typename tRhs>
+struct KernelFormat {
+  typedef tLhs Lhs;
+  typedef tRhs Rhs;
+
+  static_assert(Lhs::Cell::kDepth == Rhs::Cell::kDepth, "");
+  static const int kDepth = Lhs::Cell::kDepth;
+  static const int kRows = Lhs::Cell::kWidth * Lhs::kCells;
+  static const int kCols = Rhs::Cell::kWidth * Rhs::kCells;
+};
+
+inline const char* CellOrderName(CellOrder o) {
+  switch (o) {
+    case CellOrder::DepthMajor:
+      return "DepthMajor";
+    case CellOrder::WidthMajor:
+      return "WidthMajor";
+    case CellOrder::Diagonal:
+      return "Diagonal";
+    default:
+      assert(false);
+      return nullptr;
+  }
+}
+
+// Returns the offset into a cell, at which a given coefficient is stored.
+template <typename CellFormat>
+inline int OffsetIntoCell(int w, int d) {
+  switch (CellFormat::kOrder) {
+    case CellOrder::DepthMajor:
+      return w + d * CellFormat::kWidth;
+    case CellOrder::WidthMajor:
+      return d + w * CellFormat::kDepth;
+    case CellOrder::Diagonal:
+      assert(CellFormat::kWidth == CellFormat::kDepth);
+      static const int size = CellFormat::kWidth;
+      return ((size + w - d) * size + d) % (size * size);
+    default:
+      assert(false);
+      return 0;
+  }
+}
+
+// END code copied from gemmlowp/internal/kernel.h
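+
+// Worked example (illustrative, not part of the copied code): with the
+// DepthMajor LHS cell format CellFormat<3, 4, CellOrder::DepthMajor> from
+// the Example section above, the coefficient at width w=2, depth d=1 is at
+// offset w + d*kWidth = 2 + 1*3 = 5, matching the '5' in the LHS layout
+// shown earlier. A WidthMajor cell of the same dimensions would place it at
+// offset d + w*kDepth = 1 + 2*4 = 9 instead.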
+
+#ifdef __arm__
+
+// This is the current standard kernel in gemmlowp, see:
+// https://github.com/google/gemmlowp/blob/b1e2a29ff866680028f3080efc244e10e8dd7f46/internal/kernel_neon.h#L33
+struct NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators {
+  typedef std::uint8_t OperandType;
+  typedef std::uint32_t AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 2, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 2, CellOrder::DepthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load 1 Rhs cell of size 2x4
+        "vld1.8 {d0}, [%[rhs_ptr]]!\n"
+        // Load 3 Lhs cells of size 4x2 each
+        "vld1.8 {d2}, [%[lhs_ptr]]!\n"
+        "vld1.8 {d4}, [%[lhs_ptr]]!\n"
+        "vld1.8 {d6}, [%[lhs_ptr]]!\n"
+        // Load accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vld1.32 {d8, d9},   [r0]!\n"
+        "vld1.32 {d16, d17}, [r0]!\n"
+        "vld1.32 {d24, d25}, [r0]!\n"
+        "vld1.32 {d10, d11}, [r0]!\n"
+        "vld1.32 {d18, d19}, [r0]!\n"
+        "vld1.32 {d26, d27}, [r0]!\n"
+        "vld1.32 {d12, d13}, [r0]!\n"
+        "vld1.32 {d20, d21}, [r0]!\n"
+        "vld1.32 {d28, d29}, [r0]!\n"
+        "vld1.32 {d14, d15}, [r0]!\n"
+        "vld1.32 {d22, d23}, [r0]!\n"
+        "vld1.32 {d30, d31}, [r0]!\n"
+
+        "subs %[depth], #2\n"
+
+        "beq " GEMMLOWP_LABEL_AFTER_LOOP "f\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+        // Overview of register layout:
+        //
+        // A 2x4 cell of Rhs is stored in 16bit in d0--d1 (q0).
+        // A 12x2 block of 3 4x2 cells Lhs is stored in 16bit in d2--d7
+        // (q1--q3).
+        // A 12x4 block of accumulators is stored in 32bit in q4--q15.
+        //
+        //                   +-----+-----+-----+-----+
+        //                   |d0[0]|d0[1]|d0[2]|d0[3]|
+        //              Rhs  +-----+-----+-----+-----+
+        //                   |d1[0]|d1[1]|d1[2]|d1[3]|
+        //                   +-----+-----+-----+-----+
+        //
+        //                   |     |     |     |     |
+        //
+        //    Lhs            |     |     |     |     |
+        //
+        //  +--+--+ - - - -  +-----+-----+-----+-----+
+        //  |d2|d3|          | q4  | q5  | q6  | q7  |
+        //  |d2|d3|          | q4  | q5  | q6  | q7  |
+        //  |d2|d3|          | q4  | q5  | q6  | q7  |
+        //  |d2|d3|          | q4  | q5  | q6  | q7  |
+        //  +--+--+ - - - -  +-----+-----+-----+-----+
+        //  |d4|d5|          | q8  | q9  | q10 | q11 |
+        //  |d4|d5|          | q8  | q9  | q10 | q11 |
+        //  |d4|d5|          | q8  | q9  | q10 | q11 |
+        //  |d4|d5|          | q8  | q9  | q10 | q11 |
+        //  +--+--+ - - - -  +-----+-----+-----+-----+
+        //  |d6|d7|          | q12 | q13 | q14 | q15 |
+        //  |d6|d7|          | q12 | q13 | q14 | q15 |
+        //  |d6|d7|          | q12 | q13 | q14 | q15 |
+        //  |d6|d7|          | q12 | q13 | q14 | q15 |
+        //  +--+--+ - - - -  +-----+-----+-----+-----+
+        //
+        //                            Accumulator
+
+        // Expand Lhs/Rhs cells to 16 bit.
+        // Note: moving these vmovl's further down to allow for
+        // longer data pipelining helps a little on A57 but is
+        // harmful on A53 --- It looks as if A53 doesn't like
+        // interleaving vmovl's into the vmlal's.
+        "vmovl.u8 q0, d0\n"
+        "vmovl.u8 q1, d2\n"
+        "vmovl.u8 q2, d4\n"
+        "vmovl.u8 q3, d6\n"
+
+        // Multiply-accumulate, level of depth 0
+        "vmlal.u16 q4, d2, d0[0]\n"
+        "vmlal.u16 q5, d2, d0[1]\n"
+        "vmlal.u16 q6, d2, d0[2]\n"
+        "vmlal.u16 q7, d2, d0[3]\n"
+        "vldr d2, [%[lhs_ptr]]\n"
+        "vmlal.u16 q8, d4, d0[0]\n"
+        "vmlal.u16 q9, d4, d0[1]\n"
+        "vmlal.u16 q10, d4, d0[2]\n"
+        "vmlal.u16 q11, d4, d0[3]\n"
+        "vldr d4, [%[lhs_ptr], #8]\n"
+        "vmlal.u16 q12, d6, d0[0]\n"
+        "vmlal.u16 q13, d6, d0[1]\n"
+        "vmlal.u16 q14, d6, d0[2]\n"
+        "vmlal.u16 q15, d6, d0[3]\n"
+        "vldr d6, [%[lhs_ptr], #16]\n"
+        "vldr d0, [%[rhs_ptr]]\n"
+
+        // Multiply-accumulate, level of depth 1
+        "vmlal.u16 q4, d3, d1[0]\n"
+        "vmlal.u16 q5, d3, d1[1]\n"
+        "add %[lhs_ptr], #24\n"
+        "vmlal.u16 q6, d3, d1[2]\n"
+        "vmlal.u16 q7, d3, d1[3]\n"
+        "add %[rhs_ptr], #8\n"
+        "vmlal.u16 q8, d5, d1[0]\n"
+        "vmlal.u16 q9, d5, d1[1]\n"
+        "subs %[depth], #2\n"
+        "vmlal.u16 q10, d5, d1[2]\n"
+        "vmlal.u16 q11, d5, d1[3]\n"
+        "vmlal.u16 q12, d7, d1[0]\n"
+        "vmlal.u16 q13, d7, d1[1]\n"
+        "vmlal.u16 q14, d7, d1[2]\n"
+        "vmlal.u16 q15, d7, d1[3]\n"
+
+        "bne " GEMMLOWP_LABEL_LOOP "b\n"
+
+        GEMMLOWP_LABEL_AFTER_LOOP
+        ":\n"
+
+        // Expand Lhs/Rhs cells to 16 bit.
+        "vmovl.u8 q0, d0\n"
+        "vmovl.u8 q1, d2\n"
+        "vmovl.u8 q2, d4\n"
+        "vmovl.u8 q3, d6\n"
+
+        // Multiply-accumulate, level of depth 0
+        "vmlal.u16 q4, d2, d0[0]\n"
+        "vmlal.u16 q5, d2, d0[1]\n"
+        "vmlal.u16 q6, d2, d0[2]\n"
+        "vmlal.u16 q7, d2, d0[3]\n"
+        "vmlal.u16 q8, d4, d0[0]\n"
+        "vmlal.u16 q9, d4, d0[1]\n"
+        "vmlal.u16 q10, d4, d0[2]\n"
+        "vmlal.u16 q11, d4, d0[3]\n"
+        "vmlal.u16 q12, d6, d0[0]\n"
+        "vmlal.u16 q13, d6, d0[1]\n"
+        "vmlal.u16 q14, d6, d0[2]\n"
+        "vmlal.u16 q15, d6, d0[3]\n"
+
+        // Multiply-accumulate, level of depth 1
+        "vmlal.u16 q4, d3, d1[0]\n"
+        "vmlal.u16 q5, d3, d1[1]\n"
+        "vmlal.u16 q6, d3, d1[2]\n"
+        "vmlal.u16 q7, d3, d1[3]\n"
+        "vmlal.u16 q8, d5, d1[0]\n"
+        "vmlal.u16 q9, d5, d1[1]\n"
+        "vmlal.u16 q10, d5, d1[2]\n"
+        "vmlal.u16 q11, d5, d1[3]\n"
+        "vmlal.u16 q12, d7, d1[0]\n"
+        "vmlal.u16 q13, d7, d1[1]\n"
+        "vmlal.u16 q14, d7, d1[2]\n"
+        "vmlal.u16 q15, d7, d1[3]\n"
+
+        // Store accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vst1.32 {d8, d9},   [r0]!\n"
+        "vst1.32 {d16, d17}, [r0]!\n"
+        "vst1.32 {d24, d25}, [r0]!\n"
+        "vst1.32 {d10, d11}, [r0]!\n"
+        "vst1.32 {d18, d19}, [r0]!\n"
+        "vst1.32 {d26, d27}, [r0]!\n"
+        "vst1.32 {d12, d13}, [r0]!\n"
+        "vst1.32 {d20, d21}, [r0]!\n"
+        "vst1.32 {d28, d29}, [r0]!\n"
+        "vst1.32 {d14, d15}, [r0]!\n"
+        "vst1.32 {d22, d23}, [r0]!\n"
+        "vst1.32 {d30, d31}, [r0]!\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
+        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
+        "d28", "d29", "d30", "d31");
+  }
+};
+
+// This is Maciek Chociej's fast kernel from gemmlowp/meta/, which does not
+// expand operands to 16 bit before multiplying. Search for
+//      mul_3x8_3x8_int32_lhsadd_rhsadd
+// in this file:
+// https://raw.githubusercontent.com/google/gemmlowp/e4b9d858b6637d5d0058bfa3d869d2b95864251b/meta/single_thread_gemm.h
+struct NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand {
+  typedef std::uint8_t OperandType;
+  typedef std::uint32_t AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<3, 8, CellOrder::WidthMajor>, 1>,
+      KernelSideFormat<CellFormat<3, 8, CellOrder::WidthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Clear aggregators.
+        "vmov.i32 q0, #0\n"
+        "vmov.i32 q1, #0\n"
+        "vmov.i32 q2, #0\n"
+        "vmov.i32 q3, q0\n"
+        "vmov.i32 q4, q1\n"
+        "vmov.i32 q5, q2\n"
+        "vmov.i32 q6, q3\n"
+        "vmov.i32 q7, q4\n"
+        "vmov.i32 q8, q5\n"
+
+        // Loop head
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Subtract counter.
+        "subs %[depth], %[depth], #8\n"
+
+        "vld1.8 {d18, d19, d20}, [%[rhs_ptr]]!\n"
+        "vld1.8 {d21, d22, d23}, [%[lhs_ptr]]!\n"
+        "vmull.u8 q12, d18, d21\n"
+        "vmull.u8 q13, d18, d22\n"
+        "vmull.u8 q14, d18, d23\n"
+        "vmull.u8 q15, d19, d21\n"
+        "vpadal.u16 q0, q12\n"
+        "vpadal.u16 q1, q13\n"
+        "vpadal.u16 q2, q14\n"
+        "vpadal.u16 q3, q15\n"
+        "vmull.u8 q12, d19, d22\n"
+        "vmull.u8 q13, d19, d23\n"
+        "vmull.u8 q14, d20, d21\n"
+        "vmull.u8 q15, d20, d22\n"
+        "vmull.u8 q9, d20, d23\n"
+        "vpadal.u16 q4, q12\n"
+        "vpadal.u16 q5, q13\n"
+        "vpadal.u16 q6, q14\n"
+        "vpadal.u16 q7, q15\n"
+        "vpadal.u16 q8, q9\n"
+
+        // Loop branch
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Horizontal reduce aggregators, step 1
+        "vpadd.u32 d0, d0, d1\n"
+        "vpadd.u32 d2, d2, d3\n"
+        "vpadd.u32 d4, d4, d5\n"
+        "vpadd.u32 d6, d6, d7\n"
+        "vpadd.u32 d8, d8, d9\n"
+        "vpadd.u32 d10, d10, d11\n"
+        "vpadd.u32 d12, d12, d13\n"
+        "vpadd.u32 d14, d14, d15\n"
+        "vpadd.u32 d16, d16, d17\n"
+
+        // Horizontal reduce aggregators, step 2
+        "vpadd.u32 d0, d0, d2\n"
+        "vpadd.u32 d1, d4, d4\n"
+        "vpadd.u32 d6, d6, d8\n"
+        "vpadd.u32 d7, d10, d10\n"
+        "vpadd.u32 d12, d12, d14\n"
+        "vpadd.u32 d13, d16, d16\n"
+
+        // Load accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vld1.32 {d2}, [r0]!\n"
+        "vld1.32 {d3[0]}, [r0]!\n"
+
+        "vld1.32 {d8}, [r0]!\n"
+        "vld1.32 {d9[0]}, [r0]!\n"
+
+        "vld1.32 {d14}, [r0]!\n"
+        "vld1.32 {d15[0]}, [r0]!\n"
+
+        // Accumulate
+        "vadd.s32 q0, q0, q1\n"
+        "vadd.s32 q3, q3, q4\n"
+        "vadd.s32 q6, q6, q7\n"
+
+        // Store accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vst1.32 {d0}, [r0]!\n"
+        "vst1.32 {d1[0]}, [r0]!\n"
+
+        "vst1.32 {d6}, [r0]!\n"
+        "vst1.32 {d7[0]}, [r0]!\n"
+
+        "vst1.32 {d12}, [r0]!\n"
+        "vst1.32 {d13[0]}, [r0]!\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
+        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
+        "d28", "d29", "d30", "d31");
+  }
+};
+
+// Fast kernel operating on int8 operands.
+// It is assumed that one of the two int8 operands only takes values
+// in [-127, 127], while the other may freely range in [-128, 127].
+// The issue with both operands taking the value -128 is that:
+// -128*-128 + -128*-128 == -32768 overflows int16.
+// Every other expression a*b + c*d, for any int8 a,b,c,d, fits in int16
+// range. That is the basic idea of this kernel.
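+// (Worked bound, for illustration: with one operand restricted to
+// [-127, 127], the worst case is |a*b + c*d| <= 127*128 + 127*128 = 32512,
+// which still fits in the int16 range [-32768, 32767].)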
+struct NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits {
+  typedef std::int8_t OperandType;
+  typedef std::int32_t AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 16, CellOrder::WidthMajor>, 1>,
+      KernelSideFormat<CellFormat<2, 16, CellOrder::WidthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    std::size_t start_depth = 123;
+    std::size_t run_depth = depth;
+    std::size_t dst_col_stride = 4;
+    AccumulatorType* dst_ptr = accum_ptr;
+    asm volatile(
+
+        // Overview of register layout:
+        //
+        // A 2x16 block of Rhs is stored in 8 bit in d0--d3.
+        // A 4x16 block of Lhs is stored in 8 bit in d4--d7. That is only
+        // half of the register space required, so we loop over these registers
+        // twice. Only half of it, a 2x16 block, is stored in d4--d7 at
+        // any given time.
+        //
+        // A 4x2 block of accumulators is stored in q8--q15 (as 4x32 bit
+        // components which need to be horizontally-added at the end)
+        //
+        // The Lhs vectors are multiplied by the Rhs vectors with a widening
+        // multiply over the 8 first levels of depth, producing int16x8
+        // vectors of products for each position in the accumulator matrix.
+        // Here comes the special trick: since the operands are signed int8,
+        // their range being [ -2^7 , 2^7 ), their products are in range
+        // [ -2^14 , 2^14 - 1 ), meaning that we can add two such values
+        // without any risk of overflowing int16.
+        // We thus proceed with the 8 next levels of depth, multiplying
+        // again Lhs by Rhs, accumulating into this existing int16x8 vector.
+        //
+        // Only then, having processed 16 levels of depth, do we need to
+        // horizontally add these int16x8 accumulators into the final
+        // int32x4 accumulators.
+        //
+        // As we do not have enough registers to store all 16 int16x8
+        // temporary-16bit-accumulators, we have them cycle through q4--q7.
+        //
+        //
+        // Register layout (ignoring the q4--q7 temporary 16bit accumulators):
+        //
+        //                               +----+----+
+        //                               | d0 | d2 |
+        //                               | .  | .  |
+        //                               | .  | .  |
+        //                               | .  | .  |
+        //                       Rhs     +----+----+
+        //                               | d1 | d3 |
+        //                               | .  | .  |
+        //                               | .  | .  |
+        //                               | .  | .  |
+        //                               +----+----+
+        //
+        //                               |    |    |
+        //
+        //    Lhs                        |    |    |
+        //
+        //  +--------+--------+ - - - -  +----+----+
+        //  | d4 ... | d5 ... |          | q8 | q9 |
+        //  | d6 ... | d7 ... |          | q10| q11|
+        //  | d4 ... | d5 ... |          | q12| q13|
+        //  | d6 ... | d7 ... |          | q14| q15|
+        //  +--------+--------+ - - - -  +----+----+
+        //
+        //                               Accumulator
+        //
+
+        // Clear accumulators, and, interleaved with it,
+        // initial loads of the first loop iteration,
+        // taken out of the loop so that in the loop itself we have
+        // optimal streaming of data from memory.
+        "vldr d0, [%[rhs_ptr], #0]\n"
+        "vmov.i32 q8, #0\n"
+        "vldr d4, [%[lhs_ptr], #0]\n"
+        "vmov.i32 q9, #0\n"
+        "vldr d2, [%[rhs_ptr], #16]\n"
+        "vmov.i32 q10, q8\n"
+        "vldr d6, [%[lhs_ptr], #16]\n"
+        "vmov.i32 q11, q8\n"
+        "vldr d1, [%[rhs_ptr], #8]\n"
+        "vmov.i32 q12, q8\n"
+        "vldr d5, [%[lhs_ptr], #8]\n"
+        "vmov.i32 q13, q8\n"
+        "vldr d3, [%[rhs_ptr], #24]\n"
+        "vmov.i32 q14, q8\n"
+        "vldr d7, [%[lhs_ptr], #24]\n"
+        "vmov.i32 q15, q8\n"
+
+        // General loop.
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Multiply 8 first levels of depth.
+        "vmull.s8    q4,  d0,  d4\n"
+        "add %[rhs_ptr], %[rhs_ptr], #32\n"
+        "vmull.s8    q5,  d2,  d4\n"
+        "vldr d4, [%[lhs_ptr], #32]\n"
+        "vmull.s8    q6,  d0,  d6\n"
+        "vmull.s8    q7,  d2,  d6\n"
+        "vldr d6, [%[lhs_ptr], #48]\n"
+
+        // Multiply-accumulate second-half, again into the same
+        // 16bit local accumulator registers. This is where we
+        // take advantage of having int8 instead of uint8 and therefore
+        // being able to accumulate two products into int16.
+        "vmlal.s8    q4,  d1,  d5\n"
+        "vmlal.s8    q5,  d3,  d5\n"
+        "vldr d5, [%[lhs_ptr], #40]\n"
+        "vmlal.s8    q6,  d1,  d7\n"
+        "vmlal.s8    q7,  d3,  d7\n"
+        "vldr d7, [%[lhs_ptr], #56]\n"
+
+        // Add pairwise, accumulate into 32-bit accumulators.
+        "vpadal.s16   q8,  q4\n"
+        "add %[lhs_ptr], %[lhs_ptr], #64\n"
+        "vpadal.s16   q9,  q5\n"
+        "subs %[run_depth], %[run_depth], #16\n"
+        "vpadal.s16   q10, q6\n"
+        "vpadal.s16   q11, q7\n"
+
+        "beq " GEMMLOWP_LABEL_AFTER_LOOP
+        "f\n"
+
+        // Multiply first half.
+        "vmull.s8    q4,  d0,  d4\n"
+        "vmull.s8    q5,  d2,  d4\n"
+        "vldr d4, [%[lhs_ptr], #0]\n"
+        "vmull.s8    q6,  d0,  d6\n"
+        "vldr d0, [%[rhs_ptr], #0]\n"
+        "vmull.s8    q7,  d2,  d6\n"
+        "vldr d2, [%[rhs_ptr], #16]\n"
+
+        // Multiply-accumulate second-half, again into the same
+        // 16bit local accumulator registers. This is where we
+        // take advantage of having int8 instead of uint8 and therefore
+        // being able to accumulate two products into int16.
+        "vmlal.s8    q4,  d1,  d5\n"
+        "vldr d6, [%[lhs_ptr], #16]\n"
+        "vmlal.s8    q5,  d3,  d5\n"
+        "vldr d5, [%[lhs_ptr], #8]\n"
+        "vmlal.s8    q6,  d1,  d7\n"
+        "vldr d1, [%[rhs_ptr], #8]\n"
+        "vmlal.s8    q7,  d3,  d7\n"
+        "vldr d3, [%[rhs_ptr], #24]\n"
+
+        // Add pairwise, accumulate into 32-bit accumulators.
+        "vpadal.s16   q12, q4\n"
+        "vldr d7, [%[lhs_ptr], #24]\n"
+        "vpadal.s16   q13, q5\n"
+        "vpadal.s16   q14, q6\n"
+        "vpadal.s16   q15, q7\n"
+
+        "b " GEMMLOWP_LABEL_LOOP "b\n"
+
+        GEMMLOWP_LABEL_AFTER_LOOP
+        ":\n"
+
+        // Multiply first half.
+        "vmull.s8    q4,  d0,  d4\n"
+        "vmull.s8    q5,  d2,  d4\n"
+        "vmull.s8    q6,  d0,  d6\n"
+        "vmull.s8    q7,  d2,  d6\n"
+
+        // Multiply-accumulate second-half, again into the same
+        // 16bit local accumulator registers. This is where we
+        // take advantage of having int8 instead of uint8 and therefore
+        // being able to accumulate two products into int16.
+        "vmlal.s8    q4,  d1,  d5\n"
+        "vmlal.s8    q5,  d3,  d5\n"
+        "vmlal.s8    q6,  d1,  d7\n"
+        "vmlal.s8    q7,  d3,  d7\n"
+
+        // Add pairwise, accumulate into 32-bit accumulators.
+        "vpadal.s16   q12, q4\n"
+        "vpadal.s16   q13, q5\n"
+        "vpadal.s16   q14, q6\n"
+        "vpadal.s16   q15, q7\n"
+        "cmp %[start_depth], #0\n"
+
+        // Reduce 32bit accumulators horizontally.
+        "vpadd.s32 d0, d16, d17\n"
+        "vpadd.s32 d1, d18, d19\n"
+        "vpadd.s32 d2, d20, d21\n"
+        "vpadd.s32 d3, d22, d23\n"
+        "vpadd.s32 d4, d24, d25\n"
+        "vpadd.s32 d5, d26, d27\n"
+        "vpadd.s32 d6, d28, d29\n"
+        "vpadd.s32 d7, d30, d31\n"
+
+        "bne " GEMMLOWP_LABEL_ACCUMULATE_EXISTING_DST_VALUES
+        "f\n"
+
+        // Reduce 32bit accumulators horizontally, second pass
+        // (each pass adds pairwise; we need to add 4-wise).
+        "vpadd.s32 d8, d0, d2\n"
+        "vpadd.s32 d9, d4, d6\n"
+        "vpadd.s32 d10, d1, d3\n"
+        "vpadd.s32 d11, d5, d7\n"
+
+        "b " GEMMLOWP_LABEL_STORE "f\n"
+
+        GEMMLOWP_LABEL_ACCUMULATE_EXISTING_DST_VALUES
+        ":\n"
+
+        // Reduce 32bit accumulators horizontally, second pass
+        // (each pass adds pairwise; we need to add 4-wise),
+        // and load destination values from memory.
+        "mov r0, %[dst_ptr]\n"
+        "vld1.32 {d16, d17}, [r0]!\n"
+        "vpadd.s32 d8, d0, d2\n"
+        "vpadd.s32 d9, d4, d6\n"
+        "vld1.32 {d18, d19}, [r0]\n"
+        "vpadd.s32 d10, d1, d3\n"
+        "vpadd.s32 d11, d5, d7\n"
+
+        // Add horizontally-reduced accumulators into
+        // the values loaded from memory
+        "vadd.s32 q4, q8, q4\n"
+        "vadd.s32 q5, q9, q5\n"
+
+        GEMMLOWP_LABEL_STORE
+        ":\n"
+        // Store back into memory
+        "mov r0, %[dst_ptr]\n"
+        "vst1.32 {d8, d9}, [r0]!\n"
+        "vst1.32 {d10, d11}, [r0]\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [dst_ptr] "+r"(dst_ptr), [run_depth] "+r"(run_depth)
+        :  // inputs
+        [start_depth] "r"(start_depth)
+        :  // clobbers
+        "cc", "memory", "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
+        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
+        "d28", "d29", "d30", "d31");
+  }
+};
+
+// We don't actually use int32*int32 in production. This is just an
+// experiment to help dissociate the effect of integer-vs-float from the
+// effect of operand width.
+struct NEON_32bit_GEMM_Int32_WithScalar {
+  typedef std::int32_t OperandType;
+  typedef std::int32_t AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vld1.32 {d8, d9},   [r0]!\n"
+        "vld1.32 {d16, d17}, [r0]!\n"
+        "vld1.32 {d24, d25}, [r0]!\n"
+        "vld1.32 {d10, d11}, [r0]!\n"
+        "vld1.32 {d18, d19}, [r0]!\n"
+        "vld1.32 {d26, d27}, [r0]!\n"
+        "vld1.32 {d12, d13}, [r0]!\n"
+        "vld1.32 {d20, d21}, [r0]!\n"
+        "vld1.32 {d28, d29}, [r0]!\n"
+        "vld1.32 {d14, d15}, [r0]!\n"
+        "vld1.32 {d22, d23}, [r0]!\n"
+        "vld1.32 {d30, d31}, [r0]!\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Load 1 Rhs cell of size 1x4
+        "vld1.32 {d0, d1}, [%[rhs_ptr]]!\n"
+
+        // Load 3 Lhs cells of size 4x1 each
+        "vld1.32 {d2, d3}, [%[lhs_ptr]]!\n"
+        "vld1.32 {d4, d5}, [%[lhs_ptr]]!\n"
+        "vld1.32 {d6, d7}, [%[lhs_ptr]]!\n"
+
+        // Multiply-accumulate
+        "vmla.s32 q4, q1, d0[0]\n"
+        "vmla.s32 q5, q1, d0[1]\n"
+        "vmla.s32 q6, q1, d1[0]\n"
+        "vmla.s32 q7, q1, d1[1]\n"
+        "vmla.s32 q8, q2, d0[0]\n"
+        "vmla.s32 q9, q2, d0[1]\n"
+        "vmla.s32 q10, q2, d1[0]\n"
+        "vmla.s32 q11, q2, d1[1]\n"
+        "vmla.s32 q12, q3, d0[0]\n"
+        "vmla.s32 q13, q3, d0[1]\n"
+        "vmla.s32 q14, q3, d1[0]\n"
+        "vmla.s32 q15, q3, d1[1]\n"
+
+        // Loop. Decrement loop index (depth) by 1, since we just handled 1
+        // level of depth.
+        "subs %[depth], #1\n"
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Store accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vst1.32 {d8, d9},   [r0]!\n"
+        "vst1.32 {d16, d17}, [r0]!\n"
+        "vst1.32 {d24, d25}, [r0]!\n"
+        "vst1.32 {d10, d11}, [r0]!\n"
+        "vst1.32 {d18, d19}, [r0]!\n"
+        "vst1.32 {d26, d27}, [r0]!\n"
+        "vst1.32 {d12, d13}, [r0]!\n"
+        "vst1.32 {d20, d21}, [r0]!\n"
+        "vst1.32 {d28, d29}, [r0]!\n"
+        "vst1.32 {d14, d15}, [r0]!\n"
+        "vst1.32 {d22, d23}, [r0]!\n"
+        "vst1.32 {d30, d31}, [r0]!\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
+        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
+        "d28", "d29", "d30", "d31");
+  }
+};
+
+// Not very efficient kernel, just an experiment to see what we can do
+// without using NEON multiply-with-scalar instructions.
+struct NEON_32bit_GEMM_Float32_MLA_WithVectorDuplicatingScalar {
+  typedef float OperandType;
+  typedef float AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vld1.32 {d8, d9},   [r0]!\n"
+        "vld1.32 {d16, d17}, [r0]!\n"
+        "vld1.32 {d24, d25}, [r0]!\n"
+        "vld1.32 {d10, d11}, [r0]!\n"
+        "vld1.32 {d18, d19}, [r0]!\n"
+        "vld1.32 {d26, d27}, [r0]!\n"
+        "vld1.32 {d12, d13}, [r0]!\n"
+        "vld1.32 {d20, d21}, [r0]!\n"
+        "vld1.32 {d28, d29}, [r0]!\n"
+        "vld1.32 {d14, d15}, [r0]!\n"
+        "vld1.32 {d22, d23}, [r0]!\n"
+        "vld1.32 {d30, d31}, [r0]!\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Load 3 Lhs cells of size 4x1 each
+        "vld1.32 {d2, d3}, [%[lhs_ptr]]!\n"
+        "vld1.32 {d4, d5}, [%[lhs_ptr]]!\n"
+        "vld1.32 {d6, d7}, [%[lhs_ptr]]!\n"
+
+        // Multiply-accumulate
+        "vld1.32 {d0[], d1[]}, [%[rhs_ptr]]!\n"
+        "vmla.f32 q4, q1, q0\n"
+        "vmla.f32 q8, q2, q0\n"
+        "vmla.f32 q12, q3, q0\n"
+        "vld1.32 {d0[], d1[]}, [%[rhs_ptr]]!\n"
+        "vmla.f32 q5, q1, q0\n"
+        "vmla.f32 q9, q2, q0\n"
+        "vmla.f32 q13, q3, q0\n"
+        "vld1.32 {d0[], d1[]}, [%[rhs_ptr]]!\n"
+        "vmla.f32 q6, q1, q0\n"
+        "vmla.f32 q10, q2, q0\n"
+        "vmla.f32 q14, q3, q0\n"
+        "vld1.32 {d0[], d1[]}, [%[rhs_ptr]]!\n"
+        "vmla.f32 q7, q1, q0\n"
+        "vmla.f32 q11, q2, q0\n"
+        "vmla.f32 q15, q3, q0\n"
+
+        // Loop. Decrement loop index (depth) by 1, since we just handled 1
+        // level of depth.
+        "subs %[depth], #1\n"
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Store accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vst1.32 {d8, d9},   [r0]!\n"
+        "vst1.32 {d16, d17}, [r0]!\n"
+        "vst1.32 {d24, d25}, [r0]!\n"
+        "vst1.32 {d10, d11}, [r0]!\n"
+        "vst1.32 {d18, d19}, [r0]!\n"
+        "vst1.32 {d26, d27}, [r0]!\n"
+        "vst1.32 {d12, d13}, [r0]!\n"
+        "vst1.32 {d20, d21}, [r0]!\n"
+        "vst1.32 {d28, d29}, [r0]!\n"
+        "vst1.32 {d14, d15}, [r0]!\n"
+        "vst1.32 {d22, d23}, [r0]!\n"
+        "vst1.32 {d30, d31}, [r0]!\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
+        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
+        "d28", "d29", "d30", "d31");
+  }
+};
+
+// Not very efficient kernel, just an experiment to see what we can do
+// without using NEON multiply-with-scalar instructions.
+// This variant is relevant because on ARMv7, FMA has no with-scalar variant.
+struct NEON_32bit_GEMM_Float32_FMA_WithVectorDuplicatingScalar {
+  typedef float OperandType;
+  typedef float AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vld1.32 {d8, d9},   [r0]!\n"
+        "vld1.32 {d16, d17}, [r0]!\n"
+        "vld1.32 {d24, d25}, [r0]!\n"
+        "vld1.32 {d10, d11}, [r0]!\n"
+        "vld1.32 {d18, d19}, [r0]!\n"
+        "vld1.32 {d26, d27}, [r0]!\n"
+        "vld1.32 {d12, d13}, [r0]!\n"
+        "vld1.32 {d20, d21}, [r0]!\n"
+        "vld1.32 {d28, d29}, [r0]!\n"
+        "vld1.32 {d14, d15}, [r0]!\n"
+        "vld1.32 {d22, d23}, [r0]!\n"
+        "vld1.32 {d30, d31}, [r0]!\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Load 3 Lhs cells of size 4x1 each
+        "vld1.32 {d2, d3}, [%[lhs_ptr]]!\n"
+        "vld1.32 {d4, d5}, [%[lhs_ptr]]!\n"
+        "vld1.32 {d6, d7}, [%[lhs_ptr]]!\n"
+
+        // Multiply-accumulate
+        "vld1.32 {d0[], d1[]}, [%[rhs_ptr]]!\n"
+        "vfma.f32 q4, q1, q0\n"
+        "vfma.f32 q8, q2, q0\n"
+        "vfma.f32 q12, q3, q0\n"
+        "vld1.32 {d0[], d1[]}, [%[rhs_ptr]]!\n"
+        "vfma.f32 q5, q1, q0\n"
+        "vfma.f32 q9, q2, q0\n"
+        "vfma.f32 q13, q3, q0\n"
+        "vld1.32 {d0[], d1[]}, [%[rhs_ptr]]!\n"
+        "vfma.f32 q6, q1, q0\n"
+        "vfma.f32 q10, q2, q0\n"
+        "vfma.f32 q14, q3, q0\n"
+        "vld1.32 {d0[], d1[]}, [%[rhs_ptr]]!\n"
+        "vfma.f32 q7, q1, q0\n"
+        "vfma.f32 q11, q2, q0\n"
+        "vfma.f32 q15, q3, q0\n"
+
+        // Loop. Decrement loop index (depth) by 1, since we just handled 1
+        // level of depth.
+        "subs %[depth], #1\n"
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Store accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vst1.32 {d8, d9},   [r0]!\n"
+        "vst1.32 {d16, d17}, [r0]!\n"
+        "vst1.32 {d24, d25}, [r0]!\n"
+        "vst1.32 {d10, d11}, [r0]!\n"
+        "vst1.32 {d18, d19}, [r0]!\n"
+        "vst1.32 {d26, d27}, [r0]!\n"
+        "vst1.32 {d12, d13}, [r0]!\n"
+        "vst1.32 {d20, d21}, [r0]!\n"
+        "vst1.32 {d28, d29}, [r0]!\n"
+        "vst1.32 {d14, d15}, [r0]!\n"
+        "vst1.32 {d22, d23}, [r0]!\n"
+        "vst1.32 {d30, d31}, [r0]!\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
+        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
+        "d28", "d29", "d30", "d31");
+  }
+};
+
+// This is the "most natural" kernel, using NEON multiply-with-scalar
+// instructions.
+struct NEON_32bit_GEMM_Float32_MLA_WithScalar {
+  typedef float OperandType;
+  typedef float AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vld1.32 {d8, d9},   [r0]!\n"
+        "vld1.32 {d16, d17}, [r0]!\n"
+        "vld1.32 {d24, d25}, [r0]!\n"
+        "vld1.32 {d10, d11}, [r0]!\n"
+        "vld1.32 {d18, d19}, [r0]!\n"
+        "vld1.32 {d26, d27}, [r0]!\n"
+        "vld1.32 {d12, d13}, [r0]!\n"
+        "vld1.32 {d20, d21}, [r0]!\n"
+        "vld1.32 {d28, d29}, [r0]!\n"
+        "vld1.32 {d14, d15}, [r0]!\n"
+        "vld1.32 {d22, d23}, [r0]!\n"
+        "vld1.32 {d30, d31}, [r0]!\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Load 1 Rhs cell of size 1x4
+        "vld1.32 {d0, d1}, [%[rhs_ptr]]!\n"
+
+        // Load 3 Lhs cells of size 4x1 each
+        "vld1.32 {d2, d3}, [%[lhs_ptr]]!\n"
+        "vld1.32 {d4, d5}, [%[lhs_ptr]]!\n"
+        "vld1.32 {d6, d7}, [%[lhs_ptr]]!\n"
+
+        // Multiply-accumulate
+        "vmla.f32 q4, q1, d0[0]\n"
+        "vmla.f32 q5, q1, d0[1]\n"
+        "vmla.f32 q6, q1, d1[0]\n"
+        "vmla.f32 q7, q1, d1[1]\n"
+        "vmla.f32 q8, q2, d0[0]\n"
+        "vmla.f32 q9, q2, d0[1]\n"
+        "vmla.f32 q10, q2, d1[0]\n"
+        "vmla.f32 q11, q2, d1[1]\n"
+        "vmla.f32 q12, q3, d0[0]\n"
+        "vmla.f32 q13, q3, d0[1]\n"
+        "vmla.f32 q14, q3, d1[0]\n"
+        "vmla.f32 q15, q3, d1[1]\n"
+
+        // Loop. Decrement loop index (depth) by 1, since we just handled 1
+        // level of depth.
+        "subs %[depth], #1\n"
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Store accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vst1.32 {d8, d9},   [r0]!\n"
+        "vst1.32 {d16, d17}, [r0]!\n"
+        "vst1.32 {d24, d25}, [r0]!\n"
+        "vst1.32 {d10, d11}, [r0]!\n"
+        "vst1.32 {d18, d19}, [r0]!\n"
+        "vst1.32 {d26, d27}, [r0]!\n"
+        "vst1.32 {d12, d13}, [r0]!\n"
+        "vst1.32 {d20, d21}, [r0]!\n"
+        "vst1.32 {d28, d29}, [r0]!\n"
+        "vst1.32 {d14, d15}, [r0]!\n"
+        "vst1.32 {d22, d23}, [r0]!\n"
+        "vst1.32 {d30, d31}, [r0]!\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
+        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
+        "d28", "d29", "d30", "d31");
+  }
+};
+
+// Faster kernel contributed by ARM in 64bit form
+// (see NEON_64bit_GEMM_Float32_WithScalar_A53), then ported to 32bit code.
+// Tuned for A53.
+struct NEON_32bit_GEMM_Float32_WithScalar_A53 {
+  typedef float OperandType;
+  typedef float AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vld1.32 {d8, d9},   [r0]!\n"
+        "vld1.32 {d16, d17}, [r0]!\n"
+        "vld1.32 {d24, d25}, [r0]!\n"
+        "vld1.32 {d10, d11}, [r0]!\n"
+        "vld1.32 {d18, d19}, [r0]!\n"
+        "vld1.32 {d26, d27}, [r0]!\n"
+        "vld1.32 {d12, d13}, [r0]!\n"
+        "vld1.32 {d20, d21}, [r0]!\n"
+        "vld1.32 {d28, d29}, [r0]!\n"
+        "vld1.32 {d14, d15}, [r0]!\n"
+        "vld1.32 {d22, d23}, [r0]!\n"
+        "vld1.32 {d30, d31}, [r0]!\n"
+
+        // Overview of register layout:
+        //
+        // A 1x4 cell of Rhs is stored in d0--d1 (q0).
+        // A 12x1 block of 3 4x1 cells of Lhs is stored in d2--d7
+        // (q1--q3).
+        // A 12x4 block of accumulators is stored in q4--q15.
+        //
+        //                   +-----+-----+-----+-----+
+        //             Rhs   |d0[0]|d0[1]|d1[0]|d1[1]|
+        //                   +-----+-----+-----+-----+
+        //
+        //                   |     |     |     |     |
+        //
+        //  Lhs              |     |     |     |     |
+        //
+        //  +--+- - - - - -  +-----+-----+-----+-----+
+        //  |d2|             | q4  | q5  | q6  | q7  |
+        //  |d2|             | q4  | q5  | q6  | q7  |
+        //  |d3|             | q4  | q5  | q6  | q7  |
+        //  |d3|             | q4  | q5  | q6  | q7  |
+        //  +--+- - - - - -  +-----+-----+-----+-----+
+        //  |d4|             | q8  | q9  | q10 | q11 |
+        //  |d4|             | q8  | q9  | q10 | q11 |
+        //  |d5|             | q8  | q9  | q10 | q11 |
+        //  |d5|             | q8  | q9  | q10 | q11 |
+        //  +--+ - - - - - - +-----+-----+-----+-----+
+        //  |d6|             | q12 | q13 | q14 | q15 |
+        //  |d6|             | q12 | q13 | q14 | q15 |
+        //  |d7|             | q12 | q13 | q14 | q15 |
+        //  |d7|             | q12 | q13 | q14 | q15 |
+        //  +--+- - - - - -  +-----+-----+-----+-----+
+        //
+        //                            Accumulator
+
+        // Load Rhs cell
+        "vldr d0, [%[rhs_ptr]]\n"
+        "ldr r2, [%[rhs_ptr], #8]\n"
+        "ldr r3, [%[rhs_ptr], #12]\n"
+
+        // Load 1st Lhs Cell
+        "vld1.32 {d2, d3}, [%[lhs_ptr]]\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        "vldr d4, [%[lhs_ptr], #16]\n"  // Load 1st half of 2nd Lhs cell
+        "vmov d1, r2, r3\n"             // Prepare 2nd half of Rhs cell
+        "vmla.f32 q4, q1, d0[0]\n"      // Multiply 1st Lhs cell with column 0
+        "ldr r2, [%[lhs_ptr], #24]\n"   // Load 2nd half of 2nd Lhs cell, part 1
+        "vmla.f32 q5, q1, d0[1]\n"      // Multiply 1st Lhs cell with column 1
+        "ldr r3, [%[lhs_ptr], #28]\n"   // Load 2nd half of 2nd Lhs cell, part 2
+        "vmla.f32 q6, q1, d1[0]\n"      // Multiply 1st Lhs cell with column 2
+        "subs %[depth], #1\n"
+
+        "vldr d6, [%[lhs_ptr], #32]\n"  // Load 1st half of 3rd Lhs cell
+        "vmov d5, r2, r3\n"             // Prepare 2nd half of 2nd Lhs cell
+        "vmla.f32 q7, q1, d1[1]\n"      // Multiply 1st Lhs cell with column 3
+        "ldr r2, [%[lhs_ptr], #40]\n"   // Load 2nd half of 3rd Lhs cell, part 1
+        "vmla.f32 q8, q2, d0[0]\n"      // Multiply 2nd Lhs cell with column 0
+        "ldr r3, [%[lhs_ptr], #44]\n"   // Load 2nd half of 3rd Lhs cell, part 2
+        "vmla.f32 q9, q2, d0[1]\n"      // Multiply 2nd Lhs cell with column 1
+        "add %[rhs_ptr], %[rhs_ptr], #16\n"  // Move forward by 1 Rhs cell
+
+        "vldr d2, [%[lhs_ptr], #48]\n"  // Load 1st half of 1st Lhs cell of next
+        // iteration
+        "vmov d7, r2, r3\n"            // Prepare 2nd half of 3rd Lhs cell
+        "vmla.f32 q10, q2, d1[0]\n"    // Multiply 2nd Lhs cell with column 2
+        "ldr r2, [%[lhs_ptr], #56]\n"  // Load 2nd half of 1st Lhs cell of next
+        // iter, part 1
+        "vmla.f32 q12, q3, d0[0]\n"    // Multiply 3rd Lhs cell with column 0
+        "ldr r3, [%[lhs_ptr], #60]\n"  // Load 2nd half of 1st Lhs cell of next
+        // iter, part 2
+        "vmla.f32 q13, q3, d0[1]\n"  // Multiply 3rd Lhs cell with column 1
+        "add %[lhs_ptr], %[lhs_ptr], #48\n"  // Move forward by 3 Lhs cells
+
+        "vldr d0, [%[rhs_ptr]]\n"  // Load 1st half of Rhs cell of next
+        // iteration
+        "vmov d3, r2, r3\n"  // Prepare 2nd half of 1st Lhs cell of next
+        // iteration
+        "vmla.f32 q11, q2, d1[1]\n"   // Multiply 2nd Lhs cell with column 3
+        "ldr r2, [%[rhs_ptr], #8]\n"  // Load 2nd half of Rhs cell of next
+        // iteration, part 1
+        "vmla.f32 q14, q3, d1[0]\n"    // Multiply 3rd Lhs cell with column 2
+        "ldr r3, [%[rhs_ptr], #12]\n"  // Load 2nd half of Rhs cell of next
+        // iteration, part 2
+        "vmla.f32 q15, q3, d1[1]\n"  // Multiply 3rd Lhs cell with column 3
+
+        // Loop branch.  This will dual issue in vmla cycle 3 of the 4th block.
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Store accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vst1.32 {d8, d9},   [r0]!\n"
+        "vst1.32 {d16, d17}, [r0]!\n"
+        "vst1.32 {d24, d25}, [r0]!\n"
+        "vst1.32 {d10, d11}, [r0]!\n"
+        "vst1.32 {d18, d19}, [r0]!\n"
+        "vst1.32 {d26, d27}, [r0]!\n"
+        "vst1.32 {d12, d13}, [r0]!\n"
+        "vst1.32 {d20, d21}, [r0]!\n"
+        "vst1.32 {d28, d29}, [r0]!\n"
+        "vst1.32 {d14, d15}, [r0]!\n"
+        "vst1.32 {d22, d23}, [r0]!\n"
+        "vst1.32 {d30, d31}, [r0]!\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "r0", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5",
+        "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16",
+        "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26",
+        "d27", "d28", "d29", "d30", "d31");
+  }
+};
+
+struct NEON_32bit_GEMM_Float32_WithScalar_A53_depth2 {
+  typedef float OperandType;
+  typedef float AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 2, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 2, CellOrder::DepthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vld1.32 {d8, d9},   [r0]!\n"
+        "vld1.32 {d16, d17}, [r0]!\n"
+        "vld1.32 {d24, d25}, [r0]!\n"
+        "vld1.32 {d10, d11}, [r0]!\n"
+        "vld1.32 {d18, d19}, [r0]!\n"
+        "vld1.32 {d26, d27}, [r0]!\n"
+        "vld1.32 {d12, d13}, [r0]!\n"
+        "vld1.32 {d20, d21}, [r0]!\n"
+        "vld1.32 {d28, d29}, [r0]!\n"
+        "vld1.32 {d14, d15}, [r0]!\n"
+        "vld1.32 {d22, d23}, [r0]!\n"
+        "vld1.32 {d30, d31}, [r0]!\n"
+
+        // Overview of register layout:
+        //
+        // A 1x4 cell of Rhs is stored in d0--d1 (q0).
+        // A 12x1 block of 3 4x1 cells of Lhs is stored in d2--d7
+        // (q1--q3).
+        // A 12x4 block of accumulators is stored in q4--q15.
+        //
+        //                   +-----+-----+-----+-----+
+        //             Rhs   |d0[0]|d0[1]|d1[0]|d1[1]|
+        //                   +-----+-----+-----+-----+
+        //
+        //                   |     |     |     |     |
+        //
+        //  Lhs              |     |     |     |     |
+        //
+        //  +--+- - - - - -  +-----+-----+-----+-----+
+        //  |d2|             | q4  | q5  | q6  | q7  |
+        //  |d2|             | q4  | q5  | q6  | q7  |
+        //  |d3|             | q4  | q5  | q6  | q7  |
+        //  |d3|             | q4  | q5  | q6  | q7  |
+        //  +--+- - - - - -  +-----+-----+-----+-----+
+        //  |d4|             | q8  | q9  | q10 | q11 |
+        //  |d4|             | q8  | q9  | q10 | q11 |
+        //  |d5|             | q8  | q9  | q10 | q11 |
+        //  |d5|             | q8  | q9  | q10 | q11 |
+        //  +--+ - - - - - - +-----+-----+-----+-----+
+        //  |d6|             | q12 | q13 | q14 | q15 |
+        //  |d6|             | q12 | q13 | q14 | q15 |
+        //  |d7|             | q12 | q13 | q14 | q15 |
+        //  |d7|             | q12 | q13 | q14 | q15 |
+        //  +--+- - - - - -  +-----+-----+-----+-----+
+        //
+        //                            Accumulator
+
+        // Load Rhs cell
+        "vldr d0, [%[rhs_ptr]]\n"
+        "ldr r2, [%[rhs_ptr], #8]\n"
+        "ldr r3, [%[rhs_ptr], #12]\n"
+
+        // Load 1st Lhs Cell
+        "vld1.32 {d2, d3}, [%[lhs_ptr]]\n"
+
+        // Loop head - handling 2 levels of depth at once
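+        // This is the A53 body above unrolled to cover two levels of depth
+        // per iteration: the pointer advances are doubled (96 bytes of Lhs,
+        // 32 bytes of Rhs per pass) and the subs/bne overhead is halved. In
+        // the "Level of depth 1" block, the loads commented "of next
+        // iteration" actually fetch the operands of the "Level of depth 2"
+        // block of the same iteration (their offsets lie below the
+        // per-iteration strides).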
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Level of depth 1
+
+        "vldr d4, [%[lhs_ptr], #32]\n"  // Load 1st half of 2nd Lhs cell
+        "vmov d1, r2, r3\n"             // Prepare 2nd half of Rhs cell
+        "vmla.f32 q4, q1, d0[0]\n"      // Multiply 1st Lhs cell with column 0
+        "ldr r2, [%[lhs_ptr], #40]\n"   // Load 2nd half of 2nd Lhs cell, part 1
+        "vmla.f32 q5, q1, d0[1]\n"      // Multiply 1st Lhs cell with column 1
+        "ldr r3, [%[lhs_ptr], #44]\n"   // Load 2nd half of 2nd Lhs cell, part 2
+        "vmla.f32 q6, q1, d1[0]\n"      // Multiply 1st Lhs cell with column 2
+
+        "vldr d6, [%[lhs_ptr], #64]\n"  // Load 1st half of 3rd Lhs cell
+        "vmov d5, r2, r3\n"             // Prepare 2nd half of 2nd Lhs cell
+        "vmla.f32 q7, q1, d1[1]\n"      // Multiply 1st Lhs cell with column 3
+        "ldr r2, [%[lhs_ptr], #72]\n"   // Load 2nd half of 3rd Lhs cell, part 1
+        "vmla.f32 q8, q2, d0[0]\n"      // Multiply 2nd Lhs cell with column 0
+        "ldr r3, [%[lhs_ptr], #76]\n"   // Load 2nd half of 3rd Lhs cell, part 2
+        "vmla.f32 q9, q2, d0[1]\n"      // Multiply 2nd Lhs cell with column 1
+
+        "vldr d2, [%[lhs_ptr], #16]\n"  // Load 1st half of 1st Lhs cell of next
+        // iteration
+        "vmov d7, r2, r3\n"            // Prepare 2nd half of 3rd Lhs cell
+        "vmla.f32 q10, q2, d1[0]\n"    // Multiply 2nd Lhs cell with column 2
+        "ldr r2, [%[lhs_ptr], #24]\n"  // Load 2nd half of 1st Lhs cell of next
+        // iter, part 1
+        "vmla.f32 q12, q3, d0[0]\n"    // Multiply 3rd Lhs cell with column 0
+        "ldr r3, [%[lhs_ptr], #28]\n"  // Load 2nd half of 1st Lhs cell of next
+        // iter, part 2
+        "vmla.f32 q13, q3, d0[1]\n"  // Multiply 3rd Lhs cell with column 1
+
+        "vldr d0, [%[rhs_ptr], #16]\n"  // Load 1st half of Rhs cell of next
+        // iteration
+        "vmov d3, r2, r3\n"  // Prepare 2nd half of 1st Lhs cell of next
+        // iteration
+        "vmla.f32 q11, q2, d1[1]\n"    // Multiply 2nd Lhs cell with column 3
+        "ldr r2, [%[rhs_ptr], #24]\n"  // Load 2nd half of Rhs cell of next
+        // iteration, part 1
+        "vmla.f32 q14, q3, d1[0]\n"    // Multiply 3rd Lhs cell with column 2
+        "ldr r3, [%[rhs_ptr], #28]\n"  // Load 2nd half of Rhs cell of next
+        // iteration, part 2
+        "vmla.f32 q15, q3, d1[1]\n"  // Multiply 3rd Lhs cell with column 3
+
+        // Level of depth 2
+        "vldr d4, [%[lhs_ptr], #48]\n"  // Load 1st half of 2nd Lhs cell
+        "vmov d1, r2, r3\n"             // Prepare 2nd half of Rhs cell
+        "vmla.f32 q4, q1, d0[0]\n"      // Multiply 1st Lhs cell with column 0
+        "ldr r2, [%[lhs_ptr], #56]\n"   // Load 2nd half of 2nd Lhs cell, part 1
+        "vmla.f32 q5, q1, d0[1]\n"      // Multiply 1st Lhs cell with column 1
+        "ldr r3, [%[lhs_ptr], #60]\n"   // Load 2nd half of 2nd Lhs cell, part 2
+        "vmla.f32 q6, q1, d1[0]\n"      // Multiply 1st Lhs cell with column 2
+        "subs %[depth], #2\n"           // Decrement depth counter
+
+        "vldr d6, [%[lhs_ptr], #80]\n"  // Load 1st half of 3rd Lhs cell
+        "vmov d5, r2, r3\n"             // Prepare 2nd half of 2nd Lhs cell
+        "vmla.f32 q7, q1, d1[1]\n"      // Multiply 1st Lhs cell with column 3
+        "ldr r2, [%[lhs_ptr], #88]\n"   // Load 2nd half of 3rd Lhs cell, part 1
+        "vmla.f32 q8, q2, d0[0]\n"      // Multiply 2nd Lhs cell with column 0
+        "ldr r3, [%[lhs_ptr], #92]\n"   // Load 2nd half of 3rd Lhs cell, part 2
+        "vmla.f32 q9, q2, d0[1]\n"      // Multiply 2nd Lhs cell with column 1
+        "add %[rhs_ptr], %[rhs_ptr], #32\n"  // Move forward by 1 Rhs cell
+
+        "vldr d2, [%[lhs_ptr], #96]\n"  // Load 1st half of 1st Lhs cell of next
+        // iteration
+        "vmov d7, r2, r3\n"             // Prepare 2nd half of 3rd Lhs cell
+        "vmla.f32 q10, q2, d1[0]\n"     // Multiply 2nd Lhs cell with column 2
+        "ldr r2, [%[lhs_ptr], #104]\n"  // Load 2nd half of 1st Lhs cell of next
+        // iter, part 1
+        "vmla.f32 q12, q3, d0[0]\n"     // Multiply 3rd Lhs cell with column 0
+        "ldr r3, [%[lhs_ptr], #108]\n"  // Load 2nd half of 1st Lhs cell of next
+        // iter, part 2
+        "vmla.f32 q13, q3, d0[1]\n"  // Multiply 3rd Lhs cell with column 1
+        "add %[lhs_ptr], %[lhs_ptr], #96\n"  // Move forward by 3 Lhs cells
+
+        "vldr d0, [%[rhs_ptr]]\n"  // Load 1st half of Rhs cell of next
+        // iteration
+        "vmov d3, r2, r3\n"  // Prepare 2nd half of 1st Lhs cell of next
+        // iteration
+        "vmla.f32 q11, q2, d1[1]\n"   // Multiply 2nd Lhs cell with column 3
+        "ldr r2, [%[rhs_ptr], #8]\n"  // Load 2nd half of Rhs cell of next
+        // iteration, part 1
+        "vmla.f32 q14, q3, d1[0]\n"    // Multiply 3rd Lhs cell with column 2
+        "ldr r3, [%[rhs_ptr], #12]\n"  // Load 2nd half of Rhs cell of next
+        // iteration, part 2
+        "vmla.f32 q15, q3, d1[1]\n"  // Multiply 3rd Lhs cell with column 3
+
+        // Loop branch.  This will dual issue in vmla cycle 3 of the 4th block.
+        //"bne loop_%=\n"
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Store accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vst1.32 {d8, d9},   [r0]!\n"
+        "vst1.32 {d16, d17}, [r0]!\n"
+        "vst1.32 {d24, d25}, [r0]!\n"
+        "vst1.32 {d10, d11}, [r0]!\n"
+        "vst1.32 {d18, d19}, [r0]!\n"
+        "vst1.32 {d26, d27}, [r0]!\n"
+        "vst1.32 {d12, d13}, [r0]!\n"
+        "vst1.32 {d20, d21}, [r0]!\n"
+        "vst1.32 {d28, d29}, [r0]!\n"
+        "vst1.32 {d14, d15}, [r0]!\n"
+        "vst1.32 {d22, d23}, [r0]!\n"
+        "vst1.32 {d30, d31}, [r0]!\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "r0", "r2", "r3", "d0", "d1", "d2", "d3", "d4", "d5",
+        "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16",
+        "d17", "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26",
+        "d27", "d28", "d29", "d30", "d31");
+  }
+};
+
+// This rotating variant performs well when permutations (vext) can be
+// dual-issued with arithmetic instructions.
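+// Sketch of the scheme, as read from the code below: inside the loop, plain
+// vector-by-vector vmla is used instead of vmla-by-scalar. q4/q8/q12
+// accumulate lhs[i]*rhs[i] per lane, q5/q9/q13 accumulate lhs[i]*rhs[(i+1)%4]
+// with the Rhs register rotated by one lane via vext, and so on. The
+// TRANSPOSE/ROTATE macros convert the standard column-major accumulators into
+// this rotated layout before the loop (arguments 1, 2, 3) and restore the
+// standard layout before storing (arguments 3, 2, 1).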
+struct NEON_32bit_GEMM_Float32_MLA_Rotating {
+  typedef float OperandType;
+  typedef float AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vld1.32 {d8, d9},   [r0]!\n"
+        "vld1.32 {d16, d17}, [r0]!\n"
+        "vld1.32 {d24, d25}, [r0]!\n"
+        "vld1.32 {d10, d11}, [r0]!\n"
+        "vld1.32 {d18, d19}, [r0]!\n"
+        "vld1.32 {d26, d27}, [r0]!\n"
+        "vld1.32 {d12, d13}, [r0]!\n"
+        "vld1.32 {d20, d21}, [r0]!\n"
+        "vld1.32 {d28, d29}, [r0]!\n"
+        "vld1.32 {d14, d15}, [r0]!\n"
+        "vld1.32 {d22, d23}, [r0]!\n"
+        "vld1.32 {d30, d31}, [r0]!\n"
+
+#define NEON_32BIT_ROTATING_FLOAT_KERNEL_TRANSPOSE_ACCUMULATOR_CELLS \
+  "vtrn.32 q4, q5\n"                                                 \
+  "vtrn.32 q6, q7\n"                                                 \
+  "vswp d9, d12\n"                                                   \
+  "vswp d11, d14\n"                                                  \
+  "vtrn.32 q8, q9\n"                                                 \
+  "vtrn.32 q10, q11\n"                                               \
+  "vswp d17, d20\n"                                                  \
+  "vswp d19, d22\n"                                                  \
+  "vtrn.32 q12, q13\n"                                               \
+  "vtrn.32 q14, q15\n"                                               \
+  "vswp d25, d28\n"                                                  \
+  "vswp d27, d30\n"
+
+#define NEON_32BIT_ROTATING_FLOAT_KERNEL_ROTATE_ACCUMULATOR_CELLS(a, b, c) \
+  NEON_32BIT_ROTATING_FLOAT_KERNEL_TRANSPOSE_ACCUMULATOR_CELLS             \
+  "vext.32 q5, q5, q5, #" #a                                               \
+  "\n"                                                                     \
+  "vext.32 q6, q6, q6, #" #b                                               \
+  "\n"                                                                     \
+  "vext.32 q7, q7, q7, #" #c                                               \
+  "\n"                                                                     \
+  "vext.32 q9, q9, q9, #" #a                                               \
+  "\n"                                                                     \
+  "vext.32 q10, q10, q10, #" #b                                            \
+  "\n"                                                                     \
+  "vext.32 q11, q11, q11, #" #c                                            \
+  "\n"                                                                     \
+  "vext.32 q13, q13, q13, #" #a                                            \
+  "\n"                                                                     \
+  "vext.32 q14, q14, q14, #" #b                                            \
+  "\n"                                                                     \
+  "vext.32 q15, q15, q15, #" #c                                            \
+  "\n" NEON_32BIT_ROTATING_FLOAT_KERNEL_TRANSPOSE_ACCUMULATOR_CELLS
+
+        NEON_32BIT_ROTATING_FLOAT_KERNEL_ROTATE_ACCUMULATOR_CELLS(1, 2, 3)
+
+        //"loop_%=:\n"
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Load 1 Rhs cell of size 1x4
+        "vld1.32 {d0, d1}, [%[rhs_ptr]]!\n"
+
+        // Load 3 Lhs cells of size 4x1 each
+        "vld1.32 {d2, d3}, [%[lhs_ptr]]!\n"
+        "vld1.32 {d4, d5}, [%[lhs_ptr]]!\n"
+        "vld1.32 {d6, d7}, [%[lhs_ptr]]!\n"
+
+        // Multiply-accumulate
+        "vmla.f32 q4, q1, q0\n"
+        "vmla.f32 q8, q2, q0\n"
+        "vmla.f32 q12, q3, q0\n"
+        "vext.f32 q0, q0, q0, #1\n"
+        "vmla.f32 q5, q1, q0\n"
+        "vmla.f32 q9, q2, q0\n"
+        "vmla.f32 q13, q3, q0\n"
+        "vext.f32 q0, q0, q0, #1\n"
+        "vmla.f32 q6, q1, q0\n"
+        "vmla.f32 q10, q2, q0\n"
+        "vmla.f32 q14, q3, q0\n"
+        "vext.f32 q0, q0, q0, #1\n"
+        "vmla.f32 q7, q1, q0\n"
+        "vmla.f32 q11, q2, q0\n"
+        "vmla.f32 q15, q3, q0\n"
+
+        // Loop. Decrement loop index (depth) by 1, since we just handled 1
+        // level of depth.
+        "subs %[depth], #1\n"
+        //"bne loop_%=\n"
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Store accumulators
+        "mov r0, %[accum_ptr]\n"
+
+        NEON_32BIT_ROTATING_FLOAT_KERNEL_ROTATE_ACCUMULATOR_CELLS(3, 2, 1)
+
+            "vst1.32 {d8, d9},   [r0]!\n"
+            "vst1.32 {d16, d17}, [r0]!\n"
+            "vst1.32 {d24, d25}, [r0]!\n"
+            "vst1.32 {d10, d11}, [r0]!\n"
+            "vst1.32 {d18, d19}, [r0]!\n"
+            "vst1.32 {d26, d27}, [r0]!\n"
+            "vst1.32 {d12, d13}, [r0]!\n"
+            "vst1.32 {d20, d21}, [r0]!\n"
+            "vst1.32 {d28, d29}, [r0]!\n"
+            "vst1.32 {d14, d15}, [r0]!\n"
+            "vst1.32 {d22, d23}, [r0]!\n"
+            "vst1.32 {d30, d31}, [r0]!\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
+        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
+        "d28", "d29", "d30", "d31");
+  }
+};
+
+// This rotating variant performs well when permutations (vext) can be
+// dual-issued with arithmetic instructions. It is relevant as the rotating
+// approach removes the need for multiply-with-scalar instructions, and ARMv7
+// FMA does not have a with-scalar variant.
+struct NEON_32bit_GEMM_Float32_FMA_Rotating {
+  typedef float OperandType;
+  typedef float AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vld1.32 {d8, d9},   [r0]!\n"
+        "vld1.32 {d16, d17}, [r0]!\n"
+        "vld1.32 {d24, d25}, [r0]!\n"
+        "vld1.32 {d10, d11}, [r0]!\n"
+        "vld1.32 {d18, d19}, [r0]!\n"
+        "vld1.32 {d26, d27}, [r0]!\n"
+        "vld1.32 {d12, d13}, [r0]!\n"
+        "vld1.32 {d20, d21}, [r0]!\n"
+        "vld1.32 {d28, d29}, [r0]!\n"
+        "vld1.32 {d14, d15}, [r0]!\n"
+        "vld1.32 {d22, d23}, [r0]!\n"
+        "vld1.32 {d30, d31}, [r0]!\n"
+
+        NEON_32BIT_ROTATING_FLOAT_KERNEL_ROTATE_ACCUMULATOR_CELLS(1, 2, 3)
+
+        //"loop_%=:\n"
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Load 1 Rhs cell of size 1x4
+        "vld1.32 {d0, d1}, [%[rhs_ptr]]!\n"
+
+        // Load 3 Lhs cells of size 4x1 each
+        "vld1.32 {d2, d3}, [%[lhs_ptr]]!\n"
+        "vld1.32 {d4, d5}, [%[lhs_ptr]]!\n"
+        "vld1.32 {d6, d7}, [%[lhs_ptr]]!\n"
+
+        // Multiply-accumulate
+        "vfma.f32 q4, q1, q0\n"
+        "vfma.f32 q8, q2, q0\n"
+        "vfma.f32 q12, q3, q0\n"
+        "vext.f32 q0, q0, q0, #1\n"
+        "vfma.f32 q5, q1, q0\n"
+        "vfma.f32 q9, q2, q0\n"
+        "vfma.f32 q13, q3, q0\n"
+        "vext.f32 q0, q0, q0, #1\n"
+        "vfma.f32 q6, q1, q0\n"
+        "vfma.f32 q10, q2, q0\n"
+        "vfma.f32 q14, q3, q0\n"
+        "vext.f32 q0, q0, q0, #1\n"
+        "vfma.f32 q7, q1, q0\n"
+        "vfma.f32 q11, q2, q0\n"
+        "vfma.f32 q15, q3, q0\n"
+
+        // Loop. Decrement loop index (depth) by 1, since we just handled 1
+        // level of depth.
+        "subs %[depth], #1\n"
+        //"bne loop_%=\n"
+        "bne " GEMMLOWP_LABEL_LOOP "b\n"
+
+        NEON_32BIT_ROTATING_FLOAT_KERNEL_ROTATE_ACCUMULATOR_CELLS(3, 2, 1)
+
+        // Store accumulators
+        "mov r0, %[accum_ptr]\n"
+        "vst1.32 {d8, d9},   [r0]!\n"
+        "vst1.32 {d16, d17}, [r0]!\n"
+        "vst1.32 {d24, d25}, [r0]!\n"
+        "vst1.32 {d10, d11}, [r0]!\n"
+        "vst1.32 {d18, d19}, [r0]!\n"
+        "vst1.32 {d26, d27}, [r0]!\n"
+        "vst1.32 {d12, d13}, [r0]!\n"
+        "vst1.32 {d20, d21}, [r0]!\n"
+        "vst1.32 {d28, d29}, [r0]!\n"
+        "vst1.32 {d14, d15}, [r0]!\n"
+        "vst1.32 {d22, d23}, [r0]!\n"
+        "vst1.32 {d30, d31}, [r0]!\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "r0", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
+        "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17",
+        "d18", "d19", "d20", "d21", "d22", "d23", "d24", "d25", "d26", "d27",
+        "d28", "d29", "d30", "d31");
+  }
+};
+
+#endif  // __arm__
+
+#ifdef __aarch64__
+
+// This is the current standard kernel in gemmlowp, see:
+// https://github.com/google/gemmlowp/blob/b1e2a29ff866680028f3080efc244e10e8dd7f46/internal/kernel_neon.h#L646
+struct NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators {
+  typedef std::uint8_t OperandType;
+  typedef std::uint32_t AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 2, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 2, CellOrder::DepthMajor>, 2> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load 1 Rhs cell of size 2x8
+        "ld1 {v5.8b}, [%[rhs_ptr]], #8\n"
+        "ld1 {v6.8b}, [%[rhs_ptr]], #8\n"
+
+        // Load 3 Lhs cells of size 4x2 each
+        "ld1 {v2.8b}, [%[lhs_ptr]], #8\n"
+        "ld1 {v3.8b}, [%[lhs_ptr]], #8\n"
+        "ld1 {v4.8b}, [%[lhs_ptr]], #8\n"
+
+        "subs %w[depth], %w[depth], #2\n"
+
+        // Load accumulators
+        "mov x0, %[accum_ptr]\n"
+        "ld1 {v8.16b}, [x0], #16\n"
+        "ld1 {v16.16b}, [x0], #16\n"
+        "ld1 {v24.16b}, [x0], #16\n"
+        "ld1 {v9.16b}, [x0], #16\n"
+        "ld1 {v17.16b}, [x0], #16\n"
+        "ld1 {v25.16b}, [x0], #16\n"
+        "ld1 {v10.16b}, [x0], #16\n"
+        "ld1 {v18.16b}, [x0], #16\n"
+        "ld1 {v26.16b}, [x0], #16\n"
+        "ld1 {v11.16b}, [x0], #16\n"
+        "ld1 {v19.16b}, [x0], #16\n"
+        "ld1 {v27.16b}, [x0], #16\n"
+        "ld1 {v12.16b}, [x0], #16\n"
+        "ld1 {v20.16b}, [x0], #16\n"
+        "ld1 {v28.16b}, [x0], #16\n"
+        "ld1 {v13.16b}, [x0], #16\n"
+        "ld1 {v21.16b}, [x0], #16\n"
+        "ld1 {v29.16b}, [x0], #16\n"
+        "ld1 {v14.16b}, [x0], #16\n"
+        "ld1 {v22.16b}, [x0], #16\n"
+        "ld1 {v30.16b}, [x0], #16\n"
+        "ld1 {v15.16b}, [x0], #16\n"
+        "ld1 {v23.16b}, [x0], #16\n"
+        "ld1 {v31.16b}, [x0], #16\n"
+
+        "beq " GEMMLOWP_LABEL_AFTER_LOOP "f\n"
+
+        //"loop_%=:\n"
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Overview of register layout:
+        //
+        // A 2x8 block of 2 2x4 cells of Rhs is stored in 16bit in v0--v1.
+        // A 12x2 block of 3 4x2 cells of Lhs is stored in 16bit in v2--v4.
+        // A 12x8 block of accumulators is stored in 32bit in v8--v31.
+        //
+        //                         +--------+--------+-----+--------+--------+
+        //                         |v0.h[0] |v0.h[1] | ... |v1.h[2] |v1.h[3] |
+        //                    Rhs  +--------+--------+-----+--------+--------+
+        //                         |v0.h[4] |v0.h[5] | ... |v1.h[6] |v1.h[7] |
+        //                         +--------+--------+-----+--------+--------+
+        //
+        //                         |        |        |     |        |        |
+        //
+        //    Lhs                  |        |        |     |        |        |
+        //
+        //  +-------+-------+ - -  +--------+--------+-----+--------+--------+
+        //  |v2.h[0]|v2.h[4]|      |v8.s[0] |v9.s[0] | ... |v14.s[0]|v15.s[0]|
+        //  |v2.h[1]|v2.h[5]|      |v8.s[1] |v9.s[1] | ... |v14.s[1]|v15.s[1]|
+        //  |v2.h[2]|v2.h[6]|      |v8.s[2] |v9.s[2] | ... |v14.s[2]|v15.s[2]|
+        //  |v2.h[3]|v2.h[7]|      |v8.s[3] |v9.s[3] | ... |v14.s[3]|v15.s[3]|
+        //  +-------+-------+ - -  +--------+--------+-----+--------+--------+
+        //  |v3.h[0]|v3.h[4]|      |v16.s[0]|v17.s[0]| ... |v22.s[0]|v23.s[0]|
+        //  |v3.h[1]|v3.h[5]|      |v16.s[1]|v17.s[1]| ... |v22.s[1]|v23.s[1]|
+        //  |v3.h[2]|v3.h[6]|      |v16.s[2]|v17.s[2]| ... |v22.s[2]|v23.s[2]|
+        //  |v3.h[3]|v3.h[7]|      |v16.s[3]|v17.s[3]| ... |v22.s[3]|v23.s[3]|
+        //  +-------+-------+ - -  +--------+--------+-----+--------+--------+
+        //  |v4.h[0]|v4.h[4]|      |v24.s[0]|v25.s[0]| ... |v30.s[0]|v31.s[0]|
+        //  |v4.h[1]|v4.h[5]|      |v24.s[1]|v25.s[1]| ... |v30.s[1]|v31.s[1]|
+        //  |v4.h[2]|v4.h[6]|      |v24.s[2]|v25.s[2]| ... |v30.s[2]|v31.s[2]|
+        //  |v4.h[3]|v4.h[7]|      |v24.s[3]|v25.s[3]| ... |v30.s[3]|v31.s[3]|
+        //  +-------+-------+ - -  +--------+--------+-----+--------+--------+
+        //
+        //                            Accumulator
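+        //
+        // Each iteration handles two levels of depth: the low halves of the
+        // expanded 16-bit vectors (h[0..3]) carry the first level and the
+        // high halves (h[4..7]) the second, so the umlal instructions
+        // accumulate depth level 0, the umlal2 instructions depth level 1,
+        // and the depth counter is decremented by 2 per iteration.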
+
+        // Expand Lhs/Rhs cells to 16 bit.
+        "uxtl v0.8h, v5.8b\n"
+        "ld1 {v5.8b}, [%[rhs_ptr]], #8\n"
+        "uxtl v1.8h, v6.8b\n"
+        "ld1 {v6.8b}, [%[rhs_ptr]], #8\n"
+        "uxtl v2.8h, v2.8b\n"
+        "uxtl v3.8h, v3.8b\n"
+        "uxtl v4.8h, v4.8b\n"
+
+        // Multiply-accumulate, top third
+        "umlal v8.4s, v2.4h, v0.h[0]\n"
+        "umlal v9.4s, v2.4h, v0.h[1]\n"
+        "umlal v10.4s, v2.4h, v0.h[2]\n"
+        "umlal v11.4s, v2.4h, v0.h[3]\n"
+        "umlal v12.4s, v2.4h, v1.h[0]\n"
+        "umlal v13.4s, v2.4h, v1.h[1]\n"
+        "umlal v14.4s, v2.4h, v1.h[2]\n"
+        "umlal v15.4s, v2.4h, v1.h[3]\n"
+        "umlal2 v8.4s, v2.8h, v0.h[4]\n"
+        "umlal2 v9.4s, v2.8h, v0.h[5]\n"
+        "umlal2 v10.4s, v2.8h, v0.h[6]\n"
+        "umlal2 v11.4s, v2.8h, v0.h[7]\n"
+        "umlal2 v12.4s, v2.8h, v1.h[4]\n"
+        "umlal2 v13.4s, v2.8h, v1.h[5]\n"
+        "umlal2 v14.4s, v2.8h, v1.h[6]\n"
+        "umlal2 v15.4s, v2.8h, v1.h[7]\n"
+        "ld1 {v2.8b}, [%[lhs_ptr]], #8\n"
+
+        // Multiply-accumulate, middle third
+        "umlal v16.4s, v3.4h, v0.h[0]\n"
+        "umlal v17.4s, v3.4h, v0.h[1]\n"
+        "umlal v18.4s, v3.4h, v0.h[2]\n"
+        "umlal v19.4s, v3.4h, v0.h[3]\n"
+        "umlal v20.4s, v3.4h, v1.h[0]\n"
+        "umlal v21.4s, v3.4h, v1.h[1]\n"
+        "umlal v22.4s, v3.4h, v1.h[2]\n"
+        "umlal v23.4s, v3.4h, v1.h[3]\n"
+        "umlal2 v16.4s, v3.8h, v0.h[4]\n"
+        "umlal2 v17.4s, v3.8h, v0.h[5]\n"
+        "umlal2 v18.4s, v3.8h, v0.h[6]\n"
+        "umlal2 v19.4s, v3.8h, v0.h[7]\n"
+        "umlal2 v20.4s, v3.8h, v1.h[4]\n"
+        "umlal2 v21.4s, v3.8h, v1.h[5]\n"
+        "umlal2 v22.4s, v3.8h, v1.h[6]\n"
+        "umlal2 v23.4s, v3.8h, v1.h[7]\n"
+        "ld1 {v3.8b}, [%[lhs_ptr]], #8\n"
+
+        "subs %w[depth], %w[depth], #2\n"
+
+        // Multiply-accumulate, bottom third
+        "umlal v24.4s, v4.4h, v0.h[0]\n"
+        "umlal v25.4s, v4.4h, v0.h[1]\n"
+        "umlal v26.4s, v4.4h, v0.h[2]\n"
+        "umlal v27.4s, v4.4h, v0.h[3]\n"
+        "umlal v28.4s, v4.4h, v1.h[0]\n"
+        "umlal v29.4s, v4.4h, v1.h[1]\n"
+        "umlal v30.4s, v4.4h, v1.h[2]\n"
+        "umlal v31.4s, v4.4h, v1.h[3]\n"
+        "umlal2 v24.4s, v4.8h, v0.h[4]\n"
+        "umlal2 v25.4s, v4.8h, v0.h[5]\n"
+        "umlal2 v26.4s, v4.8h, v0.h[6]\n"
+        "umlal2 v27.4s, v4.8h, v0.h[7]\n"
+        "umlal2 v28.4s, v4.8h, v1.h[4]\n"
+        "umlal2 v29.4s, v4.8h, v1.h[5]\n"
+        "umlal2 v30.4s, v4.8h, v1.h[6]\n"
+        "umlal2 v31.4s, v4.8h, v1.h[7]\n"
+        "ld1 {v4.8b}, [%[lhs_ptr]], #8\n"
+
+        "bne " GEMMLOWP_LABEL_LOOP "b\n"
+
+        GEMMLOWP_LABEL_AFTER_LOOP
+        ":\n"
+
+        // Expand Lhs/Rhs cells to 16 bit.
+        "uxtl v0.8h, v5.8b\n"
+        "uxtl v1.8h, v6.8b\n"
+        "uxtl v2.8h, v2.8b\n"
+        "uxtl v3.8h, v3.8b\n"
+        "uxtl v4.8h, v4.8b\n"
+
+        // Multiply-accumulate, level of depth 0
+        "umlal v8.4s, v2.4h, v0.h[0]\n"
+        "umlal v9.4s, v2.4h, v0.h[1]\n"
+        "umlal v10.4s, v2.4h, v0.h[2]\n"
+        "umlal v11.4s, v2.4h, v0.h[3]\n"
+        "umlal v12.4s, v2.4h, v1.h[0]\n"
+        "umlal v13.4s, v2.4h, v1.h[1]\n"
+        "umlal v14.4s, v2.4h, v1.h[2]\n"
+        "umlal v15.4s, v2.4h, v1.h[3]\n"
+        "umlal v16.4s, v3.4h, v0.h[0]\n"
+        "umlal v17.4s, v3.4h, v0.h[1]\n"
+        "umlal v18.4s, v3.4h, v0.h[2]\n"
+        "umlal v19.4s, v3.4h, v0.h[3]\n"
+        "umlal v20.4s, v3.4h, v1.h[0]\n"
+        "umlal v21.4s, v3.4h, v1.h[1]\n"
+        "umlal v22.4s, v3.4h, v1.h[2]\n"
+        "umlal v23.4s, v3.4h, v1.h[3]\n"
+        "umlal v24.4s, v4.4h, v0.h[0]\n"
+        "umlal v25.4s, v4.4h, v0.h[1]\n"
+        "umlal v26.4s, v4.4h, v0.h[2]\n"
+        "umlal v27.4s, v4.4h, v0.h[3]\n"
+        "umlal v28.4s, v4.4h, v1.h[0]\n"
+        "umlal v29.4s, v4.4h, v1.h[1]\n"
+        "umlal v30.4s, v4.4h, v1.h[2]\n"
+        "umlal v31.4s, v4.4h, v1.h[3]\n"
+
+        // Multiply-accumulate, level of depth 1
+        "umlal2 v8.4s, v2.8h, v0.h[4]\n"
+        "umlal2 v9.4s, v2.8h, v0.h[5]\n"
+        "umlal2 v10.4s, v2.8h, v0.h[6]\n"
+        "umlal2 v11.4s, v2.8h, v0.h[7]\n"
+        "umlal2 v12.4s, v2.8h, v1.h[4]\n"
+        "umlal2 v13.4s, v2.8h, v1.h[5]\n"
+        "umlal2 v14.4s, v2.8h, v1.h[6]\n"
+        "umlal2 v15.4s, v2.8h, v1.h[7]\n"
+        "umlal2 v16.4s, v3.8h, v0.h[4]\n"
+        "umlal2 v17.4s, v3.8h, v0.h[5]\n"
+        "umlal2 v18.4s, v3.8h, v0.h[6]\n"
+        "umlal2 v19.4s, v3.8h, v0.h[7]\n"
+        "umlal2 v20.4s, v3.8h, v1.h[4]\n"
+        "umlal2 v21.4s, v3.8h, v1.h[5]\n"
+        "umlal2 v22.4s, v3.8h, v1.h[6]\n"
+        "umlal2 v23.4s, v3.8h, v1.h[7]\n"
+        "umlal2 v24.4s, v4.8h, v0.h[4]\n"
+        "umlal2 v25.4s, v4.8h, v0.h[5]\n"
+        "umlal2 v26.4s, v4.8h, v0.h[6]\n"
+        "umlal2 v27.4s, v4.8h, v0.h[7]\n"
+        "umlal2 v28.4s, v4.8h, v1.h[4]\n"
+        "umlal2 v29.4s, v4.8h, v1.h[5]\n"
+        "umlal2 v30.4s, v4.8h, v1.h[6]\n"
+        "umlal2 v31.4s, v4.8h, v1.h[7]\n"
+
+        // Store accumulators
+        "mov x0, %[accum_ptr]\n"
+        "st1 {v8.16b}, [x0], #16\n"
+        "st1 {v16.16b}, [x0], #16\n"
+        "st1 {v24.16b}, [x0], #16\n"
+        "st1 {v9.16b}, [x0], #16\n"
+        "st1 {v17.16b}, [x0], #16\n"
+        "st1 {v25.16b}, [x0], #16\n"
+        "st1 {v10.16b}, [x0], #16\n"
+        "st1 {v18.16b}, [x0], #16\n"
+        "st1 {v26.16b}, [x0], #16\n"
+        "st1 {v11.16b}, [x0], #16\n"
+        "st1 {v19.16b}, [x0], #16\n"
+        "st1 {v27.16b}, [x0], #16\n"
+        "st1 {v12.16b}, [x0], #16\n"
+        "st1 {v20.16b}, [x0], #16\n"
+        "st1 {v28.16b}, [x0], #16\n"
+        "st1 {v13.16b}, [x0], #16\n"
+        "st1 {v21.16b}, [x0], #16\n"
+        "st1 {v29.16b}, [x0], #16\n"
+        "st1 {v14.16b}, [x0], #16\n"
+        "st1 {v22.16b}, [x0], #16\n"
+        "st1 {v30.16b}, [x0], #16\n"
+        "st1 {v15.16b}, [x0], #16\n"
+        "st1 {v23.16b}, [x0], #16\n"
+        "st1 {v31.16b}, [x0], #16\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
+        "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",
+        "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25", "v26", "v27",
+        "v28", "v29", "v30", "v31");
+  }
+};
+
+// Faster kernel by ARM. It does not expand the 8-bit operands to 16 bit
+// before multiplying. Tuned for A57. Compare to
+// NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand.
+struct NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand_A57 {
+  typedef std::uint8_t OperandType;
+  typedef std::uint32_t AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<5, 16, CellOrder::WidthMajor>, 1>,
+      KernelSideFormat<CellFormat<4, 16, CellOrder::WidthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    static const int kLhsWidth = Format::Lhs::kWidth;
+    static const int kRhsWidth = Format::Rhs::kWidth;
+    AccumulatorType rowmajor_accumulator_buffer[kLhsWidth * kRhsWidth];
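+    // The asm below accumulates partial sums into this small row-major
+    // buffer (one int32 per (Lhs row, Rhs column) pair); they are folded into
+    // the caller's column-major accumulators in plain C++ after the asm block.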
+    asm volatile(
+        // Clear aggregators
+        "dup v12.4s, wzr\n"
+        "dup v13.4s, wzr\n"
+        "dup v14.4s, wzr\n"
+        "dup v15.4s, wzr\n"
+        "dup v16.4s, wzr\n"
+        "dup v17.4s, wzr\n"
+        "dup v18.4s, wzr\n"
+        "dup v19.4s, wzr\n"
+        "dup v20.4s, wzr\n"
+        "dup v21.4s, wzr\n"
+        "dup v22.4s, wzr\n"
+        "dup v23.4s, wzr\n"
+        "dup v24.4s, wzr\n"
+        "dup v25.4s, wzr\n"
+        "dup v26.4s, wzr\n"
+        "dup v27.4s, wzr\n"
+        "dup v28.4s, wzr\n"
+        "dup v29.4s, wzr\n"
+        "dup v30.4s, wzr\n"
+        "dup v31.4s, wzr\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Overview of register layout:
+        //
+        // A 4x16 block of Rhs is stored in 8 bit in v0--v3.
+        // A 5x16 block of Lhs is cycled through v4 and v5 in 8 bit.
+        //
+        // A 4x5 block of aggregators is stored in v12-v31 (as 4x32 bit
+        // components which would need to be added at the end)
+        //
+        // The Lhs vectors are multiplied by the Rhs vectors with a widening
+        // multiply to produce an intermediate result which is stored in
+        // v6-v11.  Each intermediate result is 8x16 bits so this happens
+        // twice for each Lhs/Rhs combination (once with UMULL for elements
+        // 0-7 and once with UMULL2 for elements 8-15).
+        //
+        // UADALP is used to accumulate these intermediate results into the
+        // result aggregators.
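+        // Roughly: umull v6.8h, v0.8b, v4.8b forms eight 16-bit products
+        // lhs[k]*rhs[k] for k = 0..7 of the depth dimension, and
+        // uadalp v12.4s, v6.8h adds each adjacent pair of those products into
+        // one 32-bit lane of the aggregator. Each aggregator lane therefore
+        // holds a partial sum over a subset of the depth, and the addp
+        // reductions after the loop collapse the four lanes into the final
+        // scalar results.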
+        //
+        //
+        //
+        //                               +--------+--------+--------+--------+
+        //                               |v0.b[0] |v1.b[0] |v2.b[0] |v3.b[0] |
+        //                          Rhs  +--------+--------+--------+--------+
+        //                               |  ...   |  ...   |  ...   |  ...   |
+        //                               +--------+--------+--------+--------|
+        //                               |v0.b[15]|v1.b[15]|v2.b[15]|v3.b[15]|
+        //                               +--------+--------+--------+--------+
+        //
+        //                               |        |        |        |        |
+        //
+        //    Lhs                        |        |        |        |        |
+        //
+        //  +-------+-----+--------+ - - +--------+--------+--------+--------+
+        //  |v4.b[0]| ... |v4.b[15]|     | v12.4s | v13.4s | v14.4s | v15.4s |
+        //  |v5.b[0]| ... |v5.b[15]|     | v16.4s | v17.4s | v18.4s | v19.4s |
+        //  |v4.b[0]| ... |v4.b[15]|     | v20.4s | v21.4s | v22.4s | v23.4s |
+        //  |v5.b[0]| ... |v5.b[15]|     | v24.4s | v25.4s | v26.4s | v27.4s |
+        //  |v4.b[0]| ... |v4.b[15]|     | v28.4s | v29.4s | v30.4s | v31.4s |
+        //  +-------+--------------+ - - +--------+--------+--------+--------+
+        //
+        //                                                Accumulator
+        //
+        //
+        // Further possible optimisations (not tried):
+        //   - Move early loads into previous iteration (see Float32_WithScalar
+        //     for example).
+        //   - Unroll loop 2x to alternate more smoothly between v4 and v5.
+        //   - A different number of temporary registers might work better.
+        //   - Pairing umull with corresponding umull2 might allow better
+        //     register loading (e.g. at the start of the loop).
+        //   - Interleaving umull{2} and uadalp even more aggressively might
+        //     help (not sure about latency vs. dispatch rate).
+        //
+        //
+        // Start loading Rhs - further loads are interleaved amongst the
+        // multiplies for better dispatch on A57.
+        "ld1 {v0.16b}, [%[rhs_ptr]], #16\n"
+
+        // Load first Lhs vector - further loads are interleaved amongst the
+        // multiplies
+        "ld1 {v4.16b}, [%[lhs_ptr]], #16\n"
+
+        "umull    v6.8h,  v0.8b,  v4.8b\n"
+        "ld1 {v1.16b}, [%[rhs_ptr]], #16\n"  // 2nd RHS element
+        "umull    v7.8h,  v1.8b,  v4.8b\n"
+        "ld1 {v2.16b}, [%[rhs_ptr]], #16\n"  // 3rd RHS element
+        "umull    v8.8h,  v2.8b,  v4.8b\n"
+        "ld1 {v3.16b}, [%[rhs_ptr]], #16\n"  // 4th RHS element
+        "umull    v9.8h,  v3.8b,  v4.8b\n"
+        "umull2  v10.8h, v0.16b, v4.16b\n"
+        "umull2  v11.8h, v1.16b, v4.16b\n"
+        "ld1 {v5.16b}, [%[lhs_ptr]], #16\n"  // 2nd LHS element
+
+        "uadalp  v12.4s, v6.8h\n"
+        "umull2   v6.8h, v2.16b, v4.16b\n"
+        "uadalp  v13.4s, v7.8h\n"
+        "umull2   v7.8h, v3.16b, v4.16b\n"
+        "ld1 {v4.16b}, [%[lhs_ptr]], #16\n"  // 1st LHS element done - Reuse v4
+        // for 3rd LHS element
+        "uadalp  v14.4s, v8.8h\n"
+        "umull    v8.8h,  v0.8b,  v5.8b\n"
+        "uadalp  v15.4s, v9.8h\n"
+        "umull    v9.8h,  v1.8b,  v5.8b\n"
+        "uadalp  v12.4s, v10.8h\n"
+        "umull   v10.8h,  v2.8b,  v5.8b\n"
+        "uadalp  v13.4s, v11.8h\n"
+        "umull   v11.8h,  v3.8b,  v5.8b\n"
+
+        "uadalp  v14.4s, v6.8h\n"
+        "umull2   v6.8h, v0.16b, v5.16b\n"
+        "uadalp  v15.4s, v7.8h\n"
+        "umull2   v7.8h, v1.16b, v5.16b\n"
+        "uadalp  v16.4s, v8.8h\n"
+        "umull2   v8.8h, v2.16b, v5.16b\n"
+        "uadalp  v17.4s, v9.8h\n"
+        "umull2   v9.8h, v3.16b, v5.16b\n"
+        "ld1 {v5.16b}, [%[lhs_ptr]], #16\n"  // 2nd LHS element done - Reuse v5
+        // for 4th LHS element
+        "uadalp  v18.4s, v10.8h\n"
+        "umull   v10.8h,  v0.8b,  v4.8b\n"
+        "uadalp  v19.4s, v11.8h\n"
+        "umull   v11.8h,  v1.8b,  v4.8b\n"
+
+        "uadalp  v16.4s, v6.8h\n"
+        "umull    v6.8h,  v2.8b,  v4.8b\n"
+        "uadalp  v17.4s, v7.8h\n"
+        "umull    v7.8h,  v3.8b,  v4.8b\n"
+        "uadalp  v18.4s, v8.8h\n"
+        "umull2   v8.8h, v0.16b, v4.16b\n"
+        "uadalp  v19.4s, v9.8h\n"
+        "umull2   v9.8h, v1.16b, v4.16b\n"
+        "uadalp  v20.4s, v10.8h\n"
+        "umull2  v10.8h, v2.16b, v4.16b\n"
+        "uadalp  v21.4s, v11.8h\n"
+        "umull2  v11.8h, v3.16b, v4.16b\n"
+        "ld1 {v4.16b}, [%[lhs_ptr]], #16\n"  // 3rd LHS element done - Reuse v4
+        // for 5th LHS element
+
+        "uadalp v22.4s, v6.8h\n"
+        "umull    v6.8h,  v0.8b,  v5.8b\n"
+        "uadalp v23.4s, v7.8h\n"
+        "umull    v7.8h,  v1.8b,  v5.8b\n"
+        "uadalp v20.4s, v8.8h\n"
+        "umull    v8.8h,  v2.8b,  v5.8b\n"
+        "uadalp v21.4s, v9.8h\n"
+        "umull    v9.8h,  v3.8b,  v5.8b\n"
+        "uadalp v22.4s, v10.8h\n"
+        "umull2  v10.8h, v0.16b, v5.16b\n"
+        "uadalp v23.4s, v11.8h\n"
+        "umull2  v11.8h, v1.16b, v5.16b\n"
+
+        "uadalp v24.4s, v6.8h\n"
+        "umull2   v6.8h,  v2.16b, v5.16b\n"
+        "uadalp v25.4s, v7.8h\n"
+        "umull2   v7.8h,  v3.16b, v5.16b\n"
+        "uadalp v26.4s, v8.8h\n"
+        "umull    v8.8h,  v0.8b,  v4.8b\n"
+        "uadalp v27.4s, v9.8h\n"
+        "umull    v9.8h,  v1.8b,  v4.8b\n"
+        "uadalp v24.4s, v10.8h\n"
+        "umull   v10.8h,  v2.8b,  v4.8b\n"
+        "uadalp v25.4s, v11.8h\n"
+        "umull   v11.8h,  v3.8b,  v4.8b\n"
+
+        "uadalp v26.4s, v6.8h\n"
+        "umull2   v6.8h, v0.16b, v4.16b\n"
+        "uadalp v27.4s, v7.8h\n"
+        "umull2   v7.8h, v1.16b, v4.16b\n"
+        "uadalp v28.4s, v8.8h\n"
+        "umull2   v8.8h, v2.16b, v4.16b\n"
+        "uadalp v29.4s, v9.8h\n"
+        "umull2   v9.8h, v3.16b, v4.16b\n"
+        "uadalp v30.4s, v10.8h\n"
+        "uadalp v31.4s, v11.8h\n"
+
+        "uadalp v28.4s, v6.8h\n"
+        "uadalp v29.4s, v7.8h\n"
+        // Loop. Decrement loop index (depth) by 16, since we just handled
+        // 16 levels of depth.  Do this subs a bit before the end of the loop
+        // for better dispatch on A57.
+        "subs %w[depth], %w[depth], #16\n"
+        "uadalp v30.4s, v8.8h\n"
+        "uadalp v31.4s, v9.8h\n"
+
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Reduce aggregators horizontally
+        "addp v0.4s, v12.4s, v13.4s\n"
+        "addp v1.4s, v14.4s, v15.4s\n"
+        "addp v2.4s, v16.4s, v17.4s\n"
+        "addp v3.4s, v18.4s, v19.4s\n"
+        "addp v4.4s, v20.4s, v21.4s\n"
+        "addp v5.4s, v22.4s, v23.4s\n"
+        "addp v6.4s, v24.4s, v25.4s\n"
+        "addp v7.4s, v26.4s, v27.4s\n"
+        "addp v8.4s, v28.4s, v29.4s\n"
+        "addp v9.4s, v30.4s, v31.4s\n"
+
+        "addp v10.4s, v0.4s, v1.4s\n"
+        "addp v11.4s, v2.4s, v3.4s\n"
+        "addp v12.4s, v4.4s, v5.4s\n"
+        "addp v13.4s, v6.4s, v7.4s\n"
+        "addp v14.4s, v8.4s, v9.4s\n"
+
+        "mov x0, %[rowmajor_accumulator_buffer]\n"
+        "st1 {v10.16b}, [x0], #16\n"
+        "st1 {v11.16b}, [x0], #16\n"
+        "st1 {v12.16b}, [x0], #16\n"
+        "st1 {v13.16b}, [x0], #16\n"
+        "st1 {v14.16b}, [x0], #16\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [rowmajor_accumulator_buffer] "r"(rowmajor_accumulator_buffer)
+        :  // clobbers
+        "cc", "memory", "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
+        "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",
+        "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25", "v26", "v27",
+        "v28", "v29", "v30", "v31");
+
+    // accumulate row-major accumulators into global (column-major) accumulators
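+    // rowmajor_accumulator_buffer[r + l * kRhsWidth] holds the partial sum
+    // for Lhs row l and Rhs column r; the destination is column-major with
+    // kLhsWidth rows, hence the l + kLhsWidth * r indexing.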
+    for (int l = 0; l < kLhsWidth; l++) {
+      for (int r = 0; r < kRhsWidth; r++) {
+        accum_ptr[l + kLhsWidth * r] +=
+            rowmajor_accumulator_buffer[r + l * kRhsWidth];
+      }
+    }
+  }
+};
+
+// Fast kernel operating on int8 operands.
+// It is assumed that one of the two int8 operands only takes values
+// in [-127, 127], while the other may freely range in [-128, 127].
+// The issue with both operands taking the value -128 is that:
+// -128*-128 + -128*-128 == 32768, which overflows int16 (INT16_MAX == 32767).
+// Every other expression a*b + c*d, for any int8 a,b,c,d, fits in int16
+// range. That is the basic idea of this kernel.
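+// For instance, with one operand restricted to [-127, 127] as assumed here,
+// any pair of products is bounded in magnitude by 2 * 127 * 128 = 32512,
+// which is <= 32767, so it always fits.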
+struct NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits {
+  typedef std::int8_t OperandType;
+  typedef std::int32_t AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 16, CellOrder::WidthMajor>, 1>,
+      KernelSideFormat<CellFormat<4, 16, CellOrder::WidthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    std::size_t start_depth = 123;
+    std::size_t run_depth = depth;
+    std::size_t dst_col_stride = 4;
+    AccumulatorType* dst_ptr = accum_ptr;
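+    // start_depth and dst_col_stride are passed as asm operands below but are
+    // not referenced by this standalone kernel's asm body (presumably kept to
+    // mirror the corresponding production kernel's interface).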
+    asm volatile(
+        // Overview of register layout:
+        //
+        // A 4x16 block of Rhs is stored in 8 bit in v0--v3.
+        // A 4x16 block of Lhs is stored in 8 bit in v4--v7.
+        //
+        // A 4x4 block of accumulators is stored in v16-v31 (as 4x32 bit
+        // components which need to be horizontally-added at the end)
+        //
+        // The Lhs vectors are multiplied by the Rhs vectors with a widening
+        // multiply over the 8 first levels of depth, producing int16x8
+        // vectors of products for each position in the accumulator matrix.
+        // Here comes the special trick: since the operands are signed int8,
+        // their range being [ -2^7 , 2^7 ), their products are in range
+        // [ -2^14 , 2^14 - 1 ), meaning that we can add two such values
+        // without any risk of overflowing int16.
+        // We thus proceed with the 8 next levels of depth, multiplying
+        // again Lhs by Rhs, accumulating into this existing int16x8 vector.
+        //
+        // Only then, having processed 16 levels of depth, do we need to
+        // horizontally add these int16x8 accumulators into the final
+        // int32x4 accumulators.
+        //
+        // As we do not have enough registers to store all 16 int16x8
+        // temporary-16bit-accumulators, we have them cycle through v12--v15.
+        //
+        //
+        // Register layout (ignoring the v8--v15 temporary 16bit accumulators):
+        //
+        //                               +--------+--------+--------+--------+
+        //                               |v0.b[0] |v1.b[0] |v2.b[0] |v3.b[0] |
+        //                          Rhs  +--------+--------+--------+--------+
+        //                               |  ...   |  ...   |  ...   |  ...   |
+        //                               +--------+--------+--------+--------|
+        //                               |v0.b[15]|v1.b[15]|v2.b[15]|v3.b[15]|
+        //                               +--------+--------+--------+--------+
+        //
+        //                               |        |        |        |        |
+        //
+        //    Lhs                        |        |        |        |        |
+        //
+        //  +-------+-----+--------+ - - +--------+--------+--------+--------+
+        //  |v4.b[0]| ... |v4.b[15]|     | v16.4s | v17.4s | v18.4s | v19.4s |
+        //  |v5.b[0]| ... |v5.b[15]|     | v20.4s | v21.4s | v22.4s | v23.4s |
+        //  |v6.b[0]| ... |v6.b[15]|     | v24.4s | v25.4s | v26.4s | v27.4s |
+        //  |v7.b[0]| ... |v7.b[15]|     | v28.4s | v29.4s | v30.4s | v31.4s |
+        //  +-------+--------------+ - - +--------+--------+--------+--------+
+        //
+        //                                                Accumulator
+        //
+
+        // Clear accumulators
+        "ld1 {v0.16b}, [%[rhs_ptr]], #16\n"
+        "dup v16.4s, wzr\n"
+        "ld1 {v1.16b}, [%[rhs_ptr]], #16\n"
+        "dup v17.4s, wzr\n"
+        "ld1 {v4.16b}, [%[lhs_ptr]], #16\n"
+        "dup v18.4s, wzr\n"
+        "ld1 {v5.16b}, [%[lhs_ptr]], #16\n"
+        "dup v19.4s, wzr\n"
+        "ld1 {v6.16b}, [%[lhs_ptr]], #16\n"
+        "dup v20.4s, wzr\n"
+        "ld1 {v7.16b}, [%[lhs_ptr]], #16\n"
+        "dup v21.4s, wzr\n"
+        "ld1 {v2.16b}, [%[rhs_ptr]], #16\n"
+        "dup v22.4s, wzr\n"
+        "ld1 {v3.16b}, [%[rhs_ptr]], #16\n"
+        "dup v23.4s, wzr\n"
+        "subs %[run_depth], %[run_depth], #16\n"
+        "dup v24.4s, wzr\n"
+        "mov x0, %[dst_ptr]\n"
+        "dup v25.4s, wzr\n"
+        "dup v26.4s, wzr\n"
+        "dup v27.4s, wzr\n"
+        "dup v28.4s, wzr\n"
+        "dup v29.4s, wzr\n"
+        "dup v30.4s, wzr\n"
+        "dup v31.4s, wzr\n"
+
+        "smull    v12.8h,  v0.8b,  v4.8b\n"
+        "smull    v13.8h,  v1.8b,  v4.8b\n"
+        "smull    v14.8h,  v0.8b,  v5.8b\n"
+        "smull    v15.8h,  v1.8b,  v5.8b\n"
+        "smlal2   v12.8h,  v0.16b,  v4.16b\n"
+        "smlal2   v13.8h,  v1.16b,  v4.16b\n"
+        "smlal2   v14.8h,  v0.16b,  v5.16b\n"
+        "smlal2   v15.8h,  v1.16b,  v5.16b\n"
+
+        "beq " GEMMLOWP_LABEL_AFTER_LOOP "f\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        "subs %[run_depth], %[run_depth], #16\n"
+
+        "sadalp  v16.4s, v12.8h\n"
+        "smull    v12.8h,  v0.8b,  v6.8b\n"
+        "sadalp  v17.4s, v13.8h\n"
+        "smull    v13.8h,  v0.8b,  v7.8b\n"
+        "sadalp  v20.4s, v14.8h\n"
+        "smull    v14.8h,  v1.8b,  v6.8b\n"
+        "sadalp  v21.4s, v15.8h\n"
+        "smull    v15.8h,  v1.8b,  v7.8b\n"
+        "smlal2   v12.8h,  v0.16b,  v6.16b\n"
+        "smlal2   v13.8h,  v0.16b,  v7.16b\n"
+        "ld1 {v0.16b}, [%[rhs_ptr]], #16\n"
+        "smlal2   v14.8h,  v1.16b,  v6.16b\n"
+        "smlal2   v15.8h,  v1.16b,  v7.16b\n"
+        "ld1 {v1.16b}, [%[rhs_ptr]], #16\n"
+        "sadalp  v24.4s, v12.8h\n"
+        "smull    v12.8h,  v2.8b,  v4.8b\n"
+        "sadalp  v28.4s, v13.8h\n"
+        "smull    v13.8h,  v3.8b,  v4.8b\n"
+        "sadalp  v25.4s, v14.8h\n"
+        "smull    v14.8h,  v2.8b,  v5.8b\n"
+        "sadalp  v29.4s, v15.8h\n"
+        "smull    v15.8h,  v3.8b,  v5.8b\n"
+        "smlal2   v12.8h,  v2.16b,  v4.16b\n"
+        "smlal2   v13.8h,  v3.16b,  v4.16b\n"
+        "ld1 {v4.16b}, [%[lhs_ptr]], #16\n"
+        "smlal2   v14.8h,  v2.16b,  v5.16b\n"
+        "smlal2   v15.8h,  v3.16b,  v5.16b\n"
+        "ld1 {v5.16b}, [%[lhs_ptr]], #16\n"
+        "sadalp  v18.4s, v12.8h\n"
+        "smull    v12.8h,  v2.8b,  v6.8b\n"
+        "sadalp  v19.4s, v13.8h\n"
+        "smull    v13.8h,  v2.8b,  v7.8b\n"
+        "sadalp  v22.4s, v14.8h\n"
+        "smull    v14.8h,  v3.8b,  v6.8b\n"
+        "sadalp  v23.4s, v15.8h\n"
+        "smull    v15.8h,  v3.8b,  v7.8b\n"
+        "smlal2   v12.8h,  v2.16b,  v6.16b\n"
+        "smlal2   v13.8h,  v2.16b,  v7.16b\n"
+        "ld1 {v2.16b}, [%[rhs_ptr]], #16\n"
+        "smlal2   v14.8h,  v3.16b,  v6.16b\n"
+        "ld1 {v6.16b}, [%[lhs_ptr]], #16\n"
+        "smlal2   v15.8h,  v3.16b,  v7.16b\n"
+        "ld1 {v7.16b}, [%[lhs_ptr]], #16\n"
+        "sadalp  v26.4s, v12.8h\n"
+        "ld1 {v3.16b}, [%[rhs_ptr]], #16\n"
+        "smull    v12.8h,  v0.8b,  v4.8b\n"
+        "sadalp  v30.4s, v13.8h\n"
+        "smull    v13.8h,  v1.8b,  v4.8b\n"
+        "sadalp  v27.4s, v14.8h\n"
+        "smull    v14.8h,  v0.8b,  v5.8b\n"
+        "sadalp  v31.4s, v15.8h\n"
+        "smull    v15.8h,  v1.8b,  v5.8b\n"
+        "smlal2   v12.8h,  v0.16b,  v4.16b\n"
+        "smlal2   v13.8h,  v1.16b,  v4.16b\n"
+        "smlal2   v14.8h,  v0.16b,  v5.16b\n"
+        "smlal2   v15.8h,  v1.16b,  v5.16b\n"
+
+        "bne " GEMMLOWP_LABEL_LOOP "b\n"
+
+        GEMMLOWP_LABEL_AFTER_LOOP
+        ":\n"
+
+        // Load accumulators from memory
+        "ld1 {v8.16b}, [x0], #16\n"
+        "ld1 {v9.16b}, [x0], #16\n"
+        "ld1 {v10.16b}, [x0], #16\n"
+        "ld1 {v11.16b}, [x0], #16\n"
+        "mov x0, %[dst_ptr]\n"
+
+        // Do the remaining arithmetic for the 16 last levels of depths.
+        // All the operands are already loaded.
+        "sadalp  v16.4s, v12.8h\n"
+        "smull    v12.8h,  v0.8b,  v6.8b\n"
+        "sadalp  v17.4s, v13.8h\n"
+        "smull    v13.8h,  v0.8b,  v7.8b\n"
+        "sadalp  v20.4s, v14.8h\n"
+        "smull    v14.8h,  v1.8b,  v6.8b\n"
+        "sadalp  v21.4s, v15.8h\n"
+        "smull    v15.8h,  v1.8b,  v7.8b\n"
+        "smlal2   v12.8h,  v0.16b,  v6.16b\n"
+        "smlal2   v13.8h,  v0.16b,  v7.16b\n"
+        "smlal2   v14.8h,  v1.16b,  v6.16b\n"
+        "smlal2   v15.8h,  v1.16b,  v7.16b\n"
+        "sadalp  v24.4s, v12.8h\n"
+        "smull    v12.8h,  v2.8b,  v4.8b\n"
+        "sadalp  v28.4s, v13.8h\n"
+        "smull    v13.8h,  v3.8b,  v4.8b\n"
+        "sadalp  v25.4s, v14.8h\n"
+        "smull    v14.8h,  v2.8b,  v5.8b\n"
+        "sadalp  v29.4s, v15.8h\n"
+        "smull    v15.8h,  v3.8b,  v5.8b\n"
+        "smlal2   v12.8h,  v2.16b,  v4.16b\n"
+        "smlal2   v13.8h,  v3.16b,  v4.16b\n"
+        "smlal2   v14.8h,  v2.16b,  v5.16b\n"
+        "smlal2   v15.8h,  v3.16b,  v5.16b\n"
+        "sadalp  v18.4s, v12.8h\n"
+        "smull    v12.8h,  v2.8b,  v6.8b\n"
+        "sadalp  v19.4s, v13.8h\n"
+        "smull    v13.8h,  v2.8b,  v7.8b\n"
+        "sadalp  v22.4s, v14.8h\n"
+        "smull    v14.8h,  v3.8b,  v6.8b\n"
+        "sadalp  v23.4s, v15.8h\n"
+        "smull    v15.8h,  v3.8b,  v7.8b\n"
+        "smlal2   v12.8h,  v2.16b,  v6.16b\n"
+        "smlal2   v13.8h,  v2.16b,  v7.16b\n"
+        "smlal2   v14.8h,  v3.16b,  v6.16b\n"
+        "smlal2   v15.8h,  v3.16b,  v7.16b\n"
+        "sadalp  v26.4s, v12.8h\n"
+        "sadalp  v30.4s, v13.8h\n"
+        "sadalp  v27.4s, v14.8h\n"
+        "sadalp  v31.4s, v15.8h\n"
+
+        // Reduce aggregators horizontally
+        "addp v0.4s, v16.4s, v20.4s\n"
+        "addp v1.4s, v17.4s, v21.4s\n"
+        "addp v2.4s, v18.4s, v22.4s\n"
+        "addp v3.4s, v19.4s, v23.4s\n"
+        "addp v4.4s, v24.4s, v28.4s\n"
+        "addp v5.4s, v25.4s, v29.4s\n"
+        "addp v6.4s, v26.4s, v30.4s\n"
+        "addp v7.4s, v27.4s, v31.4s\n"
+
+        "addp v12.4s, v0.4s, v4.4s\n"
+        "addp v13.4s, v1.4s, v5.4s\n"
+        "addp v14.4s, v2.4s, v6.4s\n"
+        "addp v15.4s, v3.4s, v7.4s\n"
+
+        // Add to the accumulators loaded from memory
+        "add v8.4s, v8.4s, v12.4s\n"
+        "add v9.4s, v9.4s, v13.4s\n"
+        "add v10.4s, v10.4s, v14.4s\n"
+        "add v11.4s, v11.4s, v15.4s\n"
+
+        // Store accumulators back to memory
+        "st1 {v8.16b}, [x0], #16\n"
+        "st1 {v9.16b}, [x0], #16\n"
+        "st1 {v10.16b}, [x0], #16\n"
+        "st1 {v11.16b}, [x0], #16\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [dst_ptr] "+r"(dst_ptr), [run_depth] "+r"(run_depth),
+        [dst_col_stride] "+r"(dst_col_stride)
+        :  // inputs
+        [start_depth] "r"(start_depth)
+        :  // clobbers
+        "cc", "memory", "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
+        "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",
+        "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25", "v26", "v27",
+        "v28", "v29", "v30", "v31");
+  }
+};
+
+// We don't actually use int32*int32 in production. This is just an
+// experiment to help dissociate the effect of integer-vs-float from the
+// effect of operand width.
+struct NEON_64bit_GEMM_Int32_WithScalar {
+  typedef std::int32_t OperandType;
+  typedef std::int32_t AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 2> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov x0, %[accum_ptr]\n"
+        "ld1 {v8.16b}, [x0], #16\n"
+        "ld1 {v16.16b}, [x0], #16\n"
+        "ld1 {v24.16b}, [x0], #16\n"
+        "ld1 {v9.16b}, [x0], #16\n"
+        "ld1 {v17.16b}, [x0], #16\n"
+        "ld1 {v25.16b}, [x0], #16\n"
+        "ld1 {v10.16b}, [x0], #16\n"
+        "ld1 {v18.16b}, [x0], #16\n"
+        "ld1 {v26.16b}, [x0], #16\n"
+        "ld1 {v11.16b}, [x0], #16\n"
+        "ld1 {v19.16b}, [x0], #16\n"
+        "ld1 {v27.16b}, [x0], #16\n"
+        "ld1 {v12.16b}, [x0], #16\n"
+        "ld1 {v20.16b}, [x0], #16\n"
+        "ld1 {v28.16b}, [x0], #16\n"
+        "ld1 {v13.16b}, [x0], #16\n"
+        "ld1 {v21.16b}, [x0], #16\n"
+        "ld1 {v29.16b}, [x0], #16\n"
+        "ld1 {v14.16b}, [x0], #16\n"
+        "ld1 {v22.16b}, [x0], #16\n"
+        "ld1 {v30.16b}, [x0], #16\n"
+        "ld1 {v15.16b}, [x0], #16\n"
+        "ld1 {v23.16b}, [x0], #16\n"
+        "ld1 {v31.16b}, [x0], #16\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Load 2 Rhs cells of size 1x4 each
+        "ld1 {v0.4s}, [%[rhs_ptr]], #16\n"
+        "ld1 {v1.4s}, [%[rhs_ptr]], #16\n"
+
+        // Load 3 Lhs cells of size 4x1 each
+        "ld1 {v2.4s}, [%[lhs_ptr]], #16\n"
+        "ld1 {v3.4s}, [%[lhs_ptr]], #16\n"
+        "ld1 {v4.4s}, [%[lhs_ptr]], #16\n"
+
+        // Multiply-accumulate
+        "mla v8.4s, v2.4s, v0.s[0]\n"
+        "mla v9.4s, v2.4s, v0.s[1]\n"
+        "mla v10.4s, v2.4s, v0.s[2]\n"
+        "mla v11.4s, v2.4s, v0.s[3]\n"
+        "mla v12.4s, v2.4s, v1.s[0]\n"
+        "mla v13.4s, v2.4s, v1.s[1]\n"
+        "mla v14.4s, v2.4s, v1.s[2]\n"
+        "mla v15.4s, v2.4s, v1.s[3]\n"
+        "mla v16.4s, v3.4s, v0.s[0]\n"
+        "mla v17.4s, v3.4s, v0.s[1]\n"
+        "mla v18.4s, v3.4s, v0.s[2]\n"
+        "mla v19.4s, v3.4s, v0.s[3]\n"
+        "mla v20.4s, v3.4s, v1.s[0]\n"
+        "mla v21.4s, v3.4s, v1.s[1]\n"
+        "mla v22.4s, v3.4s, v1.s[2]\n"
+        "mla v23.4s, v3.4s, v1.s[3]\n"
+        "mla v24.4s, v4.4s, v0.s[0]\n"
+        "mla v25.4s, v4.4s, v0.s[1]\n"
+        "mla v26.4s, v4.4s, v0.s[2]\n"
+        "mla v27.4s, v4.4s, v0.s[3]\n"
+        "mla v28.4s, v4.4s, v1.s[0]\n"
+        "mla v29.4s, v4.4s, v1.s[1]\n"
+        "mla v30.4s, v4.4s, v1.s[2]\n"
+        "mla v31.4s, v4.4s, v1.s[3]\n"
+
+        // Loop. Decrement loop index (depth) by 1, since we just handled 1
+        // level of depth.
+        "subs %w[depth], %w[depth], #1\n"
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Store accumulators
+        "mov x0, %[accum_ptr]\n"
+        "st1 {v8.16b}, [x0], #16\n"
+        "st1 {v16.16b}, [x0], #16\n"
+        "st1 {v24.16b}, [x0], #16\n"
+        "st1 {v9.16b}, [x0], #16\n"
+        "st1 {v17.16b}, [x0], #16\n"
+        "st1 {v25.16b}, [x0], #16\n"
+        "st1 {v10.16b}, [x0], #16\n"
+        "st1 {v18.16b}, [x0], #16\n"
+        "st1 {v26.16b}, [x0], #16\n"
+        "st1 {v11.16b}, [x0], #16\n"
+        "st1 {v19.16b}, [x0], #16\n"
+        "st1 {v27.16b}, [x0], #16\n"
+        "st1 {v12.16b}, [x0], #16\n"
+        "st1 {v20.16b}, [x0], #16\n"
+        "st1 {v28.16b}, [x0], #16\n"
+        "st1 {v13.16b}, [x0], #16\n"
+        "st1 {v21.16b}, [x0], #16\n"
+        "st1 {v29.16b}, [x0], #16\n"
+        "st1 {v14.16b}, [x0], #16\n"
+        "st1 {v22.16b}, [x0], #16\n"
+        "st1 {v30.16b}, [x0], #16\n"
+        "st1 {v15.16b}, [x0], #16\n"
+        "st1 {v23.16b}, [x0], #16\n"
+        "st1 {v31.16b}, [x0], #16\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
+        "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",
+        "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25", "v26", "v27",
+        "v28", "v29", "v30", "v31");
+  }
+};
+
+// Not very efficient kernel, just an experiment to see what we can do
+// without using NEON multiply-with-scalar instructions.
+struct NEON_64bit_GEMM_Float32_WithVectorDuplicatingScalar {
+  typedef float OperandType;
+  typedef float AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 2> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov x0, %[accum_ptr]\n"
+        "ld1 {v8.16b}, [x0], #16\n"
+        "ld1 {v16.16b}, [x0], #16\n"
+        "ld1 {v24.16b}, [x0], #16\n"
+        "ld1 {v9.16b}, [x0], #16\n"
+        "ld1 {v17.16b}, [x0], #16\n"
+        "ld1 {v25.16b}, [x0], #16\n"
+        "ld1 {v10.16b}, [x0], #16\n"
+        "ld1 {v18.16b}, [x0], #16\n"
+        "ld1 {v26.16b}, [x0], #16\n"
+        "ld1 {v11.16b}, [x0], #16\n"
+        "ld1 {v19.16b}, [x0], #16\n"
+        "ld1 {v27.16b}, [x0], #16\n"
+        "ld1 {v12.16b}, [x0], #16\n"
+        "ld1 {v20.16b}, [x0], #16\n"
+        "ld1 {v28.16b}, [x0], #16\n"
+        "ld1 {v13.16b}, [x0], #16\n"
+        "ld1 {v21.16b}, [x0], #16\n"
+        "ld1 {v29.16b}, [x0], #16\n"
+        "ld1 {v14.16b}, [x0], #16\n"
+        "ld1 {v22.16b}, [x0], #16\n"
+        "ld1 {v30.16b}, [x0], #16\n"
+        "ld1 {v15.16b}, [x0], #16\n"
+        "ld1 {v23.16b}, [x0], #16\n"
+        "ld1 {v31.16b}, [x0], #16\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Load 2 Rhs cells of size 1x4 each
+        "ld1 {v5.4s}, [%[rhs_ptr]], #16\n"
+        "ld1 {v6.4s}, [%[rhs_ptr]], #16\n"
+
+        // Load 3 Lhs cells of size 4x1 each
+        "ld1 {v2.4s}, [%[lhs_ptr]], #16\n"
+        "ld1 {v3.4s}, [%[lhs_ptr]], #16\n"
+        "ld1 {v4.4s}, [%[lhs_ptr]], #16\n"
+
+        // Multiply-accumulate
+        "dup v0.4s, v5.s[0]\n"
+        "dup v1.4s, v5.s[1]\n"
+        "fmla v8.4s, v2.4s, v0.4s\n"
+        "fmla v16.4s, v3.4s, v0.4s\n"
+        "fmla v24.4s, v4.4s, v0.4s\n"
+        "fmla v9.4s, v2.4s, v1.4s\n"
+        "fmla v17.4s, v3.4s, v1.4s\n"
+        "fmla v25.4s, v4.4s, v1.4s\n"
+        "dup v0.4s, v5.s[2]\n"
+        "dup v1.4s, v5.s[3]\n"
+        "fmla v10.4s, v2.4s, v0.4s\n"
+        "fmla v18.4s, v3.4s, v0.4s\n"
+        "fmla v26.4s, v4.4s, v0.4s\n"
+        "fmla v11.4s, v2.4s, v1.4s\n"
+        "fmla v19.4s, v3.4s, v1.4s\n"
+        "fmla v27.4s, v4.4s, v1.4s\n"
+        "dup v0.4s, v6.s[0]\n"
+        "dup v1.4s, v6.s[1]\n"
+        "fmla v12.4s, v2.4s, v0.4s\n"
+        "fmla v20.4s, v3.4s, v0.4s\n"
+        "fmla v28.4s, v4.4s, v0.4s\n"
+        "fmla v13.4s, v2.4s, v1.4s\n"
+        "fmla v21.4s, v3.4s, v1.4s\n"
+        "fmla v29.4s, v4.4s, v1.4s\n"
+        "dup v0.4s, v6.s[2]\n"
+        "dup v1.4s, v6.s[3]\n"
+        "fmla v14.4s, v2.4s, v0.4s\n"
+        "fmla v22.4s, v3.4s, v0.4s\n"
+        "fmla v30.4s, v4.4s, v0.4s\n"
+        "fmla v15.4s, v2.4s, v1.4s\n"
+        "fmla v23.4s, v3.4s, v1.4s\n"
+        "fmla v31.4s, v4.4s, v1.4s\n"
+
+        // Loop. Decrement loop index (depth) by 1, since we just handled 1
+        // level of depth.
+        "subs %w[depth], %w[depth], #1\n"
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Store accumulators
+        "mov x0, %[accum_ptr]\n"
+        "st1 {v8.16b}, [x0], #16\n"
+        "st1 {v16.16b}, [x0], #16\n"
+        "st1 {v24.16b}, [x0], #16\n"
+        "st1 {v9.16b}, [x0], #16\n"
+        "st1 {v17.16b}, [x0], #16\n"
+        "st1 {v25.16b}, [x0], #16\n"
+        "st1 {v10.16b}, [x0], #16\n"
+        "st1 {v18.16b}, [x0], #16\n"
+        "st1 {v26.16b}, [x0], #16\n"
+        "st1 {v11.16b}, [x0], #16\n"
+        "st1 {v19.16b}, [x0], #16\n"
+        "st1 {v27.16b}, [x0], #16\n"
+        "st1 {v12.16b}, [x0], #16\n"
+        "st1 {v20.16b}, [x0], #16\n"
+        "st1 {v28.16b}, [x0], #16\n"
+        "st1 {v13.16b}, [x0], #16\n"
+        "st1 {v21.16b}, [x0], #16\n"
+        "st1 {v29.16b}, [x0], #16\n"
+        "st1 {v14.16b}, [x0], #16\n"
+        "st1 {v22.16b}, [x0], #16\n"
+        "st1 {v30.16b}, [x0], #16\n"
+        "st1 {v15.16b}, [x0], #16\n"
+        "st1 {v23.16b}, [x0], #16\n"
+        "st1 {v31.16b}, [x0], #16\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
+        "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",
+        "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25", "v26", "v27",
+        "v28", "v29", "v30", "v31");
+  }
+};
+
+// This is the "most natural" kernel, using NEON multiply-with-scalar
+// instructions.
+struct NEON_64bit_GEMM_Float32_WithScalar {
+  typedef float OperandType;
+  typedef float AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 2> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov x0, %[accum_ptr]\n"
+        "ld1 {v8.16b}, [x0], #16\n"
+        "ld1 {v16.16b}, [x0], #16\n"
+        "ld1 {v24.16b}, [x0], #16\n"
+        "ld1 {v9.16b}, [x0], #16\n"
+        "ld1 {v17.16b}, [x0], #16\n"
+        "ld1 {v25.16b}, [x0], #16\n"
+        "ld1 {v10.16b}, [x0], #16\n"
+        "ld1 {v18.16b}, [x0], #16\n"
+        "ld1 {v26.16b}, [x0], #16\n"
+        "ld1 {v11.16b}, [x0], #16\n"
+        "ld1 {v19.16b}, [x0], #16\n"
+        "ld1 {v27.16b}, [x0], #16\n"
+        "ld1 {v12.16b}, [x0], #16\n"
+        "ld1 {v20.16b}, [x0], #16\n"
+        "ld1 {v28.16b}, [x0], #16\n"
+        "ld1 {v13.16b}, [x0], #16\n"
+        "ld1 {v21.16b}, [x0], #16\n"
+        "ld1 {v29.16b}, [x0], #16\n"
+        "ld1 {v14.16b}, [x0], #16\n"
+        "ld1 {v22.16b}, [x0], #16\n"
+        "ld1 {v30.16b}, [x0], #16\n"
+        "ld1 {v15.16b}, [x0], #16\n"
+        "ld1 {v23.16b}, [x0], #16\n"
+        "ld1 {v31.16b}, [x0], #16\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Load 2 Rhs cells of size 1x4 each
+        "ld1 {v0.4s}, [%[rhs_ptr]], #16\n"
+        "ld1 {v1.4s}, [%[rhs_ptr]], #16\n"
+
+        // Load 3 Lhs cells of size 4x1 each
+        "ld1 {v2.4s}, [%[lhs_ptr]], #16\n"
+        "ld1 {v3.4s}, [%[lhs_ptr]], #16\n"
+        "ld1 {v4.4s}, [%[lhs_ptr]], #16\n"
+
+        // Multiply-accumulate
+        "fmla v8.4s, v2.4s, v0.s[0]\n"
+        "fmla v9.4s, v2.4s, v0.s[1]\n"
+        "fmla v10.4s, v2.4s, v0.s[2]\n"
+        "fmla v11.4s, v2.4s, v0.s[3]\n"
+        "fmla v12.4s, v2.4s, v1.s[0]\n"
+        "fmla v13.4s, v2.4s, v1.s[1]\n"
+        "fmla v14.4s, v2.4s, v1.s[2]\n"
+        "fmla v15.4s, v2.4s, v1.s[3]\n"
+        "fmla v16.4s, v3.4s, v0.s[0]\n"
+        "fmla v17.4s, v3.4s, v0.s[1]\n"
+        "fmla v18.4s, v3.4s, v0.s[2]\n"
+        "fmla v19.4s, v3.4s, v0.s[3]\n"
+        "fmla v20.4s, v3.4s, v1.s[0]\n"
+        "fmla v21.4s, v3.4s, v1.s[1]\n"
+        "fmla v22.4s, v3.4s, v1.s[2]\n"
+        "fmla v23.4s, v3.4s, v1.s[3]\n"
+        "fmla v24.4s, v4.4s, v0.s[0]\n"
+        "fmla v25.4s, v4.4s, v0.s[1]\n"
+        "fmla v26.4s, v4.4s, v0.s[2]\n"
+        "fmla v27.4s, v4.4s, v0.s[3]\n"
+        "fmla v28.4s, v4.4s, v1.s[0]\n"
+        "fmla v29.4s, v4.4s, v1.s[1]\n"
+        "fmla v30.4s, v4.4s, v1.s[2]\n"
+        "fmla v31.4s, v4.4s, v1.s[3]\n"
+
+        // Loop. Decrement loop index (depth) by 1, since we just handled 1
+        // level of depth.
+        "subs %w[depth], %w[depth], #1\n"
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Store accumulators
+        "mov x0, %[accum_ptr]\n"
+        "st1 {v8.16b}, [x0], #16\n"
+        "st1 {v16.16b}, [x0], #16\n"
+        "st1 {v24.16b}, [x0], #16\n"
+        "st1 {v9.16b}, [x0], #16\n"
+        "st1 {v17.16b}, [x0], #16\n"
+        "st1 {v25.16b}, [x0], #16\n"
+        "st1 {v10.16b}, [x0], #16\n"
+        "st1 {v18.16b}, [x0], #16\n"
+        "st1 {v26.16b}, [x0], #16\n"
+        "st1 {v11.16b}, [x0], #16\n"
+        "st1 {v19.16b}, [x0], #16\n"
+        "st1 {v27.16b}, [x0], #16\n"
+        "st1 {v12.16b}, [x0], #16\n"
+        "st1 {v20.16b}, [x0], #16\n"
+        "st1 {v28.16b}, [x0], #16\n"
+        "st1 {v13.16b}, [x0], #16\n"
+        "st1 {v21.16b}, [x0], #16\n"
+        "st1 {v29.16b}, [x0], #16\n"
+        "st1 {v14.16b}, [x0], #16\n"
+        "st1 {v22.16b}, [x0], #16\n"
+        "st1 {v30.16b}, [x0], #16\n"
+        "st1 {v15.16b}, [x0], #16\n"
+        "st1 {v23.16b}, [x0], #16\n"
+        "st1 {v31.16b}, [x0], #16\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
+        "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",
+        "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25", "v26", "v27",
+        "v28", "v29", "v30", "v31");
+  }
+};
+
+// Faster kernel contributed by ARM. Tuned for A57.
+struct NEON_64bit_GEMM_Float32_WithScalar_A57 {
+  typedef float OperandType;
+  typedef float AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 2> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov x0, %[accum_ptr]\n"
+        "ld1 {v8.16b}, [x0], #16\n"
+        "ld1 {v16.16b}, [x0], #16\n"
+        "ld1 {v24.16b}, [x0], #16\n"
+        "ld1 {v9.16b}, [x0], #16\n"
+        "ld1 {v17.16b}, [x0], #16\n"
+        "ld1 {v25.16b}, [x0], #16\n"
+        "ld1 {v10.16b}, [x0], #16\n"
+        "ld1 {v18.16b}, [x0], #16\n"
+        "ld1 {v26.16b}, [x0], #16\n"
+        "ld1 {v11.16b}, [x0], #16\n"
+        "ld1 {v19.16b}, [x0], #16\n"
+        "ld1 {v27.16b}, [x0], #16\n"
+        "ld1 {v12.16b}, [x0], #16\n"
+        "ld1 {v20.16b}, [x0], #16\n"
+        "ld1 {v28.16b}, [x0], #16\n"
+        "ld1 {v13.16b}, [x0], #16\n"
+        "ld1 {v21.16b}, [x0], #16\n"
+        "ld1 {v29.16b}, [x0], #16\n"
+        "ld1 {v14.16b}, [x0], #16\n"
+        "ld1 {v22.16b}, [x0], #16\n"
+        "ld1 {v30.16b}, [x0], #16\n"
+        "ld1 {v15.16b}, [x0], #16\n"
+        "ld1 {v23.16b}, [x0], #16\n"
+        "ld1 {v31.16b}, [x0], #16\n"
+
+        // The start of the loop assumes the first Rhs cell is already loaded,
+        // so load it here for the first iteration.
+        "ld1 {v0.4s}, [%[rhs_ptr]], #16\n"
+
+        // And the same for the first Lhs cell.
+        "ld1 {v2.4s}, [%[lhs_ptr]], #16\n"
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // Start the MACs at the head of the loop - 1st cell from each side
+        // already loaded.
+        "fmla v8.4s, v2.4s, v0.s[0]\n"
+        "fmla v9.4s, v2.4s, v0.s[1]\n"
+        "ld1 {v1.4s}, [%[rhs_ptr]], #16\n"  // Load second Rhs cell.
+        "fmla v10.4s, v2.4s, v0.s[2]\n"
+        "fmla v11.4s, v2.4s, v0.s[3]\n"
+        "ld1 {v3.4s}, [%[lhs_ptr]], #16\n"  // Load second Lhs cell.
+        "fmla v12.4s, v2.4s, v1.s[0]\n"
+        "fmla v13.4s, v2.4s, v1.s[1]\n"
+        "ld1 {v4.4s}, [%[lhs_ptr]], #16\n"  // Load third Lhs cell.
+        "fmla v14.4s, v2.4s, v1.s[2]\n"
+        "fmla v15.4s, v2.4s, v1.s[3]\n"
+        "ld1 {v2.4s}, [%[lhs_ptr]], #16\n"  // Done with first Lhs cell - load
+        // for the next iteration early.
+        "fmla v16.4s, v3.4s, v0.s[0]\n"
+        "fmla v17.4s, v3.4s, v0.s[1]\n"
+        "fmla v18.4s, v3.4s, v0.s[2]\n"
+        "fmla v19.4s, v3.4s, v0.s[3]\n"
+        "fmla v20.4s, v3.4s, v1.s[0]\n"
+        "fmla v21.4s, v3.4s, v1.s[1]\n"
+        "fmla v22.4s, v3.4s, v1.s[2]\n"
+        "fmla v23.4s, v3.4s, v1.s[3]\n"
+        "fmla v24.4s, v4.4s, v0.s[0]\n"
+        "fmla v25.4s, v4.4s, v0.s[1]\n"
+        "fmla v26.4s, v4.4s, v0.s[2]\n"
+        "fmla v27.4s, v4.4s, v0.s[3]\n"
+        "ld1 {v0.4s}, [%[rhs_ptr]], #16\n"  // Done with the first Rhs cell -
+        // load for the next iteration
+        // early.
+        "fmla v28.4s, v4.4s, v1.s[0]\n"
+        "fmla v29.4s, v4.4s, v1.s[1]\n"
+        // Loop. Decrement loop index (depth) by 1, since we just handled
+        // 1 level of depth.  Do this a bit before the end of the loop for
+        // better dispatch on A57.
+        "subs %w[depth], %w[depth], #1\n"
+        "fmla v30.4s, v4.4s, v1.s[2]\n"
+        "fmla v31.4s, v4.4s, v1.s[3]\n"
+
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Store accumulators
+        "mov x0, %[accum_ptr]\n"
+        "st1 {v8.16b}, [x0], #16\n"
+        "st1 {v16.16b}, [x0], #16\n"
+        "st1 {v24.16b}, [x0], #16\n"
+        "st1 {v9.16b}, [x0], #16\n"
+        "st1 {v17.16b}, [x0], #16\n"
+        "st1 {v25.16b}, [x0], #16\n"
+        "st1 {v10.16b}, [x0], #16\n"
+        "st1 {v18.16b}, [x0], #16\n"
+        "st1 {v26.16b}, [x0], #16\n"
+        "st1 {v11.16b}, [x0], #16\n"
+        "st1 {v19.16b}, [x0], #16\n"
+        "st1 {v27.16b}, [x0], #16\n"
+        "st1 {v12.16b}, [x0], #16\n"
+        "st1 {v20.16b}, [x0], #16\n"
+        "st1 {v28.16b}, [x0], #16\n"
+        "st1 {v13.16b}, [x0], #16\n"
+        "st1 {v21.16b}, [x0], #16\n"
+        "st1 {v29.16b}, [x0], #16\n"
+        "st1 {v14.16b}, [x0], #16\n"
+        "st1 {v22.16b}, [x0], #16\n"
+        "st1 {v30.16b}, [x0], #16\n"
+        "st1 {v15.16b}, [x0], #16\n"
+        "st1 {v23.16b}, [x0], #16\n"
+        "st1 {v31.16b}, [x0], #16\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "x0", "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
+        "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15", "v16", "v17",
+        "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25", "v26", "v27",
+        "v28", "v29", "v30", "v31");
+  }
+};
+
+#ifndef __APPLE__
+// Faster kernel contributed by ARM. Tuned for A53.
+struct NEON_64bit_GEMM_Float32_WithScalar_A53 {
+  typedef float OperandType;
+  typedef float AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 2> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    asm volatile(
+        // Load accumulators
+        "mov x0, %[accum_ptr]\n"
+        "ld1 {v8.16b}, [x0], #16\n"
+        "ld1 {v16.16b}, [x0], #16\n"
+        "ld1 {v24.16b}, [x0], #16\n"
+        "ld1 {v9.16b}, [x0], #16\n"
+        "ld1 {v17.16b}, [x0], #16\n"
+        "ld1 {v25.16b}, [x0], #16\n"
+        "ld1 {v10.16b}, [x0], #16\n"
+        "ld1 {v18.16b}, [x0], #16\n"
+        "ld1 {v26.16b}, [x0], #16\n"
+        "ld1 {v11.16b}, [x0], #16\n"
+        "ld1 {v19.16b}, [x0], #16\n"
+        "ld1 {v27.16b}, [x0], #16\n"
+        "ld1 {v12.16b}, [x0], #16\n"
+        "ld1 {v20.16b}, [x0], #16\n"
+        "ld1 {v28.16b}, [x0], #16\n"
+        "ld1 {v13.16b}, [x0], #16\n"
+        "ld1 {v21.16b}, [x0], #16\n"
+        "ld1 {v29.16b}, [x0], #16\n"
+        "ld1 {v14.16b}, [x0], #16\n"
+        "ld1 {v22.16b}, [x0], #16\n"
+        "ld1 {v30.16b}, [x0], #16\n"
+        "ld1 {v15.16b}, [x0], #16\n"
+        "ld1 {v23.16b}, [x0], #16\n"
+        "ld1 {v31.16b}, [x0], #16\n"
+
+        // For A53, a very different-looking loop is needed.
+        //
+        // The main reason for this is that on A53 128-bit loads take two
+        // cycles during which no dual issue can occur.  Doing two separate
+        // 64-bit loads avoids this issue - they each take one cycle and are
+        // able to dual issue.  Since vector register loads don't dual issue
+        // with FMLA, we load half the register as normal and the other half
+        // into an integer register.  This second half can then be moved into
+        // place later with an INS instruction - which will dual issue with a
+        // later FP load.
+        //
+        // For this kernel there are approximately 3 times as many multiplies
+        // as loads, so it makes sense to structure the loop into blocks of 4
+        // cycles, with 1 dedicated "load cycle" and 3 "multiply cycles" per
+        // block.  Strictly preserving this structure with NOPs where no load
+        // is needed seems to result in higher performance.
+        //
+        // Choice of x18 to store the upper halves on their way into the
+        // vector registers is arbitrary.  Added to the clobber list so that
+        // the compiler will make it available.
+        //
+        //
+        // At the start of the loop, it is assumed that v0 is "half loaded" -
+        // bottom half in place in d0 and the upper half in x18 ready to
+        // insert.  So set that up here for the first iteration:
+        "ldr d0, [%[rhs_ptr]]\n"             // Bottom half of first Rhs cell
+        "ldr x18, [%[rhs_ptr], #8]\n"        // Upper half
+        "add %[rhs_ptr], %[rhs_ptr], #16\n"  // Separate increment (needed as
+        // there is no operation to load at
+        // reg + 8 but then increment reg
+        // by 16).
+
+        // v2 should be fully loaded - as it's outside the loop proper it's fine
+        // to use a 128-bit load here.
+        "ld1 {v2.4s}, [%[lhs_ptr]], #16\n"  // first Lhs cell
+
+        GEMMLOWP_LABEL_LOOP
+        ":\n"
+
+        // First block of four cycles.  Multiplies all require v2 and v0; v2 is
+        // loaded earlier and v0 is half loaded and completed in the load
+        // cycle at the start.
+        "ldr d1, [%[rhs_ptr]]\n"  // "load" cycle - loading bottom half of v1
+        // (second Rhs cell).
+        "ins v0.d[1], x18\n"  // "load" cycle - moving the upper half of v0 into
+        // place.
+        "fmla v8.4s, v2.4s, v0.s[0]\n"  // "fmla" cycle 1 - first multiply.
+        "ldr x18, [%[rhs_ptr], #8]\n"  // "fmla" cycle 1 - load upper half of v1
+        // into x18.
+        "fmla v9.4s, v2.4s, v0.s[1]\n"       // "fmla" cycle 2 - second multiply
+        "add %[rhs_ptr], %[rhs_ptr], #16\n"  // "fmla" cycle 2 - increment Rhs
+        // pointer (if needed)
+        "fmla v10.4s, v2.4s, v0.s[2]\n"  // "fmla" cycle 3 - third multiply.  No
+        // more work to dual issue.
+
+        // Second block.  Start loading v3 (second Lhs cell), finish loading v1.
+        "ldr d3, [%[lhs_ptr]]\n"
+        "ins v1.d[1], x18\n"  // v1 ready here.
+        "fmla v11.4s, v2.4s, v0.s[3]\n"
+        "ldr x18, [%[lhs_ptr], #8]\n"
+        "fmla v12.4s, v2.4s, v1.s[0]\n"  // First use of v1.
+        "add %[lhs_ptr], %[lhs_ptr], #16\n"
+        "fmla v13.4s, v2.4s, v1.s[1]\n"
+
+        // Third block.  Start loading v4 (third Lhs cell), finish loading v3.
+        "ldr d4, [%[lhs_ptr]]\n"
+        "ins v3.d[1], x18\n"  // v3 ready here.
+        "fmla v14.4s, v2.4s, v1.s[2]\n"
+        "ldr x18, [%[lhs_ptr], #8]\n"
+        "fmla v15.4s, v2.4s, v1.s[3]\n"
+        "add %[lhs_ptr], %[lhs_ptr], #16\n"
+        "fmla v16.4s, v3.4s, v0.s[0]\n"  // First use of v3.
+
+        // Fourth block.  v2 (first Lhs cell) is now finished with, so start
+        // loading value for next iteration.  Finish loading v4.
+        "ldr d2, [%[lhs_ptr]]\n"
+        "ins v4.d[1], x18\n"  // v4 ready here.
+        "fmla v17.4s, v3.4s, v0.s[1]\n"
+        "ldr x18, [%[lhs_ptr], #8]\n"
+        "fmla v18.4s, v3.4s, v0.s[2]\n"
+        "add %[lhs_ptr], %[lhs_ptr], #16\n"
+        "fmla v19.4s, v3.4s, v0.s[3]\n"
+
+        // Fifth block, finish loading v2.  No new load to start as the other
+        // registers are all still live.
+        "ins v2.d[1], x18\n"
+        "fmla v20.4s, v3.4s, v1.s[0]\n"
+        "fmla v21.4s, v3.4s, v1.s[1]\n"
+        "fmla v22.4s, v3.4s, v1.s[2]\n"
+
+        // Sixth block, nothing to load.  2 nops needed as a single nop would
+        // dual issue with the FMLA and break the timing.
+        "nop\n"
+        "nop\n"
+        "fmla v23.4s, v3.4s, v1.s[3]\n"
+        "fmla v24.4s, v4.4s, v0.s[0]\n"  // First use of v4.
+        "fmla v25.4s, v4.4s, v0.s[1]\n"
+
+        // Seventh block, nothing to load.  Decrement the loop counter in this
+        // block as the last block is very full.
+        "nop\n"
+        "nop\n"
+        "fmla v26.4s, v4.4s, v0.s[2]\n"
+        "subs %w[depth], %w[depth], #1\n"
+        "fmla v27.4s, v4.4s, v0.s[3]\n"
+        "fmla v28.4s, v4.4s, v1.s[0]\n"
+
+        // Eighth block - start loading v0 for next iteration.
+        "ldr d0, [%[rhs_ptr]]\n"
+        "fmla v29.4s, v4.4s, v1.s[1]\n"
+        "ldr x18, [%[rhs_ptr], #8]\n"
+        "fmla v30.4s, v4.4s, v1.s[2]\n"
+        "add %[rhs_ptr], %[rhs_ptr], #16\n"
+        "fmla v31.4s, v4.4s, v1.s[3]\n"
+
+        // Loop branch.  This will dual issue in fmla cycle 3 of the 8th block.
+        "bne " GEMMLOWP_LABEL_LOOP
+        "b\n"
+
+        // Store accumulators
+        "mov x0, %[accum_ptr]\n"
+        "st1 {v8.16b}, [x0], #16\n"
+        "st1 {v16.16b}, [x0], #16\n"
+        "st1 {v24.16b}, [x0], #16\n"
+        "st1 {v9.16b}, [x0], #16\n"
+        "st1 {v17.16b}, [x0], #16\n"
+        "st1 {v25.16b}, [x0], #16\n"
+        "st1 {v10.16b}, [x0], #16\n"
+        "st1 {v18.16b}, [x0], #16\n"
+        "st1 {v26.16b}, [x0], #16\n"
+        "st1 {v11.16b}, [x0], #16\n"
+        "st1 {v19.16b}, [x0], #16\n"
+        "st1 {v27.16b}, [x0], #16\n"
+        "st1 {v12.16b}, [x0], #16\n"
+        "st1 {v20.16b}, [x0], #16\n"
+        "st1 {v28.16b}, [x0], #16\n"
+        "st1 {v13.16b}, [x0], #16\n"
+        "st1 {v21.16b}, [x0], #16\n"
+        "st1 {v29.16b}, [x0], #16\n"
+        "st1 {v14.16b}, [x0], #16\n"
+        "st1 {v22.16b}, [x0], #16\n"
+        "st1 {v30.16b}, [x0], #16\n"
+        "st1 {v15.16b}, [x0], #16\n"
+        "st1 {v23.16b}, [x0], #16\n"
+        "st1 {v31.16b}, [x0], #16\n"
+        :  // outputs
+        [lhs_ptr] "+r"(lhs_ptr), [rhs_ptr] "+r"(rhs_ptr),
+        [depth] "+r"(depth)
+        :  // inputs
+        [accum_ptr] "r"(accum_ptr)
+        :  // clobbers
+        "cc", "memory", "x0", "x18", "v0", "v1", "v2", "v3", "v4", "v5", "v6",
+        "v7", "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15", "v16",
+        "v17", "v18", "v19", "v20", "v21", "v22", "v23", "v24", "v25", "v26",
+        "v27", "v28", "v29", "v30", "v31");
+  }
+};
+#endif
+
+#endif  // __aarch64__
+
+#ifndef __aarch64__
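+// AArch64 provides vpaddq_s32 natively; on 32-bit ARM we emulate it with two
+// 64-bit pairwise adds followed by a combine.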
+inline int32x4_t vpaddq_s32(int32x4_t a, int32x4_t b) {
+  const int32x2_t c = vpadd_s32(vget_low_s32(a), vget_high_s32(a));
+  const int32x2_t d = vpadd_s32(vget_low_s32(b), vget_high_s32(b));
+  return vcombine_s32(c, d);
+}
+#endif
+
+// C++ intrinsics-based variant of the deep, int8, fast kernel
+template <int Cols>
+struct NEON_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics {
+  typedef std::int8_t OperandType;
+  typedef std::int32_t AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 16, CellOrder::WidthMajor>, 1>,
+      KernelSideFormat<CellFormat<Cols, 16, CellOrder::WidthMajor>, 1> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    int32x4_t acc[4][Cols];
+    for (int i = 0; i < 4; i++) {
+      for (int j = 0; j < Cols; j++) {
+        acc[i][j] = vdupq_n_s32(0);
+      }
+    }
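+    // Each iteration consumes 16 levels of depth: multiply int8 operands into
+    // int16 products (vmull_s8/vmlal_s8), then pairwise-accumulate those
+    // products into the int32 accumulators (vpadalq_s16).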
+    for (int d = 0; d < depth; d += 16) {
+      int8x16_t lhs[4];
+      for (int i = 0; i < 4; i++) {
+        lhs[i] = vld1q_s8(lhs_ptr + 16 * i);
+      }
+      int8x16_t rhs[Cols];
+      for (int i = 0; i < Cols; i++) {
+        rhs[i] = vld1q_s8(rhs_ptr + 16 * i);
+      }
+      for (int i = 0; i < 4; i++) {
+        for (int j = 0; j < Cols; j++) {
+          int16x8_t local_acc =
+              vmull_s8(vget_low_s8(lhs[i]), vget_low_s8(rhs[j]));
+          local_acc =
+              vmlal_s8(local_acc, vget_high_s8(lhs[i]), vget_high_s8(rhs[j]));
+          acc[i][j] = vpadalq_s16(acc[i][j], local_acc);
+        }
+      }
+      lhs_ptr += 64;
+      rhs_ptr += 16 * Cols;
+    }
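+    // Reduce each accumulator horizontally with pairwise adds and add the
+    // result to the values already stored at accum_ptr.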
+    for (int i = 0; i < Cols; i++) {
+      int32x4_t acc_2x_0 = vpaddq_s32(acc[0][i], acc[1][i]);
+      int32x4_t acc_2x_1 = vpaddq_s32(acc[2][i], acc[3][i]);
+      int32x4_t acc_4x = vpaddq_s32(acc_2x_0, acc_2x_1);
+      int32x4_t dst_val = vld1q_s32(accum_ptr + 4 * i);
+      dst_val = vaddq_s32(dst_val, acc_4x);
+      vst1q_s32(accum_ptr + 4 * i, dst_val);
+    }
+  }
+};
+
+using NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics =
+    NEON_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics<4>;
+
+using NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics =
+    NEON_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics<2>;
+
+// C++ intrinsics-based variant of the wide, uint8, general kernel
+template <int RhsCells>
+struct NEON_GEMM_Uint8Operands_Uint32Accumulators_intrinsics {
+  typedef std::uint8_t OperandType;
+  typedef std::int32_t AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 2, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 2, CellOrder::DepthMajor>, RhsCells> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    int32x4_t acc[3][4 * RhsCells];
+    for (int i = 0; i < 3; i++) {
+      for (int j = 0; j < 4 * RhsCells; j++) {
+        acc[i][j] = vld1q_s32(accum_ptr + 4 * (i + 3 * j));
+      }
+    }
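+    // Each iteration handles 2 levels of depth: widen uint8 operands to int16
+    // (vmovl_u8), then multiply-accumulate by lane into the int32 accumulators.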
+    for (int d = 0; d < depth; d += 2) {
+      int16x8_t lhs[3];
+      for (int i = 0; i < 3; i++) {
+        lhs[i] = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(lhs_ptr + 8 * i)));
+      }
+      int16x8_t rhs[RhsCells];
+      for (int i = 0; i < RhsCells; i++) {
+        rhs[i] = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(rhs_ptr + 8 * i)));
+      }
+      for (int i = 0; i < 3; i++) {
+        for (int j = 0; j < RhsCells; j++) {
+          acc[i][4 * j + 0] = vmlal_lane_s16(
+              acc[i][4 * j + 0], vget_low_s16(lhs[i]), vget_low_s16(rhs[j]), 0);
+          acc[i][4 * j + 1] = vmlal_lane_s16(
+              acc[i][4 * j + 1], vget_low_s16(lhs[i]), vget_low_s16(rhs[j]), 1);
+          acc[i][4 * j + 2] = vmlal_lane_s16(
+              acc[i][4 * j + 2], vget_low_s16(lhs[i]), vget_low_s16(rhs[j]), 2);
+          acc[i][4 * j + 3] = vmlal_lane_s16(
+              acc[i][4 * j + 3], vget_low_s16(lhs[i]), vget_low_s16(rhs[j]), 3);
+          acc[i][4 * j + 0] =
+              vmlal_lane_s16(acc[i][4 * j + 0], vget_high_s16(lhs[i]),
+                             vget_high_s16(rhs[j]), 0);
+          acc[i][4 * j + 1] =
+              vmlal_lane_s16(acc[i][4 * j + 1], vget_high_s16(lhs[i]),
+                             vget_high_s16(rhs[j]), 1);
+          acc[i][4 * j + 2] =
+              vmlal_lane_s16(acc[i][4 * j + 2], vget_high_s16(lhs[i]),
+                             vget_high_s16(rhs[j]), 2);
+          acc[i][4 * j + 3] =
+              vmlal_lane_s16(acc[i][4 * j + 3], vget_high_s16(lhs[i]),
+                             vget_high_s16(rhs[j]), 3);
+        }
+      }
+      lhs_ptr += 24;
+      rhs_ptr += 8 * RhsCells;
+    }
+    for (int i = 0; i < 3; i++) {
+      for (int j = 0; j < 4 * RhsCells; j++) {
+        vst1q_s32(accum_ptr + 4 * (i + 3 * j), acc[i][j]);
+      }
+    }
+  }
+};
+
+using NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics =
+    NEON_GEMM_Uint8Operands_Uint32Accumulators_intrinsics<1>;
+
+using NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics =
+    NEON_GEMM_Uint8Operands_Uint32Accumulators_intrinsics<2>;
+
+template <int RhsCells>
+struct NEON_GEMM_Float32_WithScalar_intrinsics {
+  typedef float OperandType;
+  typedef float AccumulatorType;
+  typedef KernelFormat<
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, 3>,
+      KernelSideFormat<CellFormat<4, 1, CellOrder::DepthMajor>, RhsCells> >
+      Format;
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    float32x4_t acc[3][4 * RhsCells];
+    for (int i = 0; i < 3; i++) {
+      for (int j = 0; j < 4 * RhsCells; j++) {
+        acc[i][j] = vld1q_f32(accum_ptr + 4 * (i + 3 * j));
+      }
+    }
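+    // Each iteration handles 1 level of depth: multiply the three Lhs cells by
+    // each Rhs scalar (by lane) and accumulate into the float accumulators.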
+    for (int d = 0; d < depth; d++) {
+      float32x4_t lhs[3];
+      for (int i = 0; i < 3; i++) {
+        lhs[i] = vld1q_f32(lhs_ptr + 4 * i);
+      }
+      float32x4_t rhs[RhsCells];
+      for (int i = 0; i < RhsCells; i++) {
+        rhs[i] = vld1q_f32(rhs_ptr + 4 * i);
+      }
+      for (int i = 0; i < 3; i++) {
+        for (int j = 0; j < RhsCells; j++) {
+          acc[i][4 * j + 0] = vmlaq_lane_f32(acc[i][4 * j + 0], lhs[i],
+                                             vget_low_f32(rhs[j]), 0);
+          acc[i][4 * j + 1] = vmlaq_lane_f32(acc[i][4 * j + 1], lhs[i],
+                                             vget_low_f32(rhs[j]), 1);
+          acc[i][4 * j + 2] = vmlaq_lane_f32(acc[i][4 * j + 2], lhs[i],
+                                             vget_high_f32(rhs[j]), 0);
+          acc[i][4 * j + 3] = vmlaq_lane_f32(acc[i][4 * j + 3], lhs[i],
+                                             vget_high_f32(rhs[j]), 1);
+        }
+      }
+      lhs_ptr += 12;
+      rhs_ptr += 4 * RhsCells;
+    }
+    for (int i = 0; i < 3; i++) {
+      for (int j = 0; j < 4 * RhsCells; j++) {
+        vst1q_f32(accum_ptr + 4 * (i + 3 * j), acc[i][j]);
+      }
+    }
+  }
+};
+
+using NEON_32bit_GEMM_Float32_WithScalar_intrinsics =
+    NEON_GEMM_Float32_WithScalar_intrinsics<1>;
+
+using NEON_64bit_GEMM_Float32_WithScalar_intrinsics =
+    NEON_GEMM_Float32_WithScalar_intrinsics<2>;
+
+// BEGIN code copied from gemmlowp/internal/kernel_reference.h
+
+// This kernel is templatized in an arbitrary Format template parameter,
+// allowing it to have any arbitrary format.
+template <typename tOperandType, typename tAccumulatorType, typename tFormat>
+struct ReferenceKernel {
+  typedef tOperandType OperandType;
+  typedef tAccumulatorType AccumulatorType;
+  typedef tFormat Format;
+
+  static void Run(const OperandType* lhs_ptr, const OperandType* rhs_ptr,
+                  AccumulatorType* accum_ptr, int depth) {
+    const int depth_cells = static_cast<int>(depth / Format::kDepth);
+
+    // The outer loop is over the depth dimension.
+    for (int dc = 0; dc < depth_cells; dc++) {
+      // The next two loops are over cells of the Lhs (stacked vertically),
+      // and over cells of the Rhs (stacked horizontally).
+      for (int rc = 0; rc < Format::Lhs::kCells; rc++) {
+        const OperandType* lhs_cell_ptr =
+            lhs_ptr + (dc * Format::Lhs::kCells + rc) *
+                          Format::Lhs::Cell::kWidth * Format::kDepth;
+        for (int cc = 0; cc < Format::Rhs::kCells; cc++) {
+          const OperandType* rhs_cell_ptr =
+              rhs_ptr + (dc * Format::Rhs::kCells + cc) *
+                            Format::Rhs::Cell::kWidth * Format::kDepth;
+
+          // Now we are inside one cell of the Lhs and inside one cell
+          // of the Rhs, so the remaining inner loops are just
+          // traditional three loops of matrix multiplication.
+          for (int di = 0; di < Format::kDepth; di++) {
+            for (int ri = 0; ri < Format::Lhs::Cell::kWidth; ri++) {
+              for (int ci = 0; ci < Format::Rhs::Cell::kWidth; ci++) {
+                const OperandType* lhs_coeff_ptr =
+                    lhs_cell_ptr +
+                    OffsetIntoCell<typename Format::Lhs::Cell>(ri, di);
+                const OperandType* rhs_coeff_ptr =
+                    rhs_cell_ptr +
+                    OffsetIntoCell<typename Format::Rhs::Cell>(ci, di);
+                AccumulatorType* accumulator_coeff_ptr =
+                    accum_ptr + (ri + rc * Format::Lhs::Cell::kWidth) +
+                    (ci + cc * Format::Rhs::Cell::kWidth) * Format::kRows;
+                *accumulator_coeff_ptr += AccumulatorType(*lhs_coeff_ptr) *
+                                          AccumulatorType(*rhs_coeff_ptr);
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+};
+
+// END code copied from gemmlowp/internal/kernel_reference.h
+
+template <typename DataType>
+class CacheLineAlignedBuffer {
+ public:
+  CacheLineAlignedBuffer(std::size_t size) : size_(size) {
+    data_ = nullptr;
+    // Adds a few bytes of padding here, because the 64-bit 'A57' kernel
+    // reads one iteration past the end of the buffer, causing a crash on iOS.
+    posix_memalign(reinterpret_cast<void**>(&data_), kCacheLineSize,
+                   size_ * sizeof(DataType) + 16);
+  }
+
+  ~CacheLineAlignedBuffer() { free(data_); }
+
+  const DataType* data() const { return data_; }
+  DataType* data() { return data_; }
+
+  std::size_t size() const { return size_; }
+
+ private:
+  const std::size_t size_;
+  DataType* data_;
+};
+
+template <typename DataType>
+void FillRandom(CacheLineAlignedBuffer<DataType>* buffer) {
+  static std::mt19937 generator(0);
+  // 100 is within the representable range of every data type used here.
+  const DataType kMaxVal = DataType(100);
+  const DataType kMinVal =
+      std::is_signed<DataType>::value ? -kMaxVal : DataType(0);
+  std::uniform_real_distribution<float> dist(kMinVal, kMaxVal);
+  for (std::size_t i = 0; i < buffer->size(); i++) {
+    buffer->data()[i] = DataType(dist(generator));
+  }
+}
+
+template <typename DataType>
+void FillZero(CacheLineAlignedBuffer<DataType>* buffer) {
+  for (std::size_t i = 0; i < buffer->size(); i++) {
+    buffer->data()[i] = DataType(0);
+  }
+}
+
+template <typename DataType>
+void Copy(CacheLineAlignedBuffer<DataType>* dst,
+          const CacheLineAlignedBuffer<DataType>& src) {
+  assert(dst->size() == src.size());
+  memcpy(dst->data(), src.data(), src.size() * sizeof(DataType));
+}
+
+template <typename DataType>
+void PrintMatrix(int rows, int cols, int rowstride, int colstride,
+                 const DataType* data) {
+  for (int r = 0; r < rows; r++) {
+    for (int c = 0; c < cols; c++) {
+      std::cerr << double(data[r * rowstride + c * colstride]) << " ";
+    }
+    std::cerr << std::endl;
+  }
+  std::cerr << std::endl;
+}
+
+template <typename DataType>
+bool approx_equals(DataType a, DataType b) {
+  return a == b;
+}
+
+template <>
+bool approx_equals(float a, float b) {
+  if (!a && !b) {
+    return true;
+  }
+  // 1e-1 is very coarse accuracy, we should switch to an overall L2 metric
+  // and tighten the tolerance on that metric.
+  return std::abs(a - b) < 1e-1f * std::min(std::abs(a), std::abs(b));
+}
+
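+// Runs the kernel under test against the reference kernel on random data with
+// random initial accumulators, and aborts with a diagnostic dump on mismatch.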
+template <typename Kernel>
+void test_kernel(int depth, const char* kernel_name) {
+  typedef typename Kernel::OperandType OperandType;
+  typedef typename Kernel::AccumulatorType AccumulatorType;
+  typedef typename Kernel::Format Format;
+  static const int kLhsWidth = Format::Lhs::kWidth;
+  static const int kRhsWidth = Format::Rhs::kWidth;
+
+  typedef ReferenceKernel<OperandType, AccumulatorType, Format> ReferenceKernel;
+
+  CacheLineAlignedBuffer<OperandType> lhs(kLhsWidth * depth);
+  CacheLineAlignedBuffer<OperandType> rhs(kRhsWidth * depth);
+  CacheLineAlignedBuffer<AccumulatorType> accum_initial(kLhsWidth * kRhsWidth);
+  CacheLineAlignedBuffer<AccumulatorType> accum(kLhsWidth * kRhsWidth);
+  CacheLineAlignedBuffer<AccumulatorType> accum_reference(kLhsWidth *
+                                                          kRhsWidth);
+
+  FillRandom(&lhs);
+  FillRandom(&rhs);
+  FillRandom(&accum_initial);
+  Copy(&accum, accum_initial);
+  Copy(&accum_reference, accum_initial);
+
+  ReferenceKernel::Run(lhs.data(), rhs.data(), accum_reference.data(), depth);
+  Kernel::Run(lhs.data(), rhs.data(), accum.data(), depth);
+
+  for (int l = 0; l < kLhsWidth; l++) {
+    for (int r = 0; r < kRhsWidth; r++) {
+      const int index = l + kLhsWidth * r;
+      if (!approx_equals(accum.data()[index], accum_reference.data()[index])) {
+        std::cerr << "Arithmetic error in kernel:" << std::endl
+                  << "    " << kernel_name << std::endl
+                  << "Wrong accumulator for depth=" << depth << ", "
+                  << "at l = " << l << ", r = " << r << std::endl;
+        std::cerr << "reference value: " << accum_reference.data()[index]
+                  << std::endl;
+        std::cerr << "actual value:    " << accum.data()[index] << std::endl;
+        if (depth <= 16) {
+          std::cerr << "LHS matrix:" << std::endl;
+          PrintMatrix(kLhsWidth, depth, 1, kLhsWidth, lhs.data());
+          std::cerr << "RHS matrix:" << std::endl;
+          PrintMatrix(depth, kRhsWidth, kRhsWidth, 1, rhs.data());
+          std::cerr << "Initial Accumulator matrix:" << std::endl;
+          PrintMatrix(kLhsWidth, kRhsWidth, 1, kLhsWidth, accum_initial.data());
+          std::cerr << "Reference Accumulator matrix:" << std::endl;
+          PrintMatrix(kLhsWidth, kRhsWidth, 1, kLhsWidth,
+                      accum_reference.data());
+          std::cerr << "Actual Accumulator matrix:" << std::endl;
+          PrintMatrix(kLhsWidth, kRhsWidth, 1, kLhsWidth, accum.data());
+        }
+        abort();
+      }
+    }
+  }
+}
+
+template <typename Kernel>
+int ops(int depth) {
+  // 2x the number of multiply-accumulate scalar ops.
+  return 2 * Kernel::Format::Lhs::kWidth * Kernel::Format::Rhs::kWidth * depth;
+}
+
+template <unsigned Modulus, typename Integer>
+Integer RoundDown(Integer i) {
+  return i - (i % Modulus);
+}
+
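+// Cache size assumed when choosing the benchmark depth; can be overridden with
+// the CACHE_SIZE_KB environment variable.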
+int CacheSizeInKB() {
+  static const char* cache_size_k_env = getenv("CACHE_SIZE_KB");
+  static const int cache_size_k =
+      cache_size_k_env ? atoi(cache_size_k_env) : kDefaultCacheSizeK;
+  return cache_size_k;
+}
+
+template <typename Kernel>
+int BenchmarkDepthToFitInCache() {
+  const int cache_size_bytes = 1024 * CacheSizeInKB();
+
+  // Subtract the typical size of a few cache lines, so that we don't need to
+  // worry too much about other data (e.g. stack data) also occupying the cache.
+  const int conservative_cache_size_bytes =
+      cache_size_bytes - 2 * kCacheLineSize;
+
+  // We will subtract the memory occupied by accumulators.
+  typedef typename Kernel::AccumulatorType AccumulatorType;
+  const int kAccumulatorBytes = sizeof(AccumulatorType) *
+                                Kernel::Format::Lhs::kWidth *
+                                Kernel::Format::Rhs::kWidth;
+
+  // Compute the depth.
+  typedef typename Kernel::OperandType OperandType;
+  const int kBytesPerUnitOfDepth =
+      sizeof(OperandType) *
+      (Kernel::Format::Lhs::kWidth + Kernel::Format::Rhs::kWidth);
+  const int unrounded_depth =
+      (conservative_cache_size_bytes - kAccumulatorBytes) /
+      kBytesPerUnitOfDepth;
+
+  // Cap depth, to avoid unfairly favoring narrower kernels
+  const int kMaxDepth = 1024;
+  const int clamped_unrounded_depth = std::min(kMaxDepth, unrounded_depth);
+
+  // Round depth down to a multiple of the cache line size, which helps
+  // because our kernels may crash if depth is not a multiple of the number
+  // of depth levels that they handle at each loop iteration, and we don't
+  // want to require kernels to be more complex. Currently all kernels
+  // process 1, 2 or 8 levels of depth at a time. The main reason why that
+  // might increase in the future is if registers get wider, but registers
+  // are unlikely to ever get wider than cache lines.
+  return RoundDown<kCacheLineSize>(clamped_unrounded_depth);
+}
+
+double current_time_in_seconds() {
+  timespec t;
+  clock_gettime(CLOCK_REALTIME, &t);
+  return t.tv_sec + 1e-9 * t.tv_nsec;
+}
+
+template <typename Kernel>
+double benchmark(int depth) {
+  // Minimum duration for this benchmark to run. If the workload finishes
+  // sooner, we retry with double the number of iterations.
+  static const double min_benchmark_time_in_seconds = 1.0;
+
+  typedef typename Kernel::OperandType OperandType;
+  typedef typename Kernel::AccumulatorType AccumulatorType;
+
+  CacheLineAlignedBuffer<OperandType> lhs(Kernel::Format::Lhs::kWidth * depth);
+  CacheLineAlignedBuffer<OperandType> rhs(Kernel::Format::Rhs::kWidth * depth);
+  CacheLineAlignedBuffer<AccumulatorType> accum(Kernel::Format::Lhs::kWidth *
+                                                Kernel::Format::Rhs::kWidth);
+
+  for (std::uint64_t iters_at_a_time = 1;; iters_at_a_time *= 2) {
+    const double t_start = current_time_in_seconds();
+    for (std::uint64_t i = 0; i < iters_at_a_time; i++) {
+      Kernel::Run(lhs.data(), rhs.data(), accum.data(), depth);
+    }
+    const double t_end = current_time_in_seconds();
+    const double elapsed = t_end - t_start;
+    if (elapsed > min_benchmark_time_in_seconds) {
+      return iters_at_a_time * ops<Kernel>(depth) / elapsed;
+    }
+  }
+}
+
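+// Tests the kernel at many depths, then benchmarks it. Setting the
+// BENCHMARK_KERNEL environment variable restricts this to a single kernel.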
+template <typename Kernel>
+void benchmark_and_print_results(const char* kernel_name) {
+  if (getenv("BENCHMARK_KERNEL")) {
+    if (strcmp(getenv("BENCHMARK_KERNEL"), kernel_name)) {
+      return;
+    }
+  }
+  const int kKernelDepth = Kernel::Format::kDepth;
+  for (int depth = kKernelDepth; depth <= 1024; depth += kKernelDepth) {
+    test_kernel<Kernel>(depth, kernel_name);
+  }
+
+  if (getenv("BENCHMARK_ALL_DEPTHS")) {
+    for (int depth = kKernelDepth;
+         depth <= BenchmarkDepthToFitInCache<Kernel>(); depth *= 2) {
+      std::cout << kernel_name << "," << depth << ","
+                << benchmark<Kernel>(depth) * 1e-9f << std::endl;
+    }
+  } else {
+    const int depth = BenchmarkDepthToFitInCache<Kernel>();
+    std::cout << kernel_name << "," << benchmark<Kernel>(depth) * 1e-9f
+              << std::endl;
+  }
+}
+
+#define BENCHMARK(Kernel)                         \
+  do {                                            \
+    benchmark_and_print_results<Kernel>(#Kernel); \
+  } while (false)
+
+int main() {
+  if (getenv("BENCHMARK_ALL_DEPTHS")) {
+    std::cout << "kernel,depth,Gop/s" << std::endl;
+  } else {
+    std::cout << "kernel,Gop/s" << std::endl;
+  }
+
+#ifdef __arm__
+  BENCHMARK(NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits);
+  BENCHMARK(NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics);
+  BENCHMARK(NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators);
+  BENCHMARK(NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics);
+  BENCHMARK(NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand);
+  BENCHMARK(NEON_32bit_GEMM_Int32_WithScalar);
+  BENCHMARK(NEON_32bit_GEMM_Float32_MLA_WithVectorDuplicatingScalar);
+#ifdef __ARM_FEATURE_FMA
+  BENCHMARK(NEON_32bit_GEMM_Float32_FMA_WithVectorDuplicatingScalar);
+#endif
+  BENCHMARK(NEON_32bit_GEMM_Float32_MLA_WithScalar);
+  BENCHMARK(NEON_32bit_GEMM_Float32_WithScalar_intrinsics);
+  BENCHMARK(NEON_32bit_GEMM_Float32_WithScalar_A53);
+  BENCHMARK(NEON_32bit_GEMM_Float32_WithScalar_A53_depth2);
+  BENCHMARK(NEON_32bit_GEMM_Float32_MLA_Rotating);
+#ifdef __ARM_FEATURE_FMA
+  BENCHMARK(NEON_32bit_GEMM_Float32_FMA_Rotating);
+#endif
+#endif
+
+#ifdef __aarch64__
+
+  BENCHMARK(NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits);
+  BENCHMARK(NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics);
+  BENCHMARK(NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators);
+  BENCHMARK(NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics);
+  BENCHMARK(NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand_A57);
+  BENCHMARK(NEON_64bit_GEMM_Int32_WithScalar);
+  BENCHMARK(NEON_64bit_GEMM_Float32_WithVectorDuplicatingScalar);
+  BENCHMARK(NEON_64bit_GEMM_Float32_WithScalar);
+  BENCHMARK(NEON_64bit_GEMM_Float32_WithScalar_intrinsics);
+  BENCHMARK(NEON_64bit_GEMM_Float32_WithScalar_A57);
+#ifndef __APPLE__
+  BENCHMARK(NEON_64bit_GEMM_Float32_WithScalar_A53);
+#endif
+#endif
+
+  return 0;
+}
diff --git a/test/benchmark.cc b/test/benchmark.cc
index a4ef2a5..20dd369 100644
--- a/test/benchmark.cc
+++ b/test/benchmark.cc
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -37,6 +37,11 @@
 #warning "Building without NEON support on ARM, check your compiler setup!"
 #endif
 
+#if defined(__SSE4_2__) && !defined(GEMMLOWP_SSE4)
+#warning \
+    "Building without SSE4.2 support on SSE4.2 enabled machine, check your compiler setup!"
+#endif
+
 namespace gemmlowp {
 
 double time() {
@@ -348,8 +353,8 @@
 
   {
     gemmlowp::GemmContext context;
-    std::cout << "Benchmarking default mode (typically multi-threaded)..."
-              << std::endl;
+    context.set_max_num_threads(0);
+    std::cout << "Benchmarking multi-threaded mode..." << std::endl;
     gemmlowp::benchmark(&context);
   }
 
diff --git a/test/benchmark_meta_gemm.cc b/test/benchmark_meta_gemm.cc
index 9d54435..b7bae87 100644
--- a/test/benchmark_meta_gemm.cc
+++ b/test/benchmark_meta_gemm.cc
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -289,6 +289,17 @@
     shape.init();
   }
 
+  std::vector<Shape> lstm;
+  lstm.push_back(Shape(1, 500, 320));
+  lstm.push_back(Shape(1, 100, 500));
+  lstm.push_back(Shape(1, 500, 500));
+  lstm.push_back(Shape(1, 500, 100));
+  lstm.push_back(Shape(1, 2000, 100));
+
+  for (auto& shape : lstm) {
+    shape.init();
+  }
+
   gemmlowp::eight_bit_int_gemm::SetMaxNumThreads(4);
 
   std::cout << "Warmup run." << std::endl;
@@ -296,7 +307,7 @@
   time_all(&small_gemms, 50, 1.0);
 
   std::cout << "Timing all." << std::endl;
-  time_all(&googlenet_gemms, 10, 20.0);
+  time_all(&googlenet_gemms, 10, 10.0);
   time_all(&small_gemms, 50, 10.0);
 
   std::cout << "Timing separate." << std::endl;
@@ -313,5 +324,9 @@
     time_one(&shape, 0.10);
   }
 
+  for (auto& shape : lstm) {
+    time_one(&shape, 0.10);
+  }
+
   return 0;
 }
diff --git a/test/correctness_meta_gemm.cc b/test/correctness_meta_gemm.cc
index 4c89ded..6abc25a 100644
--- a/test/correctness_meta_gemm.cc
+++ b/test/correctness_meta_gemm.cc
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -24,9 +24,13 @@
 #include <map>
 #include <vector>
 
+#include "../meta/legacy_multi_thread_gemm.h"
 #include "../public/gemmlowp.h"
-#include "../meta/multi_thread_gemm.h"
 #include "test.h"
+// Let's include these to make sure they always compile.
+#include "../meta/multi_thread_gemm.h"
+#include "../meta/multi_thread_transform.h"
+#include "../meta/legacy_multi_thread_common.h"
 
 #if defined(__arm__) && !defined(GEMMLOWP_NEON)
 #warning "Building without NEON support on ARM, check your compiler setup!"
@@ -46,7 +50,7 @@
 
 void prepare_test_data(std::uint8_t* data, std::int32_t rows, std::int32_t cols,
                        std::int32_t seed, std::int32_t seed_2) {
-  int32_t value = seed;
+  std::int32_t value = seed;
   for (int i = 0; i < rows; ++i) {
     for (int j = 0; j < cols; ++j) {
       data[i * cols + j] = static_cast<std::uint8_t>(value);
@@ -55,9 +59,6 @@
   }
 }
 
-bool verbose = false;
-bool quiet = true;
-
 void check_result(std::uint8_t* left, std::uint8_t* right, std::uint8_t* result,
                   std::int32_t rows, std::int32_t cols, std::int32_t depth,
                   std::int32_t lhs_offset, std::int32_t rhs_offset,
@@ -84,31 +85,18 @@
       }
       expected = static_cast<std::int32_t>(static_cast<std::uint8_t>(expected));
       std::int32_t actual = static_cast<std::int32_t>(result[i * cols + j]);
-      if (actual == expected) {
-        if (!quiet) {
-          if (verbose) {
-            std::cout << expected << "==" << actual << " ";
-          } else {
-            std::cout << ".";
-          }
-        }
-      } else {
-        if (!quiet) {
-          if (verbose) {
-            std::cout << expected << "!=" << actual << " ";
-          } else {
-            std::cout << "x";
-          }
-        }
+      if (actual != expected) {
+        std::cout << "(" << i << ", " << j << "): " << expected << "!="
+                  << actual << std::endl;
         wrong++;
       }
     }
-    if (!quiet) {
-      std::cout << std::endl;
-    }
   }
   if (wrong > 0) {
-    std::cout << "Wrong: " << wrong << std::endl;
+    std::cout << "Wrong: " << rows << "x" << cols << "x" << depth << " : "
+              << wrong << "/" << (rows * cols) << std::endl
+              << std::flush;
+    std::exit(1);
   } else {
     std::cout << "." << std::flush;
   }
@@ -129,31 +117,50 @@
       }
       float expected_float = static_cast<float>(expected) * result_offset;
       float actual_float = result[i * cols + j];
-      if (actual_float == expected_float) {
-        if (!quiet) {
-          if (verbose) {
-            std::cout << expected_float << "==" << actual_float << " ";
-          } else {
-            std::cout << ".";
-          }
-        }
-      } else {
-        if (!quiet) {
-          if (verbose) {
-            std::cout << expected_float << "!=" << actual_float << " ";
-          } else {
-            std::cout << "x";
-          }
-        }
+      if (actual_float != expected_float) {
+        std::cout << "(" << i << ", " << j << "): " << expected_float << "!="
+                  << actual_float << std::endl;
         wrong++;
       }
     }
-    if (!quiet) {
-      std::cout << std::endl;
+  }
+  if (wrong > 0) {
+    std::cout << "Wrong: " << rows << "x" << cols << "x" << depth << " : "
+              << wrong << "/" << (rows * cols) << std::endl
+              << std::flush;
+    std::exit(1);
+  } else {
+    std::cout << "." << std::flush;
+  }
+}
+
+
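+// Computes the expected int32 GEMM result on the CPU and compares it against
+// 'result', printing each mismatch and exiting if any entry is wrong.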
+void check_result_i32(std::uint8_t* left, std::uint8_t* right,
+                      std::int32_t* result, std::int32_t rows,
+                      std::int32_t cols, std::int32_t depth,
+                      std::int32_t lhs_offset, std::int32_t rhs_offset) {
+  std::int32_t wrong = 0;
+  for (int i = 0; i < rows; ++i) {
+    for (int j = 0; j < cols; ++j) {
+      std::int32_t expected = 0;
+      for (int k = 0; k < depth; ++k) {
+        expected +=
+            (static_cast<std::int32_t>(left[depth * i + k]) + lhs_offset) *
+            (static_cast<std::int32_t>(right[depth * j + k]) + rhs_offset);
+      }
+      std::int32_t actual = result[i * cols + j];
+      if (actual != expected) {
+        std::cout << "(" << i << ", " << j << "): " << expected << "!="
+                  << actual << std::endl;
+        wrong++;
+      }
     }
   }
   if (wrong > 0) {
-    std::cout << "Wrong: " << wrong << std::endl;
+    std::cout << "Wrong: " << rows << "x" << cols << "x" << depth << " : "
+              << wrong << "/" << (rows * cols) << std::endl
+              << std::flush;
+    std::exit(1);
   } else {
     std::cout << "." << std::flush;
   }
@@ -191,43 +198,147 @@
   check_result_f(lhs, rhs, result, m, n, k, -127, -127, scale);
 }
 
-int main() {
-  const std::int32_t min_n = 256;
-  const std::int32_t min_m = 256;
-  const std::int32_t min_k = 256;
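+// Prepares test data, runs the int32-output meta gemm, and checks the result.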
+void test_i32(std::uint8_t* scratch, std::uint8_t* lhs, std::uint8_t* rhs,
+              std::int32_t m, std::int32_t n, std::int32_t k,
+              std::int32_t* result, gemmlowp::WorkersPool* pool,
+              std::int32_t pool_size) {
+  prepare_test_data(lhs, m, k, 11, 13);
+  prepare_test_data(rhs, n, k, 177, 19);
+
+  clear(result, m, n);
+  gemmlowp::meta::multi_thread_gemm_i32(pool, pool_size, scratch, lhs, rhs, m,
+                                        n, k, -127, -127, result);
+  check_result_i32(lhs, rhs, result, m, n, k, -127, -127);
+}
+
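+// The *_suite helpers sweep (m, n, k) from (mi, ni, ki) up to (mx, nx, kx)
+// (exclusive) in steps of (md, nd, kd), running one test per combination.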
+void q_suite(int mi, int ni, int ki, int mx, int nx, int kx, int md, int nd,
+             int kd, std::uint8_t* scratch, std::uint8_t* left,
+             std::uint8_t* right, std::uint8_t* result,
+             gemmlowp::WorkersPool* pool, int t) {
+  for (int m = mi; m < mx; m += md) {
+    for (int n = ni; n < nx; n += nd) {
+      for (int k = ki; k < kx; k += kd) {
+        test(scratch, left, right, m, n, k, result, pool, t);
+      }
+    }
+  }
+  std::cout << std::endl;
+}
+
+void f_suite(int mi, int ni, int ki, int mx, int nx, int kx, int md, int nd,
+             int kd, std::uint8_t* scratch, std::uint8_t* left,
+             std::uint8_t* right, float* result, gemmlowp::WorkersPool* pool,
+             int t) {
+  for (int m = mi; m < mx; m += md) {
+    for (int n = ni; n < nx; n += nd) {
+      for (int k = ki; k < kx; k += kd) {
+        test_f(scratch, left, right, m, n, k, result, pool, t);
+      }
+    }
+  }
+  std::cout << std::endl;
+}
+
+void i32_suite(int mi, int ni, int ki, int mx, int nx, int kx, int md, int nd,
+               int kd, std::uint8_t* scratch, std::uint8_t* left,
+               std::uint8_t* right, std::int32_t* result,
+               gemmlowp::WorkersPool* pool, int t) {
+  for (int m = mi; m < mx; m += md) {
+    for (int n = ni; n < nx; n += nd) {
+      for (int k = ki; k < kx; k += kd) {
+        test_i32(scratch, left, right, m, n, k, result, pool, t);
+      }
+    }
+  }
+  std::cout << std::endl;
+}
+
+int main(int argc, char* argv[]) {
+  bool run_long_test = false;
+
+  if (argc > 1 && strcmp(argv[1], "long") == 0) {
+    run_long_test = true;
+  }
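+  // Passing "long" as the first command-line argument enables the larger
+  // ("Big") sweeps below and repeats the whole suite up to 10 times, with
+  // the thread count ramping from 1 up to 4.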
+
+  const std::int32_t min_n = 1;
+  const std::int32_t min_m = 1;
+  const std::int32_t min_k = 8;
 
   const std::int32_t max_n = 1024;
   const std::int32_t max_m = 1024;
-  const std::int32_t max_k = 512;
+  const std::int32_t max_k = 2048;
 
   std::uint8_t* left = new std::uint8_t[max_m * max_k];
   std::uint8_t* right = new std::uint8_t[max_n * max_k];
   std::uint8_t* result = new std::uint8_t[max_m * max_n];
   float* result_float = new float[max_m * max_n];
+  std::int32_t* result_i32 = new std::int32_t[max_m * max_n];
   std::uint8_t* scratch = new std::uint8_t[1024 * 1024 * 64];
 
   gemmlowp::WorkersPool pool;
-  pool.CreateWorkers(3);
 
-  std::cout << "Quantized 8 bit." << std::endl << std::flush;
+  int max_repetitions = run_long_test ? 10 : 1;
 
-  for (int m = min_m; m < max_m; m += 128) {
-    for (int n = min_n; n < max_n; n += 128) {
-      for (int k = min_k; k < max_k; k += 13) {
-        test(scratch, left, right, m, n, k, result, &pool, 4);
-      }
+  for (int repetitions = 0; repetitions < max_repetitions; ++repetitions) {
+    int t = std::min(repetitions + 1, 4);
+    std::cout << "Threads: " << t << std::endl << std::flush;
+
+    std::cout << "Quantized 8 bit." << std::endl << std::flush;
+
+    std::cout << "Small." << std::endl << std::flush;
+    q_suite(1, 1, 1, 16, 16, 32, 1, 1, 1, scratch, left, right, result, &pool,
+            t);
+
+    if (run_long_test) {
+      std::cout << "Big." << std::endl << std::flush;
+      q_suite(1, 1, 1, 512, 512, 2048, 111, 111, 111, scratch, left, right,
+              result, &pool, t);
     }
+
+    std::cout << "Gemv." << std::endl << std::flush;
+    q_suite(1, 1, 1, 2, 512, 2048, 1, 111, 111, scratch, left, right, result,
+            &pool, t);
+    q_suite(1, 1, 1, 512, 2, 2048, 111, 1, 111, scratch, left, right, result,
+            &pool, t);
+
+    std::cout << std::endl << "Floats." << std::endl << std::flush;
+
+    std::cout << "Small." << std::endl << std::flush;
+    f_suite(1, 1, 1, 16, 16, 32, 1, 1, 1, scratch, left, right, result_float,
+            &pool, t);
+
+    if (run_long_test) {
+      std::cout << "Big." << std::endl << std::flush;
+      f_suite(1, 1, 1, 512, 512, 2048, 111, 111, 111, scratch, left, right,
+              result_float, &pool, t);
+    }
+
+    std::cout << "Gemv." << std::endl << std::flush;
+    f_suite(1, 1, 1, 2, 512, 2048, 1, 111, 111, scratch, left, right,
+            result_float, &pool, t);
+    f_suite(1, 1, 1, 512, 2, 2048, 111, 1, 111, scratch, left, right,
+            result_float, &pool, t);
+
+    std::cout << std::endl << "Int32." << std::endl << std::flush;
+
+    std::cout << "Small." << std::endl << std::flush;
+    i32_suite(1, 1, 1, 16, 16, 32, 1, 1, 1, scratch, left, right, result_i32,
+              &pool, t);
+
+    if (run_long_test) {
+      std::cout << "Big." << std::endl << std::flush;
+      i32_suite(1, 1, 1, 512, 512, 2048, 111, 111, 111, scratch, left, right,
+                result_i32, &pool, t);
+    }
+
+    std::cout << "Gemv." << std::endl << std::flush;
+    i32_suite(1, 1, 1, 2, 512, 2048, 1, 111, 111, scratch, left, right,
+              result_i32, &pool, t);
+    i32_suite(1, 1, 1, 512, 2, 2048, 111, 1, 111, scratch, left, right,
+              result_i32, &pool, t);
+
+    std::cout << std::endl << std::flush;
   }
 
-  std::cout << std::endl << "Floats." << std::endl << std::flush;
-
-  for (int m = min_m; m < max_m; m += 128) {
-    for (int n = min_n; n < max_n; n += 128) {
-      for (int k = min_k; k < max_k; k += 13) {
-        test_f(scratch, left, right, m, n, k, result_float, &pool, 4);
-      }
-    }
-  }
-
-  std::cout << std::endl << "Done." << std::endl << std::flush;
+  std::cout << "Done." << std::endl << std::flush;
 }
diff --git a/test/test.cc b/test/test.cc
index 9373f2d..fdc7bcc 100644
--- a/test/test.cc
+++ b/test/test.cc
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -15,6 +15,7 @@
 #include "test.h"
 
 #include <unistd.h>
+#include <array>
 #include <cstdint>
 #include <cstdlib>
 #include <ctime>
@@ -34,10 +35,12 @@
 
 void ReferenceEightBitIntGemm(bool transpose_a, bool transpose_b,
                               bool transpose_c, int m, int n, int k,
-                              const uint8_t* a, int32_t a_offset, int lda,
-                              const uint8_t* b, int32_t b_offset, int ldb,
-                              uint8_t* c, int32_t c_offset, int32_t c_mult_int,
-                              int32_t c_shift, int ldc) {
+                              const std::uint8_t* a, std::int32_t a_offset,
+                              int lda, const std::uint8_t* b,
+                              std::int32_t b_offset, int ldb, std::uint8_t* c,
+                              std::int32_t c_offset, std::int32_t c_mult_int,
+                              std::int32_t c_shift, int ldc) {
+  ScopedProfilingLabel("ReferenceEightBitIntGemm");
   assert((c_shift >= 0) && (c_shift <= 32));
 
   assert(a != nullptr);
@@ -77,18 +80,20 @@
 
   for (j = 0; j < n; j++) {
     for (i = 0; i < m; i++) {
-      int32_t total = 0;
+      std::int32_t total = 0;
       for (l = 0; l < k; l++) {
         const int a_index = i * a_i_stride + l * a_l_stride;
-        const uint8_t a_as_byte = a[a_index];
-        const int32_t a_as_int = static_cast<int32_t>(a_as_byte) + a_offset;
+        const std::uint8_t a_as_byte = a[a_index];
+        const std::int32_t a_as_int =
+            static_cast<std::int32_t>(a_as_byte) + a_offset;
         const int b_index = j * b_j_stride + l * b_l_stride;
-        const uint8_t b_as_byte = b[b_index];
-        const int32_t b_as_int = static_cast<int32_t>(b_as_byte) + b_offset;
-        const int32_t mult_as_int = a_as_int * b_as_int;
+        const std::uint8_t b_as_byte = b[b_index];
+        const std::int32_t b_as_int =
+            static_cast<std::int32_t>(b_as_byte) + b_offset;
+        const std::int32_t mult_as_int = a_as_int * b_as_int;
         total += mult_as_int;
       }
-      int32_t output =
+      std::int32_t output =
           (((total + c_offset) * c_mult_int) + kRoundingTerm) >> c_shift;
       if (output > 255) {
         output = 255;
@@ -97,11 +102,16 @@
         output = 0;
       }
       const int c_index = i * c_i_stride + j * c_j_stride;
-      c[c_index] = static_cast<uint8_t>(output);
+      c[c_index] = static_cast<std::uint8_t>(output);
     }
   }
 }
 
+typedef VectorMap<const std::int32_t, VectorShape::Col> OffsetColMap;
+typedef VectorMap<const std::int32_t, VectorShape::Row> OffsetRowMap;
+typedef VectorDup<const std::int32_t, VectorShape::Col> OffsetColDup;
+typedef VectorDup<const std::int32_t, VectorShape::Row> OffsetRowDup;
+
 // *GemmWrapper's allow to wrap various Gemm functions in a uniform
 // interface, so we can use the same testing code to test all of them
 
@@ -118,21 +128,30 @@
   typedef SingleThreadGemmContext Context;
 
   template <MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder>
-  static void Gemm(Context* context,
+  static bool Gemm(Context* context,
                    const MatrixMap<const Scalar, LhsOrder>& lhs,
                    const MatrixMap<const Scalar, RhsOrder>& rhs,
                    MatrixMap<Scalar, ResultOrder>* result, int lhs_offset,
                    int rhs_offset, int result_offset, int result_mult_int,
                    int result_shift) {
-    const OffsetColDup lhs_offset_vector(lhs_offset, lhs.rows());
-    const OffsetRowDup rhs_offset_vector(rhs_offset, rhs.cols());
+    ScopedProfilingLabel("SingleThreadGemmWrapper::Gemm");
+    const int rows = lhs.rows();
+    const int cols = rhs.cols();
+    if (rows < cols) {
+      // SingleThreadGemm is never called with rows < cols.
+      // That case is handled earlier.
+      return false;
+    }
+    const OffsetColDup lhs_offset_vector(lhs_offset, rows);
+    const OffsetRowDup rhs_offset_vector(rhs_offset, cols);
     SingleThreadGemm<typename Kernel::Format, Scalar, Scalar, BitDepthParams,
-                     LhsOrder, RhsOrder, ResultOrder,
-                     OffsetColDup, OffsetRowDup>(
+                     LhsOrder, RhsOrder, ResultOrder, OffsetColDup,
+                     OffsetRowDup>(
         context, Kernel(), lhs, rhs, result, lhs_offset_vector,
         rhs_offset_vector,
         MakeStandardOutputPipeline(result_offset, result_mult_int,
                                    result_shift));
+    return true;
   }
 };
 
@@ -149,21 +168,31 @@
   typedef MultiThreadGemmContext Context;
 
   template <MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder>
-  static void Gemm(Context* context,
+  static bool Gemm(Context* context,
                    const MatrixMap<const Scalar, LhsOrder>& lhs,
                    const MatrixMap<const Scalar, RhsOrder>& rhs,
                    MatrixMap<Scalar, ResultOrder>* result, int lhs_offset,
                    int rhs_offset, int result_offset, int result_mult_int,
                    int result_shift) {
-    const OffsetColDup lhs_offset_vector(lhs_offset, lhs.rows());
-    const OffsetRowDup rhs_offset_vector(rhs_offset, rhs.cols());
+    ScopedProfilingLabel("MultiThreadGemmWrapper::Gemm");
+    context->set_max_num_threads(0);
+    const int rows = lhs.rows();
+    const int cols = rhs.cols();
+    if (rows < cols) {
+      // MultiThreadGemm is never called with rows < cols.
+      // That case is handled earlier.
+      return false;
+    }
+    const OffsetColDup lhs_offset_vector(lhs_offset, rows);
+    const OffsetRowDup rhs_offset_vector(rhs_offset, cols);
     MultiThreadGemm<typename Kernel::Format, Scalar, Scalar, BitDepthParams,
-                    LhsOrder, RhsOrder, ResultOrder,
-                    OffsetColDup, OffsetRowDup>(
+                    LhsOrder, RhsOrder, ResultOrder, OffsetColDup,
+                    OffsetRowDup>(
         context, Kernel(), lhs, rhs, result, lhs_offset_vector,
         rhs_offset_vector,
         MakeStandardOutputPipeline(result_offset, result_mult_int,
                                    result_shift));
+    return true;
   }
 };
 
@@ -176,15 +205,18 @@
   typedef GemmContext Context;
 
   template <MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder>
-  static void Gemm(Context* context,
+  static bool Gemm(Context* context,
                    const MatrixMap<const Scalar, LhsOrder>& lhs,
                    const MatrixMap<const Scalar, RhsOrder>& rhs,
                    MatrixMap<Scalar, ResultOrder>* result, int lhs_offset,
                    int rhs_offset, int result_offset, int result_mult_int,
                    int result_shift) {
-    gemmlowp::Gemm<uint8_t, BitDepthParams, LhsOrder, RhsOrder, ResultOrder>(
-        context, lhs, rhs, result, lhs_offset, rhs_offset, result_offset,
-        result_mult_int, result_shift);
+    ScopedProfilingLabel("PublicGemmWrapper::Gemm");
+    gemmlowp::Gemm<std::uint8_t, BitDepthParams, LhsOrder, RhsOrder,
+                   ResultOrder>(context, lhs, rhs, result, lhs_offset,
+                                rhs_offset, result_offset, result_mult_int,
+                                result_shift);
+    return true;
   }
 };
 
@@ -208,11 +240,12 @@
   typedef void Context;
 
   template <MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder>
-  static void Gemm(Context*, const MatrixMap<const Scalar, LhsOrder>& lhs,
+  static bool Gemm(Context*, const MatrixMap<const Scalar, LhsOrder>& lhs,
                    const MatrixMap<const Scalar, RhsOrder>& rhs,
                    MatrixMap<Scalar, ResultOrder>* result, int lhs_offset,
                    int rhs_offset, int result_offset, int result_mult_int,
                    int result_shift) {
+    ScopedProfilingLabel("EightBitIntGemmWrapper::Gemm");
     const bool transpose_c = ResultOrder == MapOrder::RowMajor;
     const bool transpose_a = LhsOrder == MapOrder::RowMajor;
     const bool transpose_b = RhsOrder == MapOrder::RowMajor;
@@ -221,6 +254,7 @@
         lhs.cols(), lhs.data(), lhs_offset, lhs.stride(), rhs.data(),
         rhs_offset, rhs.stride(), result->data(), result_offset,
         result_mult_int, result_shift, result->stride(), BitDepth);
+    return true;
   }
 };
 
@@ -231,17 +265,19 @@
   static const char* Name() { return "ReferenceEightBitIntGemm"; }
 
   template <MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder>
-  static void Gemm(bool transpose_a, bool transpose_b, bool transpose_c,
+  static bool Gemm(bool transpose_a, bool transpose_b, bool transpose_c,
                    const MatrixMap<const Scalar, LhsOrder>& lhs,
                    const MatrixMap<const Scalar, RhsOrder>& rhs,
                    MatrixMap<Scalar, ResultOrder>* result, int lhs_offset,
                    int rhs_offset, int result_offset, int result_mult_int,
                    int result_shift) {
+    ScopedProfilingLabel("ReferenceEightBitIntGemmWrapper::Gemm");
     ReferenceEightBitIntGemm(transpose_a, transpose_b, transpose_c, lhs.rows(),
                              rhs.cols(), lhs.cols(), lhs.data(), lhs_offset,
                              lhs.stride(), rhs.data(), rhs_offset, rhs.stride(),
                              result->data(), result_offset, result_mult_int,
                              result_shift, result->stride());
+    return true;
   }
 };
 
@@ -268,15 +304,16 @@
   std::vector<int> count_diff_by_pot_slice;
 };
 
-void GetResultStats(const uint8_t* actual, const uint8_t* expected,
+void GetResultStats(const std::uint8_t* actual, const std::uint8_t* expected,
                     size_t count, ResultStats* stats) {
-  std::vector<uint8_t> results;
-  std::vector<int16_t> signed_diffs;
-  std::vector<uint8_t> unsigned_diffs;
-  int64_t signed_diffs_sum = 0;
+  ScopedProfilingLabel("GetResultStats");
+  std::vector<std::uint8_t> results;
+  std::vector<std::int16_t> signed_diffs;
+  std::vector<std::uint8_t> unsigned_diffs;
+  std::int64_t signed_diffs_sum = 0;
   for (size_t i = 0; i < count; i++) {
     results.push_back(actual[i]);
-    int16_t signed_diff = actual[i] - expected[i];
+    std::int16_t signed_diff = actual[i] - expected[i];
     signed_diffs.push_back(signed_diff);
     unsigned_diffs.push_back(std::abs(signed_diff));
     signed_diffs_sum += signed_diff;
@@ -373,9 +410,14 @@
 
   const int result_shift = (result_shift_min + result_shift_max) / 2;
 
-  GemmWrapper::Gemm(context, lhs.const_map(), rhs.const_map(), &result->map(),
-                    lhs_offset, rhs_offset, result_offset, result_mult_int,
-                    result_shift);
+  if (!GemmWrapper::Gemm(context, lhs.const_map(), rhs.const_map(),
+                         &result->map(), lhs_offset, rhs_offset, result_offset,
+                         result_mult_int, result_shift)) {
+    // Internal GEMM functions are not required to handle all cases
+    // (e.g. rows < cols) as these are supposed to have been handled
+    // ahead of them. Their test wrappers return false in that case.
+    return;
+  }
 
   typedef typename ResultType::Scalar Scalar;
   static const MapOrder kLhsOrder = LhsType::kOrder;
@@ -421,25 +463,6 @@
 
   ResultStatsBounds bounds;
 
-  if (BitDepthParams::LhsBitDepth::kBits < 8 ||
-      BitDepthParams::RhsBitDepth::kBits < 8) {
-    // We have very lax requirements on unsigned diff.
-    // We have tighter requirements on signed diff (bias), but only
-    // if the matrix is large enough for things to average out.
-    // For very small sizes, we... basically don't test anything.
-    // The problem is that this test uses unrealistic combinations of
-    // result_mult_int
-    // and result_shift, resulting in potentially wild requantization artifacts
-    // on small GEMMs.
-    int adjust_for_small_sizes = 1000 / (rows * cols);
-    bounds.max_unsigned_diff =
-        std::max(stats.med_val / 2, adjust_for_small_sizes);
-    bounds.med_unsigned_diff =
-        std::max(stats.med_val / 8, adjust_for_small_sizes);
-    bounds.med_signed_diff = std::max(2, adjust_for_small_sizes);
-    bounds.mean_signed_diff = std::max(2, adjust_for_small_sizes);
-  }
-
   // Check results
   const bool good = CheckResultStatsBounds(stats, bounds);
 
@@ -453,8 +476,8 @@
     ReportResultStats(stats, bounds);
 
     int bad_coeffs_printed = 0;
-    for (int c = 0; c < result->cols() && bad_coeffs_printed < 20; c++) {
-      for (int r = 0; r < result->rows() && bad_coeffs_printed < 20; r++) {
+    for (int c = 0; c < result->cols() && bad_coeffs_printed < 200; c++) {
+      for (int r = 0; r < result->rows() && bad_coeffs_printed < 200; r++) {
         if (ref_result(r, c) != (*result)(r, c)) {
           printf("bad coeff: at (%d, %d), expected %d, got %d\n", r, c,
                  ref_result(r, c), (*result)(r, c));
@@ -487,11 +510,12 @@
                int cols, WhatParamsToTest params_to_test) {
   typedef std::uint8_t Scalar;
   typedef Matrix<Scalar, LhsOrder> LhsType;
+  using BitDepthParams = typename GemmWrapper::BitDepthParams;
   LhsType lhs(rows, depth);
-  MakeRandom(&lhs, 8);
+  MakeRandom<typename BitDepthParams::LhsRange>(&lhs);
   typedef Matrix<Scalar, RhsOrder> RhsType;
   RhsType rhs(depth, cols);
-  MakeRandom(&rhs, 8);
+  MakeRandom<typename BitDepthParams::RhsRange>(&rhs);
   typedef Matrix<Scalar, ResultOrder> ResultType;
   ResultType result(rows, cols);
   MakeZero(&result);
@@ -592,13 +616,13 @@
   test_gemm<GemmWrapper>(context, 5, 7, 3, WhatParamsToTest::All,
                          WhatOrdersToTest::OnlyRCC);
   test_gemm<GemmWrapper>(context, 8, 8, 8, WhatParamsToTest::All,
-                         WhatOrdersToTest::OnlyRCC);
+                         WhatOrdersToTest::All);
   test_gemm<GemmWrapper>(context, 16, 16, 16, WhatParamsToTest::All,
                          WhatOrdersToTest::OnlyRCC);
   test_gemm<GemmWrapper>(context, 32, 32, 32, WhatParamsToTest::All,
                          WhatOrdersToTest::OnlyRCC);
   test_gemm<GemmWrapper>(context, 64, 64, 64, WhatParamsToTest::All,
-                         WhatOrdersToTest::OnlyRCC);
+                         WhatOrdersToTest::All);
   test_gemm<GemmWrapper>(context, 128, 128, 128, WhatParamsToTest::All,
                          WhatOrdersToTest::OnlyRCC);
 
@@ -697,7 +721,7 @@
     case eight_bit_int_gemm::BitDepthSetting::A8B8:
       return "Lhs: 8 bit, Rhs: 8 bit";
     case eight_bit_int_gemm::BitDepthSetting::A5B7:
-      return "Lhs: 7 bit, Rhs: 5 bit";
+      return "(legacy, no longer requantizing) Lhs: 7 bit, Rhs: 5 bit";
     default:
       abort();
       return nullptr;
@@ -713,51 +737,40 @@
   const int k = 12;
 
   // 12 x 2, columnwise.
-  const uint8_t a_data[] = {
-     0,  0,  0,  0,  0,  0, 0, 0, 0, 255, 255, 255,
-    64, 64, 64, 64, 64, 64, 0, 0, 0, 255, 255, 255
-  };
+  const std::uint8_t a_data[] = {0,  0,   0,   0,   0,  0,   0,   0,
+                                 0,  255, 255, 255, 64, 64,  64,  64,
+                                 64, 64,  0,   0,   0,  255, 255, 255};
   const int lda = k;
   int a_offset[] = {0, -64};
   MatrixMap<const std::uint8_t, MapOrder::RowMajor> lhs(a_data, m, k, lda);
   const OffsetColMap lhs_offset(a_offset, m);
 
   // 12 x 9, columnwise.
-  const uint8_t b_data[] = {
-      0,   0,   0,   0,   0,   0,   0,   0,   0, 255, 255, 255,
-      0,   0,   0,   0,   0,   0, 255, 255, 255,   0,   0,   0,
-      0,   0,   0, 127, 127, 127,   0,   0,   0, 127, 127, 127,
-      0,   0,   0, 255, 255, 255,   0,   0,   0,   0,   0,   0,
-    255, 255, 255,   0,   0,   0,   0,   0,   0,   0,   0,   0,
-      0,   0,   0, 127, 127, 127,   0,   0,   0, 127, 127, 127,
-      0,   0,   0,   0,   0,   0, 127, 127, 127, 127, 127, 127,
-      0,   0,   0,   0,   0,   0, 127, 127, 127, 127, 127, 127,
-      0,   0,   0, 127, 127, 127, 127, 127, 127, 127, 127, 127
-  };
+  const std::uint8_t b_data[] = {
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   255, 255, 255, 0,   0,
+      0,   0,   0,   0,   255, 255, 255, 0,   0,   0,   0,   0,   0,   127,
+      127, 127, 0,   0,   0,   127, 127, 127, 0,   0,   0,   255, 255, 255,
+      0,   0,   0,   0,   0,   0,   255, 255, 255, 0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   127, 127, 127, 0,   0,   0,   127,
+      127, 127, 0,   0,   0,   0,   0,   0,   127, 127, 127, 127, 127, 127,
+      0,   0,   0,   0,   0,   0,   127, 127, 127, 127, 127, 127, 0,   0,
+      0,   127, 127, 127, 127, 127, 127, 127, 127, 127};
   const int ldb = k;
   int b_offset = -127;
   MatrixMap<const std::uint8_t, MapOrder::ColMajor> rhs(b_data, k, n, ldb);
   const OffsetRowDup rhs_offset(b_offset, rhs.cols());
 
   // 2 x 9, columnwise.
-  const uint8_t expected_c_data[] = {
-    255, 255,
-      0,   0,
-    127, 159,
-      0,  64,
-      0,  64,
-    127, 159,
-    127, 127,
-    127, 127,
-    127, 127
-  };
+  const std::uint8_t expected_c_data[] = {255, 255, 0,   0,   127, 159,
+                                          0,   64,  0,   64,  127, 159,
+                                          127, 127, 127, 127, 127, 127};
   const int ldc = m;
   int c_offset[] = {97155, 97346};
   int c_mult_int[] = {2741, 2741};
   const int c_shift = 21;
 
   const int c_count = m * n;
-  std::unique_ptr<uint8_t[]> output_data(new uint8_t[c_count]);
+  std::unique_ptr<std::uint8_t[]> output_data(new std::uint8_t[c_count]);
   MatrixMap<std::uint8_t, MapOrder::ColMajor> result(output_data.get(), m, n,
                                                      ldc);
   const OffsetColMap result_offset(c_offset, m);
@@ -767,7 +780,8 @@
   GemmContext gemm_context;
   auto output_pipeline = MakeStandardOutputPipeline<VectorShape::Col>(
       result_offset, result_mult_int, result_shift);
-  GemmWithOutputPipelinePC<uint8_t, uint8_t, DefaultL8R8BitDepthParams>(
+  GemmWithOutputPipelinePC<std::uint8_t, std::uint8_t,
+                           DefaultL8R8BitDepthParams>(
       &gemm_context, lhs, rhs, &result, lhs_offset, rhs_offset,
       output_pipeline);
 
@@ -793,191 +807,167 @@
   const int k = 27;
 
   // 27 x 22, column-wise.
-  const uint8_t a_data[] = {
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0, 127, 127, 127, 255, 255, 255,
-       127, 127, 127,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0, 127, 127, 127,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0, 127, 127, 127,  0,  0,  0,
-    51, 51, 51,  51,  51,  51, 51, 51, 51,   0,   0,   0, 255, 255, 255,
-         0,   0,   0, 51, 51, 51,  51,  51,  51, 51, 51, 51,
-    51, 51, 51,   0,   0,   0, 51, 51, 51,  51,  51,  51, 255, 255, 255,
-        51,  51,  51, 51, 51, 51,   0,   0,   0, 51, 51, 51,
-     0,  0,  0,  64,  64,  64,  0,  0,  0,  64,  64,  64, 255, 255, 255,
-        64,  64,  64,  0,  0,  0,  64,  64,  64,  0,  0,  0,
-    36, 36, 36,   0,   0,   0, 36, 36, 36,   0,   0,   0, 255, 255, 255,
-         0,   0,   0, 36, 36, 36,   0,   0,   0, 36, 36, 36,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
-     0,  0,  0,   0,   0,   0,  0,  0,  0,   0,   0,   0, 255, 255, 255,
-         0,   0,   0,  0,  0,  0,   0,   0,   0,  0,  0,  0,
+  const std::uint8_t a_data[] = {
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   255, 255, 255,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   127, 127, 127, 255, 255, 255, 127, 127, 127,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   127, 127, 127,
+      0,   0,   0,   0,   0,   0,   255, 255, 255, 0,   0,   0,   0,   0,   0,
+      127, 127, 127, 0,   0,   0,   51,  51,  51,  51,  51,  51,  51,  51,  51,
+      0,   0,   0,   255, 255, 255, 0,   0,   0,   51,  51,  51,  51,  51,  51,
+      51,  51,  51,  51,  51,  51,  0,   0,   0,   51,  51,  51,  51,  51,  51,
+      255, 255, 255, 51,  51,  51,  51,  51,  51,  0,   0,   0,   51,  51,  51,
+      0,   0,   0,   64,  64,  64,  0,   0,   0,   64,  64,  64,  255, 255, 255,
+      64,  64,  64,  0,   0,   0,   64,  64,  64,  0,   0,   0,   36,  36,  36,
+      0,   0,   0,   36,  36,  36,  0,   0,   0,   255, 255, 255, 0,   0,   0,
+      36,  36,  36,  0,   0,   0,   36,  36,  36,  0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   255, 255, 255, 0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   255, 255, 255, 0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      255, 255, 255, 0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   255, 255, 255,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   255, 255, 255, 0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   255, 255, 255, 0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   255, 255, 255, 0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      255, 255, 255, 0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   255, 255, 255,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   255, 255, 255, 0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   255, 255, 255, 0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   255, 255, 255, 0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      255, 255, 255, 0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   255, 255, 255,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,   255, 255, 255, 0,   0,   0,
+      0,   0,   0,   0,   0,   0,   0,   0,   0,
   };
   const int lda = k;
-  int a_offset[] = {
-      0, 0, 0, -51, -51, 0, -36, 0, 0, 0,
-      0, 0, 0,   0,   0, 0,   0, 0, 0, 0,
-      0, 0
-  };
+  int a_offset[] = {0, 0, 0, -51, -51, 0, -36, 0, 0, 0, 0,
+                    0, 0, 0, 0,   0,   0, 0,   0, 0, 0, 0};
   MatrixMap<const std::uint8_t, MapOrder::RowMajor> lhs(a_data, m, k, lda);
   const OffsetColMap lhs_offset(a_offset, m);
 
   // 27 x 25, column-wise.
-  const uint8_t b_data[] = {
-    127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 119, 119, 119,
-         119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119, 119,
-    127, 127, 127, 127, 127, 127, 127, 127, 127, 119, 119, 119, 119, 119, 119,
-         119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-    127, 127, 127, 127, 127, 127, 127, 127, 127, 119, 119, 119, 119, 119, 119,
-         119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-    127, 127, 127, 127, 127, 127, 127, 127, 127, 119, 119, 119, 119, 119, 119,
-         119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-    127, 127, 127, 127, 127, 127, 127, 127, 127, 119, 119, 119, 119, 119, 119,
-         127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127,
-    127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127, 119, 119, 119,
-         119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119, 119,
-    119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-         119, 119, 119, 119, 119, 119, 119, 119, 119, 136, 136, 136,
-    119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-         119, 119, 119, 119, 119, 119, 136, 136, 136, 119, 119, 119,
-    119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-         119, 119, 119, 136, 136, 136, 119, 119, 119, 119, 119, 119,
-    119, 119, 119, 119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119, 119,
-         127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127,
-    127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127, 119, 119, 119,
-         119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119, 119,
-    119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-         136, 136, 136, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-    119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 136, 136, 136,
-         119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-    119, 119, 119, 119, 119, 119, 119, 119, 119, 136, 136, 136, 119, 119, 119,
-         119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-    119, 119, 119, 119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119, 119,
-         127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127,
-    127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127, 119, 119, 119,
-         119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119, 119,
-    119, 119, 119, 119, 119, 119, 136, 136, 136, 119, 119, 119, 119, 119, 119,
-         119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-    119, 119, 119, 136, 136, 136, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-         119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-    136, 136, 136, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-         119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-    119, 119, 119, 119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119, 119,
-         127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127,
-    127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127, 119, 119, 119,
-         119, 119, 119, 127, 127, 127, 127, 127, 127, 127, 127, 127,
-    119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-         119, 119, 119, 127, 127, 127, 127, 127, 127, 127, 127, 127,
-    119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-         119, 119, 119, 127, 127, 127, 127, 127, 127, 127, 127, 127,
-    119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
-         119, 119, 119, 127, 127, 127, 127, 127, 127, 127, 127, 127,
-    119, 119, 119, 119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119, 119,
-         127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127
-  };
+  const std::uint8_t b_data[] = {
+      127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 127, 119, 119,
+      119, 119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119, 119, 127,
+      127, 127, 127, 127, 127, 127, 127, 127, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 127, 127,
+      127, 127, 127, 127, 127, 127, 127, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 127, 127, 127,
+      127, 127, 127, 127, 127, 127, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 127, 127, 127, 127,
+      127, 127, 127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127,
+      119, 119, 119, 119, 119, 119, 127, 127, 127, 127, 127, 127, 119, 119,
+      119, 119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119, 119, 127,
+      127, 127, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 136, 136, 136, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      136, 136, 136, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 136, 136, 136, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 127, 127, 127,
+      119, 119, 119, 119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119,
+      119, 127, 127, 127, 127, 127, 127, 119, 119, 119, 119, 119, 119, 127,
+      127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 136, 136, 136, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      136, 136, 136, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 136, 136, 136, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119,
+      119, 127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127, 127,
+      127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127, 119, 119, 119,
+      119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 136, 136, 136, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      136, 136, 136, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 136, 136, 136, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127, 119,
+      119, 119, 119, 119, 119, 127, 127, 127, 127, 127, 127, 119, 119, 119,
+      119, 119, 119, 127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127,
+      127, 127, 127, 127, 127, 127, 127, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 127, 127, 127,
+      127, 127, 127, 127, 127, 127, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 127, 127, 127, 127,
+      127, 127, 127, 127, 127, 119, 119, 119, 119, 119, 119, 119, 119, 119,
+      119, 119, 119, 119, 119, 119, 119, 119, 119, 127, 127, 127, 127, 127,
+      127, 127, 127, 127, 119, 119, 119, 119, 119, 119, 127, 127, 127, 119,
+      119, 119, 119, 119, 119, 127, 127, 127, 127, 127, 127, 127, 127, 127,
+      127, 127, 127};
   const int ldb = k;
   int b_offset = -127;
   MatrixMap<const std::uint8_t, MapOrder::ColMajor> rhs(b_data, k, n, ldb);
   const OffsetRowDup rhs_offset(b_offset, rhs.cols());
 
   // 22 x 25, column-wise.
-  const uint8_t expected_c_data[] = {
-      7,  37,  37,  67,  67,  39,  79,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,   7,  37,  87,  67,  23,  91,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,   7,  37,  87,  67,  23,  91,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,   7,  37,  87,  67,  23,  91,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,  37,  37,  67,  67,  39,  79,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,  37,   7,  67,  87,  23,  91,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,   7,   7,  87,  87,   7, 103,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,   7,  71,  87,  45,  41,  77,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,   7,   7,  87,  87,   7, 103,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,  37,   7,  67,  87,  23,  91,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,  37,   7,  67,  87,  23,  91,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,  71,   7,  45,  87,  41,  77,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-    255, 135, 135, 255, 255, 143, 255, 255, 255, 255, 255, 255, 255, 255, 255,
-         255, 255, 255, 255, 255, 255, 255,
-      7,  71,   7,  45,  87,  41,  77,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,  37,   7,  67,  87,  23,  91,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,  37,   7,  67,  87,  23,  91,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,   7,   7,  87,  87,   7, 103,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,   7,  71,  87,  45,  41,  77,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,   7,   7,  87,  87,   7, 103,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,  37,   7,  67,  87,  23,  91,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,  37,  37,  67,  67,  39,  79,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,   7,  37,  87,  67,  23,  91,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,   7,  37,  87,  67,  23,  91,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,   7,  37,  87,  67,  23,  91,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-      7,  37,  37,  67,  67,  39,  79,   7,   7,   7,   7,   7,   7,   7,   7,
-           7,   7,   7,   7,   7,   7,   7,
-     99,  99,  99,  99,  99,  99,  99,  99,  99,  99,  99,  99,  99,  99,  99,
-          99,  99,  99,  99,  99,  99,  99,
-    111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111,
-         111, 111, 111, 111, 111, 111, 111,
+  const std::uint8_t expected_c_data[] = {
+      7,   37,  37,  67,  67,  39,  79,  7,   7,   7,   7,   7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   37,  87,  67,  23,  91,  7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,
+      7,   37,  87,  67,  23,  91,  7,   7,   7,   7,   7,   7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   37,  87,  67,  23,  91,  7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   37,
+      37,  67,  67,  39,  79,  7,   7,   7,   7,   7,   7,   7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   37,  7,   67,  87,  23,  91,  7,   7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,
+      87,  87,  7,   103, 7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   71,  87,  45,  41,  77,  7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   87,
+      87,  7,   103, 7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,
+      7,   7,   7,   7,   37,  7,   67,  87,  23,  91,  7,   7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   37,  7,   67,  87,
+      23,  91,  7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,
+      7,   7,   7,   71,  7,   45,  87,  41,  77,  7,   7,   7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   255, 135, 135, 255, 255, 143,
+      255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
+      255, 7,   71,  7,   45,  87,  41,  77,  7,   7,   7,   7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   37,  7,   67,  87,  23,  91,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,
+      7,   37,  7,   67,  87,  23,  91,  7,   7,   7,   7,   7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   87,  87,  7,   103, 7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,
+      7,   71,  87,  45,  41,  77,  7,   7,   7,   7,   7,   7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   87,  87,  7,   103, 7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   37,
+      7,   67,  87,  23,  91,  7,   7,   7,   7,   7,   7,   7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   37,  37,  67,  67,  39,  79,  7,   7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   37,
+      87,  67,  23,  91,  7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   37,  87,  67,  23,  91,  7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   37,  87,
+      67,  23,  91,  7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   7,
+      7,   7,   7,   7,   37,  37,  67,  67,  39,  79,  7,   7,   7,   7,   7,
+      7,   7,   7,   7,   7,   7,   7,   7,   7,   7,   99,  99,  99,  99,  99,
+      99,  99,  99,  99,  99,  99,  99,  99,  99,  99,  99,  99,  99,  99,  99,
+      99,  99,  111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111,
+      111, 111, 111, 111, 111, 111, 111, 111, 111,
   };
   const int ldc = m;
   int c_offset[] = {
-      6477, 12954, 12954, 7793, 7793, 12954, 9282, 6477, 6477, 6477,
-      6477,  6477,  6477, 6477, 6477,  6477, 6477, 6477, 6477, 6477,
-      6477,  6477,
+      6477, 12954, 12954, 7793, 7793, 12954, 9282, 6477, 6477, 6477, 6477,
+      6477, 6477,  6477,  6477, 6477, 6477,  6477, 6477, 6477, 6477, 6477,
   };
   int c_mult_int[] = {
-      41121, 20560, 20560, 34267, 34267, 21937, 28784, 41121, 41121, 41121,
-      41121, 41121, 41121, 41121, 41121, 41121, 41121, 41121, 41121, 41121,
-      41121, 41121,
+      41121, 20560, 20560, 34267, 34267, 21937, 28784, 41121,
+      41121, 41121, 41121, 41121, 41121, 41121, 41121, 41121,
+      41121, 41121, 41121, 41121, 41121, 41121,
   };
   const int c_shift = 21;
 
   const int c_count = m * n;
-  std::unique_ptr<uint8_t[]> output_data(new uint8_t[c_count]);
+  std::unique_ptr<std::uint8_t[]> output_data(new std::uint8_t[c_count]);
   MatrixMap<std::uint8_t, MapOrder::ColMajor> result(output_data.get(), m, n,
                                                      ldc);
   const OffsetColMap result_offset(c_offset, m);
@@ -987,7 +977,8 @@
   GemmContext gemm_context;
   auto output_pipeline = MakeStandardOutputPipeline<VectorShape::Col>(
       result_offset, result_mult_int, result_shift);
-  GemmWithOutputPipelinePC<uint8_t, uint8_t, DefaultL8R8BitDepthParams>(
+  GemmWithOutputPipelinePC<std::uint8_t, std::uint8_t,
+                           DefaultL8R8BitDepthParams>(
       &gemm_context, lhs, rhs, &result, lhs_offset, rhs_offset,
       output_pipeline);
 
@@ -1002,6 +993,119 @@
   Check(good);
 }
 
+// Multithreading only activates when the result has more than 16 rows, and also
+// (result rows) * (result cols) * depth >= 2 x 65 x 1024.  Size was selected
+// to run in 3 threads.
+//
+// Based on the following floating point data:
+//   LHS: all zeros except 10.0, 20.0 at the beginning of first 16 rows;
+//     1.0, 2.0 at the beginning of next 16 rows; 0.1, 0.2 in next 16 rows;
+//     0.01, 0.02 in last 16 rows.
+//   RHS: all zeros except 1.0 in (0, 0) and 2.0 in (1, 0).
+//   Varying boundaries were used for each 16 rows block of LHS, to test for
+//     correct indexing into offsets.
+//   Expected result: all zeros, except 50.0 at the beginning of first 16 rows;
+//     5.0 at the beginning of next 16 rows; 0.5 in next 16 rows; 0.05 in last
+//     16 rows.
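+//   Sanity check on the activation condition quoted above: with m = 64,
+//   n = 20, k = 160 we have 64 > 16 rows and
+//   64 * 20 * 160 = 204800 >= 2 * 65 * 1024 = 133120.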
+void TestMultithreadedPerChannelQuantization() {
+  const int m = 64;
+  const int n = 20;
+  const int k = 160;
+
+  // LHS, m x k.
+  const std::array<std::int32_t, 4> lhs_offsets_terse{{
+      0, -51, -85, -109,
+  }};
+  assert(lhs_offsets_terse.size() * 16 == m);
+  const std::array<std::uint8_t, 4> lhs_first_el{{
+      128, 153, 170, 182,
+  }};
+  assert(lhs_first_el.size() * 16 == m);
+
+  // lhs_first_el at (i, 0) and 255 at (i, 1), other values are all -offset.
+  std::vector<std::uint8_t> a_data(m * k, 0);
+  for (int i = 0; i < m; ++i) {
+    a_data[i * k] = lhs_first_el[i / 16];
+    a_data[i * k + 1] = 255;
+    for (int j = 2; j < k; ++j) {
+      a_data[i * k + j] = std::uint8_t(-lhs_offsets_terse[i / 16]);
+    }
+  }
+
+  const int lda = k;
+  // Given values at [i / 16].
+  std::vector<std::int32_t> a_offset(m, 0);
+  for (int i = 0; i < m; ++i) {
+    a_offset[i] = lhs_offsets_terse[i / 16];
+  }
+
+  MatrixMap<const std::uint8_t, MapOrder::RowMajor> lhs(&a_data[0], m, k, lda);
+  const OffsetColMap lhs_offset(&a_offset[0], m);
+
+  // RHS, k x n.
+  // All zeros, except 128 at (0, 0) and 255 at (1, 0).
+  std::vector<std::uint8_t> b_data(k * n, 0);
+  b_data[0] = 128;
+  b_data[1] = 255;
+
+  const int ldb = k;
+  std::int32_t b_offset = 0;
+  MatrixMap<const std::uint8_t, MapOrder::ColMajor> rhs(&b_data[0], k, n, ldb);
+  const OffsetRowDup rhs_offset(b_offset, rhs.cols());
+
+  // Result, m x n.
+  // All zeros, except given values at (i / 16, 0).
+  const std::array<std::uint8_t, 4> expected_c_terse{{
+      142, 159, 182, 213,
+  }};
+  assert(expected_c_terse.size() * 16 == m);
+  std::vector<std::uint8_t> expected_c_data(m * n, 0);
+  for (int i = 0; i < m; ++i) {
+    expected_c_data[i] = expected_c_terse[i / 16];
+  }
+
+  const int ldc = m;
+  // All zeros.
+  std::vector<std::int32_t> c_offset(m, 0);
+  // Given values at [i / 16].
+  const std::array<std::int32_t, 4> c_mult_int_terse{{
+      3655, 5140, 7049, 9595,
+  }};
+  assert(c_mult_int_terse.size() * 16 == m);
+  std::vector<std::int32_t> c_mult_int(m);
+  for (int i = 0; i < m; ++i) {
+    c_mult_int[i] = c_mult_int_terse[i / 16];
+  }
+
+  const int c_shift = 21;
+
+  const int c_count = m * n;
+  std::unique_ptr<std::uint8_t[]> output_data(new std::uint8_t[c_count]);
+  MatrixMap<std::uint8_t, MapOrder::ColMajor> result(output_data.get(), m, n,
+                                                     ldc);
+  const OffsetColMap result_offset(&c_offset[0], m);
+  const OffsetColMap result_mult_int(&c_mult_int[0], m);
+  const int result_shift = c_shift;
+
+  GemmContext gemm_context;
+  auto output_pipeline = MakeStandardOutputPipeline<VectorShape::Col>(
+      result_offset, result_mult_int, result_shift);
+  GemmWithOutputPipelinePC<std::uint8_t, std::uint8_t,
+                           DefaultL8R8BitDepthParams>(
+      &gemm_context, lhs, rhs, &result, lhs_offset, rhs_offset,
+      output_pipeline);
+
+  ResultStats stats;
+  GetResultStats(output_data.get(), &expected_c_data[0], c_count, &stats);
+
+  ResultStatsBounds bounds;
+  const bool good = CheckResultStatsBounds(stats, bounds);
+  printf("TestMultithreadedPerChannelQuantization: %s\n",
+         good ? "PASS" : "FAIL");
+  ReportResultStats(stats, bounds);
+  Check(good);
+}
+
 // Runs a small set of hand-calculated data through the implementation.
 void TestWithSmallData() {
   const int m = 4;
@@ -1011,11 +1115,11 @@
   // |  7 | 10 | 13 | 16 |
   // |  8 | 11 | 14 | 17 |
   // |  9 | 12 | 15 | 18 |
-  const uint8_t a_data[] = {7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18};
+  const std::uint8_t a_data[] = {7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18};
   // Matrix B (RHS) is:
   // |  1 |  3 |  5 |
   // |  2 |  4 |  6 |
-  const uint8_t b_data[] = {1, 2, 3, 4, 5, 6};
+  const std::uint8_t b_data[] = {1, 2, 3, 4, 5, 6};
   // Here are the results we expect, from hand calculations:
   // (1 * 7) + (3 * 8) + (5 * 9) = 76
   // (2 * 7) + (4 * 8) + (6 * 9) = 100
@@ -1028,10 +1132,10 @@
   // That means matrix C should be:
   // |  76 | 103 | 130 | 157 |
   // | 100 | 136 | 172 | 208 |
-  const uint8_t expected_data[] = {76, 100, 103, 136, 130, 172, 157, 208};
+  const std::uint8_t expected_data[] = {76, 100, 103, 136, 130, 172, 157, 208};
 
   const int c_count = m * n;
-  std::unique_ptr<uint8_t[]> output_data(new uint8_t[c_count]);
+  std::unique_ptr<std::uint8_t[]> output_data(new std::uint8_t[c_count]);
 
   const bool is_a_transposed = true;
   const bool is_b_transposed = true;
@@ -1066,7 +1170,8 @@
 // captured from an actual neural network run.
 void TestWithRealData(eight_bit_int_gemm::BitDepthSetting BitDepth,
                       int tolerance_median, int tolerance_max) {
-  std::unique_ptr<uint8_t[]> output_data(new uint8_t[test_data::c_count]);
+  std::unique_ptr<std::uint8_t[]> output_data(
+      new std::uint8_t[test_data::c_count]);
   gemmlowp::eight_bit_int_gemm::EightBitIntGemm(
       test_data::is_a_transposed, test_data::is_b_transposed,
       test_data::is_c_transposed, test_data::m, test_data::n, test_data::k,
@@ -1093,14 +1198,14 @@
   Check(good);
 }
 
-template <MapOrder ResultOrder>
+template <typename BitDepthParams, MapOrder ResultOrder>
 void TestOutputStages(int rows, int depth, int cols, int result_offset,
                       int result_mult_int, int result_shift) {
   Matrix<std::uint8_t, MapOrder::RowMajor> lhs(rows, depth);
   Matrix<std::uint8_t, MapOrder::ColMajor> rhs(depth, cols);
   Matrix<std::int32_t, ResultOrder> result_raw_int32(rows, cols);
-  MakeRandom(&lhs, 8);
-  MakeRandom(&rhs, 8);
+  MakeRandom<typename BitDepthParams::LhsRange>(&lhs);
+  MakeRandom<typename BitDepthParams::RhsRange>(&rhs);
   const int lhs_offset = 12;
   const int rhs_offset = -34;
 
@@ -1137,19 +1242,17 @@
       &context, lhs.const_map(), rhs.const_map(), &result_quantized_down_int32,
       lhs_offset, rhs_offset, quantize_down_pipeline);
 
-  std::uint64_t sum = 0;
+  std::int64_t sum = 0;
   for (int r = 0; r < rows; r++) {
     for (int c = 0; c < cols; c++) {
       std::int32_t raw = result_raw_int32(r, c);
-      const std::int32_t rounding =
-          (result_shift < 1) ? 0 : (1 << (result_shift - 1));
-      std::int32_t expected =
-          ((raw + result_offset) * result_mult_int + rounding) >> result_shift;
+      std::int32_t expected = RoundingDivideByPOT(
+          (raw + result_offset) * result_mult_int, result_shift);
       Check(expected == result_quantized_down_int32(r, c));
       sum += expected;
     }
   }
-  std::uint64_t avg = sum / (rows * cols);
+  std::int64_t avg = sum / (rows * cols);
   // Test that the average quantized-down value falls reasonably in the
   // middle of the [0..255] range. Otherwise, the multiplier / shift need to be
   // adjusted.
@@ -1177,8 +1280,10 @@
 
   // Test a bias-addition with row-vector
   std::vector<std::int32_t> row_vector_data(cols);
+  std::uniform_int_distribution<std::int32_t> uniform_minus_500_plus_500(-500,
+                                                                         500);
   for (int i = 0; i < cols; i++) {
-    row_vector_data[i] = (Random() % 1000) - 500;
+    row_vector_data[i] = uniform_minus_500_plus_500(RandomEngine());
   }
   typedef VectorMap<std::int32_t, VectorShape::Row> RowVectorMap;
   RowVectorMap row_vector_map(row_vector_data.data(), cols);
@@ -1199,7 +1304,7 @@
   // Test a bias-addition with column-vector
   std::vector<std::int32_t> col_vector_data(rows);
   for (int i = 0; i < rows; i++) {
-    col_vector_data[i] = (Random() % 1000) - 500;
+    col_vector_data[i] = uniform_minus_500_plus_500(RandomEngine());
   }
   typedef VectorMap<std::int32_t, VectorShape::Col> ColVectorMap;
   ColVectorMap col_vector_map(col_vector_data.data(), rows);
@@ -1301,70 +1406,129 @@
       bias_clamp_quantize_cast_pipeline);
   for (int r = 0; r < rows; r++) {
     for (int c = 0; c < cols; c++) {
-      const std::int32_t rounding =
-          (result_shift < 1) ? 0 : (1 << (result_shift - 1));
-      std::int32_t quantized =
-          ((result_biased_clamped(r, c) + result_offset) * result_mult_int +
-           rounding) >>
-          result_shift;
+      std::int32_t quantized = RoundingDivideByPOT(
+          (result_biased_clamped(r, c) + result_offset) * result_mult_int,
+          result_shift);
       std::uint8_t expected = std::min(std::max(quantized, 0), 255);
       Check(expected == result_biased_clamped_quantized_casted(r, c));
     }
   }
 
+  // Test a pipeline with the fixed-point-multiplier variant stage for
+  // quantizing down the 32-bit accumulators.
+  //
+  // First, figure out appropriate fixedpoint multiplier and shift values.
+  Check(result_mult_int > 0);
+  Check(result_shift > 0);
+  std::int32_t result_fixedpoint_multiplier = result_mult_int;
+  std::int32_t result_fixedpoint_shift = result_shift - 31;
+  while (result_fixedpoint_multiplier < (1 << 30)) {
+    result_fixedpoint_multiplier <<= 1;
+    result_fixedpoint_shift++;
+  }
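+  // For illustration (hypothetical inputs): result_mult_int = 3 and
+  // result_shift = 9 would give result_fixedpoint_multiplier = 3 << 29 =
+  // 1610612736 and result_fixedpoint_shift = (9 - 31) + 29 = 7, so that
+  // SaturatingRoundingDoublingHighMul scales by roughly 3*2^29 / 2^31 = 3/4
+  // and the final divide by 2^7 recovers the original factor of 3 / 2^9.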
+  Check(result_fixedpoint_shift >= 0);
+  // Now test OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint
+  OutputStageQuantizeDownInt32ToUint8ScaleByFixedPoint
+      quantize_down_by_fixedpoint_stage;
+  quantize_down_by_fixedpoint_stage.result_offset_after_shift =
+      static_cast<std::int32_t>(
+          round(static_cast<double>(result_offset * result_mult_int) /
+                (1 << result_shift)));
+  quantize_down_by_fixedpoint_stage.result_fixedpoint_multiplier =
+      result_fixedpoint_multiplier;
+  quantize_down_by_fixedpoint_stage.result_shift = result_fixedpoint_shift;
+  auto quantize_down_by_fixedpoint_pipeline =
+      std::make_tuple(quantize_down_by_fixedpoint_stage);
+  Matrix<std::int32_t, ResultOrder> result_quantized_down_by_fixedpoint_int32(
+      rows, cols);
+  GemmWithOutputPipeline<std::uint8_t, std::int32_t, DefaultL8R8BitDepthParams>(
+      &context, lhs.const_map(), rhs.const_map(),
+      &result_quantized_down_by_fixedpoint_int32, lhs_offset, rhs_offset,
+      quantize_down_by_fixedpoint_pipeline);
+
+  std::vector<std::int32_t> diffs_caused_by_fixedpoint;
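+  // The check below mirrors what the fixed-point stage computes:
+  // SaturatingRoundingDoublingHighMul(a, b) returns (a * b) / 2^31 with
+  // round-to-nearest (saturating only for a == b == int32 min), and
+  // RoundingDivideByPOT then applies the rounding right shift.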
+  for (int r = 0; r < rows; r++) {
+    for (int c = 0; c < cols; c++) {
+      const std::int32_t actual =
+          result_quantized_down_by_fixedpoint_int32(r, c);
+      const std::int32_t raw = result_raw_int32(r, c);
+      const std::int32_t expected =
+          quantize_down_by_fixedpoint_stage.result_offset_after_shift +
+          RoundingDivideByPOT(SaturatingRoundingDoublingHighMul(
+                                  raw, result_fixedpoint_multiplier),
+                              result_fixedpoint_shift);
+      Check(actual == expected);
+    }
+  }
+
+  // Test the variant of the familiar default pipeline consisting of
+  // quantize-down and clamp-and-cast-to-uint8, where we use fixed-point
+  // multipliers for the downscaling.
+  auto quantize_down_by_fixedpoint_and_saturating_cast_pipeline =
+      std::make_tuple(quantize_down_by_fixedpoint_stage, saturating_cast_stage);
+  Matrix<std::uint8_t, ResultOrder>
+      result_quantized_down_by_fixedpoint_saturated_uint8(rows, cols);
+  GemmWithOutputPipeline<std::uint8_t, std::uint8_t, DefaultL8R8BitDepthParams>(
+      &context, lhs.const_map(), rhs.const_map(),
+      &result_quantized_down_by_fixedpoint_saturated_uint8, lhs_offset,
+      rhs_offset, quantize_down_by_fixedpoint_and_saturating_cast_pipeline);
+
+  for (int r = 0; r < rows; r++) {
+    for (int c = 0; c < cols; c++) {
+      std::int32_t quantized = result_quantized_down_by_fixedpoint_int32(r, c);
+      std::uint8_t expected = std::min(std::max(quantized, 0), 255);
+      Check(expected ==
+            result_quantized_down_by_fixedpoint_saturated_uint8(r, c));
+    }
+  }
+
   printf("TestOutputStages: PASS with ResultOrder=%s\n",
          OrderName(ResultOrder));
 }
 
 #ifndef GEMMLOWP_SKIP_EXHAUSTIVE_TESTS
+template <typename BitDepthParams>
 void TestExhaustively() {
   GemmContext context;
 
   // Test the internal GEMM interfaces
-  test_gemm<SingleThreadGemmWrapper<
-      DefaultKernel<KernelFamily::Gemm, DefaultL8R8BitDepthParams>,
-      std::uint8_t, DefaultL8R8BitDepthParams>>(&context);
+  test_gemm<
+      SingleThreadGemmWrapper<DefaultKernel<BitDepthParams>,
+                              std::uint8_t, BitDepthParams>>(&context);
 
-  test_gemm<MultiThreadGemmWrapper<
-      DefaultKernel<KernelFamily::Gemm, DefaultL8R8BitDepthParams>,
-      std::uint8_t, DefaultL8R8BitDepthParams>>(&context);
+  test_gemm<
+      MultiThreadGemmWrapper<DefaultKernel<BitDepthParams>,
+                             std::uint8_t, BitDepthParams>>(&context);
 
   // Test the public GEMM interfaces
-  test_gemm<PublicGemmWrapper<uint8_t, DefaultL8R8BitDepthParams>>(&context);
-
-  test_gemm<EightBitIntGemmWrapper<uint8_t,
-                                   eight_bit_int_gemm::BitDepthSetting::A8B8>>(
-      &context);
+  test_gemm<PublicGemmWrapper<std::uint8_t, BitDepthParams>>(&context);
 
   // Test GEMV cases (internal interfaces)
-  test_gemv<SingleThreadGemmWrapper<
-      DefaultKernel<KernelFamily::Gemv, DefaultL8R8BitDepthParams>,
-      std::uint8_t, DefaultL8R8BitDepthParams>>(&context);
+  test_gemv<
+      SingleThreadGemmWrapper<DefaultKernel<BitDepthParams>,
+                              std::uint8_t, BitDepthParams>>(&context);
 
-  test_gemv<MultiThreadGemmWrapper<
-      DefaultKernel<KernelFamily::Gemv, DefaultL8R8BitDepthParams>,
-      std::uint8_t, DefaultL8R8BitDepthParams>>(&context);
+  test_gemv<
+      MultiThreadGemmWrapper<DefaultKernel<BitDepthParams>,
+                             std::uint8_t, BitDepthParams>>(&context);
 
   // Test GEMV cases (public interfaces)
-  test_gemv<PublicGemmWrapper<uint8_t, DefaultL8R8BitDepthParams>>(&context);
+  test_gemv<PublicGemmWrapper<std::uint8_t, BitDepthParams>>(&context);
+}
 
-  test_gemv<EightBitIntGemmWrapper<uint8_t,
-                                   eight_bit_int_gemm::BitDepthSetting::A8B8>>(
-      &context);
+template <eight_bit_int_gemm::BitDepthSetting BitDepthSetting>
+void TestExhaustivelyEightBitIntGemm() {
+  GemmContext context;
+  test_gemv<EightBitIntGemmWrapper<std::uint8_t, BitDepthSetting>>(&context);
+  test_gemv<EightBitIntGemmWrapper<std::uint8_t, BitDepthSetting>>(&context);
+  test_gemm<EightBitIntGemmWrapper<std::uint8_t, BitDepthSetting>>(&context);
+}
 
-  // Test other bit depths
-  // L7R5
-  test_gemm<SingleThreadGemmWrapper<
-      DefaultKernel<KernelFamily::Gemm, DefaultL7R5BitDepthParams>,
-      std::uint8_t, DefaultL7R5BitDepthParams>>(&context);
-
-  test_gemv<SingleThreadGemmWrapper<
-      DefaultKernel<KernelFamily::Gemv, DefaultL7R5BitDepthParams>,
-      std::uint8_t, DefaultL7R5BitDepthParams>>(&context);
-
-  test_gemm<EightBitIntGemmWrapper<std::uint8_t,
-                                   eight_bit_int_gemm::BitDepthSetting::A5B7>>(
-      &context);
+void TestKernels() {
+  GemmContext context;
 
   // Test specific kernels with various formats,
   // to exercise corner cases, especially in the packing code.
@@ -1407,8 +1571,21 @@
       KernelSideFormat<CellFormat<1, 4, CellOrder::DepthMajor>, 1>,
       KernelSideFormat<CellFormat<4, 4, CellOrder::Diagonal>, 1>>>>(&context);
 }
+
 #endif  // not GEMMLOWP_SKIP_EXHAUSTIVE_TESTS
 
+template <typename BitDepthParams>
+void TestOutputStages() {
+  // Test non-default output pipelines with various combinations of
+  // output stages.
+  TestOutputStages<BitDepthParams, MapOrder::RowMajor>(63, 10, 127, 5, 17, 14);
+  TestOutputStages<BitDepthParams, MapOrder::ColMajor>(63, 10, 127, 5, 17, 14);
+  TestOutputStages<BitDepthParams, MapOrder::RowMajor>(630, 10, 1270, 5, 17,
+                                                       14);
+  TestOutputStages<BitDepthParams, MapOrder::ColMajor>(630, 10, 1270, 5, 17,
+                                                       14);
+}
+
 void test() {
 #ifdef GEMMLOWP_TEST_PROFILE
   RegisterCurrentThreadForProfiling();
@@ -1419,7 +1596,12 @@
   TestWithSmallData();
 
 #ifndef GEMMLOWP_SKIP_EXHAUSTIVE_TESTS
-  TestExhaustively();
+  TestExhaustively<DefaultL8R8BitDepthParams>();
+  TestExhaustively<L8R8WithLhsNonzeroBitDepthParams>();
+  TestExhaustively<DefaultL7R5BitDepthParams>();  // legacy, same as L8R8
+  TestExhaustivelyEightBitIntGemm<eight_bit_int_gemm::BitDepthSetting::A8B8>();
+  TestExhaustivelyEightBitIntGemm<eight_bit_int_gemm::BitDepthSetting::A5B7>();
+  TestKernels();
 #endif
 
   // Run against actual data from a network evaluation.
@@ -1428,12 +1610,13 @@
 
   // Test non-default output pipelines with various combinations of
   // output stages.
-  TestOutputStages<MapOrder::RowMajor>(63, 10, 127, 5, 17, 14);
-  TestOutputStages<MapOrder::ColMajor>(63, 10, 127, 5, 17, 14);
+  TestOutputStages<DefaultL8R8BitDepthParams>();
+  TestOutputStages<L8R8WithLhsNonzeroBitDepthParams>();
 
   // Test per channel quantization.
   TestWithSmallDataPerChannelQuantization();
   TestWithLargeDataPerChannelQuantization();
+  TestMultithreadedPerChannelQuantization();
 #ifdef GEMMLOWP_TEST_PROFILE
   FinishProfiling();
 #endif
diff --git a/test/test.h b/test/test.h
index a5109fb..b6a540d 100644
--- a/test/test.h
+++ b/test/test.h
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -22,20 +22,15 @@
 #include "../profiling/profiler.h"
 #endif
 
-#include <cstdlib>
 #include <cstring>
 #include <iostream>
+#include <random>
 #include <vector>
 
 #include "../public/gemmlowp.h"
 
 namespace gemmlowp {
 
-inline int Random() {
-  // Use ugly old rand() since this doesn't need to be high quality.
-  return rand();
-}
-
 #define GEMMLOWP_STRINGIFY2(x) #x
 #define GEMMLOWP_STRINGIFY(x) GEMMLOWP_STRINGIFY2(x)
 
@@ -97,19 +92,32 @@
   std::vector<Scalar> storage;
 };
 
-template <typename MatrixType>
-void MakeRandom(MatrixType* m, int bits) {
+std::mt19937& RandomEngine() {
+  static std::mt19937 engine;
+  return engine;
+}
+
+int Random() {
+  std::uniform_int_distribution<int> dist(0, std::numeric_limits<int>::max());
+  return dist(RandomEngine());
+}
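+
+// Note that RandomEngine() returns a reference to a single default-seeded
+// std::mt19937, so the pseudo-random test data is reproducible across runs.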
+
+template <typename OperandRange, typename MatrixType>
+void MakeRandom(MatrixType* m) {
+  ScopedProfilingLabel("MakeRandom(matrix)");
   typedef typename MatrixType::Scalar Scalar;
-  const Scalar mask = (1 << bits) - 1;
+  // Draw from an int distribution and cast to Scalar, since
+  // std::uniform_int_distribution is not specified for char-sized types.
+  std::uniform_int_distribution<int> dist(OperandRange::kMinValue,
+                                          OperandRange::kMaxValue);
   for (int c = 0; c < m->cols(); c++) {
     for (int r = 0; r < m->rows(); r++) {
-      (*m)(r, c) = Random() & mask;
+      (*m)(r, c) = static_cast<Scalar>(dist(RandomEngine()));
     }
   }
 }
 
 template <typename MatrixType>
 void MakeConstant(MatrixType* m, typename MatrixType::Scalar val) {
+  ScopedProfilingLabel("MakeConstant(matrix)");
   for (int c = 0; c < m->cols(); c++) {
     for (int r = 0; r < m->rows(); r++) {
       (*m)(r, c) = val;
@@ -119,6 +127,7 @@
 
 template <typename MatrixType>
 void MakeZero(MatrixType* m) {
+  ScopedProfilingLabel("MakeZero(matrix)");
   MakeConstant(m, 0);
 }
 
diff --git a/test/test_allocator.cc b/test/test_allocator.cc
index ded1efe..8a76709 100644
--- a/test/test_allocator.cc
+++ b/test/test_allocator.cc
@@ -1,4 +1,4 @@
-// Copyright 2015 Google Inc. All Rights Reserved.
+// Copyright 2015 The Gemmlowp Authors. All Rights Reserved.
 //
 // Licensed under the Apache License, Version 2.0 (the "License");
 // you may not use this file except in compliance with the License.
@@ -12,8 +12,8 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.
 
-#include "test.h"
 #include "../internal/allocator.h"
+#include "test.h"
 
 namespace gemmlowp {
 
diff --git a/test/test_data.cc b/test/test_data.cc
index 63414dd..3fd2459 100644
--- a/test/test_data.cc
+++ b/test/test_data.cc
@@ -1,3 +1,17 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
 #include "test_data.h"
 
 namespace test_data {
diff --git a/test/test_data.h b/test/test_data.h
index 098897f..fac1f8e 100644
--- a/test/test_data.h
+++ b/test/test_data.h
@@ -1,3 +1,17 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
 #ifndef GEMMLOWP_TEST_TEST_DATA_H_
 #define GEMMLOWP_TEST_TEST_DATA_H_
 
diff --git a/test/test_fixedpoint.cc b/test/test_fixedpoint.cc
index 445fc7a..da222f0 100644
--- a/test/test_fixedpoint.cc
+++ b/test/test_fixedpoint.cc
@@ -1,56 +1,339 @@
+// Copyright 2016 The Gemmlowp Authors. All Rights Reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// test_fixedpoint.cc: unit tests covering the fixedpoint/ directory.
+
 #define GEMMLOWP_ENABLE_FIXEDPOINT_CONSTANTS_CHECKS
 
+#include <algorithm>
+#include <cmath>
+#include <random>
+#include <vector>
 #include "test.h"
 
-#include "../internal/fixedpoint.h"
+#include "../fixedpoint/fixedpoint.h"
 
-using namespace gemmlowp;
+namespace gemmlowp {
+
+namespace {
+
+// Explanation of SimdVector type and associated functions
+// (LoadSimdVector, StoreSimdVector):
+// The fixed-point code being tested here is generic in an underlying
+// integer type, which may be either scalar (int32_t) or SIMD (e.g.
+// NEON int32x4_t). We want uniform tests that cover both the scalar and
+// SIMD paths, which we achieve with this generic SimdVector abstraction,
+// local to this test.
+
+#ifdef GEMMLOWP_NEON
+using SimdVector = int32x4_t;
+constexpr std::size_t SimdVectorSize = 4;
+SimdVector LoadSimdVector(const std::int32_t* src) { return vld1q_s32(src); }
+void StoreSimdVector(std::int32_t* dst, SimdVector v) { vst1q_s32(dst, v); }
+#elif defined(GEMMLOWP_SSE4)
+using SimdVector = __m128i;
+constexpr std::size_t SimdVectorSize = 4;
+SimdVector LoadSimdVector(const std::int32_t* src) {
+  return _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
+}
+void StoreSimdVector(std::int32_t* dst, SimdVector v) {
+  _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
+}
+#else
+using SimdVector = std::int32_t;
+constexpr std::size_t SimdVectorSize = 1;
+SimdVector LoadSimdVector(const std::int32_t* src) { return *src; }
+void StoreSimdVector(std::int32_t* dst, SimdVector v) { *dst = v; }
+#endif
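+
+// On the scalar fallback, SimdVectorSize is 1 and Load/StoreSimdVector are
+// plain copies, so the SIMD-vs-scalar agreement check in TestUnaryOp below is
+// trivially satisfied; it is only meaningful on the NEON and SSE4 paths.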
+
+// Explanation of UnaryOpBase, its *Op subclasses below, and TestUnaryOp:
+// Most (though not all) of the fixedpoint functionality being tested
+// consists of functions taking one fixedpoint value and returning one
+// fixedpoint value, e.g. "exp" or "tanh". We call them "unary operators".
+// We factor a lot of testing boilerplate into a common TestUnaryOp function
+// taking a "unary op" object that fully describes the function to be tested.
+// These objects inherit UnaryOpBase mostly as a means to share some default
+// values for some properties.
+//
+// An important design element here is that the fixed-point values are passed
+// around as raw integers (e.g. int32_t or SIMD types such as int32x4_t), not
+// as higher-level FixedPoint objects. The motivation for this design is 1) to
+// avoid having to templatize everything in the tIntegerBits parameter of
+// class FixedPoint, and 2) to allow directly testing low-level functions
+// operating on raw types (e.g. RoundingDivideByPOT) without needlessly
+// requiring wrapping raw values in FixedPoint objects.
+class UnaryOpBase {
+ public:
+  // Min bound of the input range of this op. For example, an op only handling
+  // nonnegative values would return 0.
+  std::int32_t MinInput() const {
+    return std::numeric_limits<std::int32_t>::min();
+  }
+  // Max bound of the input range of this op. For example, an op only handling
+  // nonpositive values would return 0.
+  std::int32_t MaxInput() const {
+    return std::numeric_limits<std::int32_t>::max();
+  }
+  // Tolerated difference between actual and reference int32 values.
+  // Note that the corresponding real-number tolerance depends on the number
+  // of integer bits in the fixed-point representation of the results of this
+  // op. For example, for an op returning fixed-point values with 0 integer
+  // bits, the correspondence between real-number values and raw values is
+  // real_number = (2^-31) * raw_value.
+  std::int32_t Tolerance() const { return 0; }
+};
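+
+// For instance, at 0 integer bits a raw tolerance of 500 (as used by the
+// exp-related ops below) corresponds to a real-number tolerance of about
+// 500 * 2^-31, i.e. roughly 2.3e-7.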
+
+// Op wrapping RoundingDivideByPOT
+class RoundingDivideByPOTOp final : public UnaryOpBase {
+ public:
+  RoundingDivideByPOTOp(int exponent) : exponent_(exponent) {}
+  std::int32_t ReferenceOp(std::int32_t x) const {
+    const double d = static_cast<double>(x) / (1ll << exponent_);
+    return static_cast<std::int32_t>(std::round(d));
+  }
+  template <typename tRawType>
+  tRawType Op(tRawType x) const {
+    return RoundingDivideByPOT(x, exponent_);
+  }
+
+ private:
+  const int exponent_;
+};
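+
+// For example, with exponent_ == 1 both ReferenceOp and the tested
+// RoundingDivideByPOT map 5 to 3 and -5 to -3 (round to nearest, ties away
+// from zero); the zero Tolerance() inherited from UnaryOpBase requires exact
+// agreement between the two.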
+
+// Op wrapping SaturatingRoundingMultiplyByPOT
+template <int tExponent>
+class SaturatingRoundingMultiplyByPOTOp final : public UnaryOpBase {
+ public:
+  std::int32_t ReferenceOp(std::int32_t x) const {
+    const double d = static_cast<double>(x) * std::pow(2., tExponent);
+    const double clamp_min = std::numeric_limits<std::int32_t>::min();
+    const double clamp_max = std::numeric_limits<std::int32_t>::max();
+    const double clamped = std::min(clamp_max, std::max(clamp_min, d));
+    return static_cast<std::int32_t>(std::round(clamped));
+  }
+  template <typename tRawType>
+  tRawType Op(tRawType x) const {
+    return SaturatingRoundingMultiplyByPOT<tExponent>(x);
+  }
+};
+
+// Op wrapping exp_on_interval_between_negative_one_quarter_and_0_excl
+class ExpOnIntervalBetweenNegativeOneQuarterAnd0ExclOp final
+    : public UnaryOpBase {
+ public:
+  std::int32_t MinInput() const { return -(1 << 29); }
+  std::int32_t MaxInput() const { return 0; }
+  std::int32_t Tolerance() const { return 500; }
+  std::int32_t ReferenceOp(std::int32_t x) const {
+    using F = FixedPoint<std::int32_t, 0>;
+    const double d = ToDouble(F::FromRaw(x));
+    const double e = std::exp(d);
+    return F::FromDouble(e).raw();
+  }
+  template <typename tRawType>
+  tRawType Op(tRawType x) const {
+    using F = FixedPoint<tRawType, 0>;
+    const F f = F::FromRaw(x);
+    const F e = exp_on_interval_between_negative_one_quarter_and_0_excl(f);
+    return e.raw();
+  }
+};
+
+// Op wrapping exp_on_negative_values
+template <int tIntegerBits>
+class ExpOnNegativeValuesOp final : public UnaryOpBase {
+ public:
+  std::int32_t MaxInput() const { return 0; }
+  std::int32_t Tolerance() const { return 500; }
+  std::int32_t ReferenceOp(std::int32_t x) const {
+    using F = FixedPoint<std::int32_t, tIntegerBits>;
+    using F0 = FixedPoint<std::int32_t, 0>;
+    const double d = ToDouble(F::FromRaw(x));
+    const double e = std::exp(d);
+    return F0::FromDouble(e).raw();
+  }
+  template <typename tRawType>
+  tRawType Op(tRawType x) const {
+    using F = FixedPoint<tRawType, tIntegerBits>;
+    const F f = F::FromRaw(x);
+    return exp_on_negative_values(f).raw();
+  }
+};
+
+// Op wrapping one_minus_x_over_one_plus_x_for_x_in_0_1
+class OneMinusXOverOnePlusXForXIn01Op final : public UnaryOpBase {
+ public:
+  std::int32_t MinInput() const { return 0; }
+  std::int32_t Tolerance() const { return 12; }
+  std::int32_t ReferenceOp(std::int32_t x) const {
+    using F = FixedPoint<std::int32_t, 0>;
+    const double d = ToDouble(F::FromRaw(x));
+    const double e = (1 - d) / (1 + d);
+    return F::FromDouble(e).raw();
+  }
+  template <typename tRawType>
+  tRawType Op(tRawType x) const {
+    using F = FixedPoint<tRawType, 0>;
+    const F f = F::FromRaw(x);
+    return one_minus_x_over_one_plus_x_for_x_in_0_1(f).raw();
+  }
+};
+
+// Op wrapping tanh
+template <int tIntegerBits>
+class TanhOp final : public UnaryOpBase {
+ public:
+  std::int32_t Tolerance() const { return 310; }
+  std::int32_t ReferenceOp(std::int32_t x) const {
+    using F = FixedPoint<std::int32_t, tIntegerBits>;
+    using F0 = FixedPoint<std::int32_t, 0>;
+    const double d = ToDouble(F::FromRaw(x));
+    const double e = std::tanh(d);
+    return F0::FromDouble(e).raw();
+  }
+  template <typename tRawType>
+  tRawType Op(tRawType x) const {
+    using F = FixedPoint<tRawType, tIntegerBits>;
+    const F f = F::FromRaw(x);
+    return tanh(f).raw();
+  }
+};
+
+// Op wrapping one_over_one_plus_x_for_x_in_0_1
+class OneOverOnePlusXForXIn01Op final : public UnaryOpBase {
+ public:
+  std::int32_t MinInput() const { return 0; }
+  std::int32_t Tolerance() const { return 6; }
+  std::int32_t ReferenceOp(std::int32_t x) const {
+    using F = FixedPoint<std::int32_t, 0>;
+    const double d = ToDouble(F::FromRaw(x));
+    const double e = 1 / (1 + d);
+    return F::FromDouble(e).raw();
+  }
+  template <typename tRawType>
+  tRawType Op(tRawType x) const {
+    using F = FixedPoint<tRawType, 0>;
+    const F f = F::FromRaw(x);
+    return one_over_one_plus_x_for_x_in_0_1(f).raw();
+  }
+};
+
+// Op wrapping logistic
+template <int tIntegerBits>
+class LogisticOp final : public UnaryOpBase {
+ public:
+  std::int32_t Tolerance() const { return 155; }
+  std::int32_t ReferenceOp(std::int32_t x) const {
+    using F = FixedPoint<std::int32_t, tIntegerBits>;
+    using F0 = FixedPoint<std::int32_t, 0>;
+    const double d = ToDouble(F::FromRaw(x));
+    const double e = 1 / (1 + std::exp(-d));
+    return F0::FromDouble(e).raw();
+  }
+  template <typename tRawType>
+  tRawType Op(tRawType x) const {
+    using F = FixedPoint<tRawType, tIntegerBits>;
+    const F f = F::FromRaw(x);
+    return logistic(f).raw();
+  }
+};
+
+// Tests a given op, on a given list of int32 input values.
+template <typename tUnaryOpType>
+void TestUnaryOp(const tUnaryOpType& unary_op,
+                 const std::vector<std::int32_t>& testvals_int32) {
+  Check(0 == (testvals_int32.size() % SimdVectorSize));
+  for (std::size_t i = 0; i < testvals_int32.size(); i += SimdVectorSize) {
+    // First, clamp the input int32 values according to the MinInput() and
+    // MaxInput() bounds returned by the op.
+    std::int32_t input[SimdVectorSize] = {0};
+    for (std::size_t j = 0; j < SimdVectorSize; j++) {
+      const std::int32_t raw_input = testvals_int32[i + j];
+      input[j] = std::min(unary_op.MaxInput(),
+                          std::max(unary_op.MinInput(), raw_input));
+    }
+    // Compute reference results and check that the actual results on
+    // scalar inputs agree with them, to the Tolerance() returned by the op.
+    std::int32_t reference[SimdVectorSize] = {0};
+    std::int32_t actual_scalar[SimdVectorSize] = {0};
+    for (std::size_t j = 0; j < SimdVectorSize; j++) {
+      reference[j] = unary_op.ReferenceOp(input[j]);
+      actual_scalar[j] = unary_op.Op(input[j]);
+      const std::int64_t diff = static_cast<std::int64_t>(actual_scalar[j]) -
+                                static_cast<std::int64_t>(reference[j]);
+      Check(std::abs(diff) <= unary_op.Tolerance());
+    }
+    // Check that the actual results on SIMD inputs agree *exactly* with the
+    // actual results on scalar inputs. I.e. SIMD must make absolutely no
+    // difference to the results, regardless of the fact that both scalar and
+    // SIMD results may differ from the reference results.
+    std::int32_t actual_simd[SimdVectorSize] = {0};
+    StoreSimdVector(actual_simd, unary_op.Op(LoadSimdVector(input)));
+    for (std::size_t j = 0; j < SimdVectorSize; j++) {
+      Check(actual_simd[j] == actual_scalar[j]);
+    }
+  }
+}
 
 template <int tIntegerBits>
-void test_convert(FixedPoint<int32_t, tIntegerBits> x) {
-  typedef FixedPoint<int32_t, tIntegerBits> F;
-  F y = ToFixedPoint<int32_t, tIntegerBits>(ToDouble(x));
+void test_convert(FixedPoint<std::int32_t, tIntegerBits> x) {
+  typedef FixedPoint<std::int32_t, tIntegerBits> F;
+  F y = F::FromDouble(ToDouble(x));
   Check(y == x);
 }
 
 template <int tIntegerBits_a, int tIntegerBits_b>
-void test_Rescale(FixedPoint<int32_t, tIntegerBits_a> a) {
-  FixedPoint<int32_t, tIntegerBits_b> actual = Rescale<tIntegerBits_b>(a);
-  FixedPoint<int32_t, tIntegerBits_b> expected =
-      ToFixedPoint<int32_t, tIntegerBits_b>(ToDouble(a));
+void test_Rescale(FixedPoint<std::int32_t, tIntegerBits_a> a) {
+  FixedPoint<std::int32_t, tIntegerBits_b> actual = Rescale<tIntegerBits_b>(a);
+  FixedPoint<std::int32_t, tIntegerBits_b> expected =
+      FixedPoint<std::int32_t, tIntegerBits_b>::FromDouble(ToDouble(a));
   Check(actual == expected);
 }
 
 template <int tIntegerBits_a, int tIntegerBits_b>
-void test_Rescale(const std::vector<int32_t>& testvals_int32) {
+void test_Rescale(const std::vector<std::int32_t>& testvals_int32) {
   for (auto a : testvals_int32) {
-    FixedPoint<int32_t, tIntegerBits_a> aq;
+    FixedPoint<std::int32_t, tIntegerBits_a> aq;
     aq.raw() = a;
     test_Rescale<tIntegerBits_a, tIntegerBits_b>(aq);
   }
 }
 
 template <int tIntegerBits_a, int tIntegerBits_b>
-void test_mul(FixedPoint<int32_t, tIntegerBits_a> a,
-              FixedPoint<int32_t, tIntegerBits_b> b) {
-  static const int IntegerBits_ab = tIntegerBits_a + tIntegerBits_b;
-  FixedPoint<int32_t, IntegerBits_ab> ab;
+void test_mul(FixedPoint<std::int32_t, tIntegerBits_a> a,
+              FixedPoint<std::int32_t, tIntegerBits_b> b) {
+  static const int ProductIntegerBits = tIntegerBits_a + tIntegerBits_b;
+  using ProductFixedPoint = FixedPoint<std::int32_t, ProductIntegerBits>;
+  ProductFixedPoint ab;
   ab = a * b;
   double a_double = ToDouble(a);
   double b_double = ToDouble(b);
   double ab_double = a_double * b_double;
-  FixedPoint<int32_t, IntegerBits_ab> expected =
-      ToFixedPoint<int32_t, IntegerBits_ab>(ab_double);
-  int64_t diff = int64_t(ab.raw()) - int64_t(expected.raw());
+  ProductFixedPoint expected = ProductFixedPoint::FromDouble(ab_double);
+  std::int64_t diff = std::int64_t(ab.raw()) - std::int64_t(expected.raw());
   Check(std::abs(diff) <= 1);
 }
 
 template <int tIntegerBits_a, int tIntegerBits_b>
-void test_mul(const std::vector<int32_t>& testvals_int32) {
+void test_mul(const std::vector<std::int32_t>& testvals_int32) {
   for (auto a : testvals_int32) {
     for (auto b : testvals_int32) {
-      FixedPoint<int32_t, tIntegerBits_a> aq;
-      FixedPoint<int32_t, tIntegerBits_b> bq;
+      FixedPoint<std::int32_t, tIntegerBits_a> aq;
+      FixedPoint<std::int32_t, tIntegerBits_b> bq;
       aq.raw() = a;
       bq.raw() = b;
       test_mul(aq, bq);
@@ -59,125 +342,24 @@
 }
 
 template <int tExponent, int tIntegerBits_a>
-void test_ExactMulByPot(FixedPoint<int32_t, tIntegerBits_a> a) {
+void test_ExactMulByPot(FixedPoint<std::int32_t, tIntegerBits_a> a) {
   double x = ToDouble(a) * std::pow(2.0, tExponent);
   double y = ToDouble(ExactMulByPot<tExponent>(a));
   Check(x == y);
 }
 
 template <int tExponent, int tIntegerBits_a>
-void test_ExactMulByPot(const std::vector<int32_t>& testvals_int32) {
+void test_ExactMulByPot(const std::vector<std::int32_t>& testvals_int32) {
   for (auto a : testvals_int32) {
-    FixedPoint<int32_t, tIntegerBits_a> aq;
+    FixedPoint<std::int32_t, tIntegerBits_a> aq;
     aq.raw() = a;
     test_ExactMulByPot<tExponent, tIntegerBits_a>(aq);
   }
 }
 
-void test_exp_on_interval_between_negative_one_quarter_and_0_excl(
-    FixedPoint<int32_t, 0> a) {
-  double a_double = ToDouble(a);
-  double expected = std::exp(a_double);
-  double actual =
-      ToDouble(exp_on_interval_between_negative_one_quarter_and_0_excl(a));
-  double error = expected - actual;
-  Check(std::abs(error) < 3e-7);
-}
-
-void test_exp_on_interval_between_negative_one_quarter_and_0_excl(
-    const std::vector<int32_t>& testvals_int32) {
-  for (auto a : testvals_int32) {
-    typedef FixedPoint<int32_t, 0> F;
-    F aq = SaturatingRoundingMultiplyByPOT<-3>(F::FromRaw(a)) -
-           F::ConstantPOT<-3>();
-    test_exp_on_interval_between_negative_one_quarter_and_0_excl(aq);
-  }
-}
-
-template <int tIntegerBits>
-void test_exp_on_negative_values(FixedPoint<int32_t, tIntegerBits> a) {
-  double a_double = ToDouble(a);
-  double expected = std::exp(a_double);
-  double actual = ToDouble(exp_on_negative_values(a));
-  double error = expected - actual;
-  Check(std::abs(error) < 3e-7);
-}
-
-template <int tIntegerBits>
-void test_exp_on_negative_values(const std::vector<int32_t>& testvals_int32) {
-  for (auto a : testvals_int32) {
-    if (a < 0) {
-      FixedPoint<int32_t, tIntegerBits> aq;
-      aq.raw() = a;
-      test_exp_on_negative_values(aq);
-    }
-  }
-}
-
-void test_one_minus_x_over_one_plus_x_for_x_in_0_1(FixedPoint<int32_t, 0> a) {
-  double a_double = ToDouble(a);
-  double expected = (1 - a_double) / (1 + a_double);
-  FixedPoint<int32_t, 0> retval = one_minus_x_over_one_plus_x_for_x_in_0_1(a);
-  double actual = ToDouble(retval);
-  double error = expected - actual;
-  Check(std::abs(error) < 6e-9);
-}
-
-void test_one_minus_x_over_one_plus_x_for_x_in_0_1(
-    const std::vector<int32_t>& testvals_int32) {
-  for (auto a : testvals_int32) {
-    if (a > 0) {
-      FixedPoint<int32_t, 0> aq;
-      aq.raw() = a;
-      test_one_minus_x_over_one_plus_x_for_x_in_0_1(aq);
-    }
-  }
-}
-
-template <int tIntegerBits>
-void test_tanh(FixedPoint<int32_t, tIntegerBits> a) {
-  double a_double = ToDouble(a);
-  double expected = std::tanh(a_double);
-  double actual = ToDouble(tanh(a));
-  double error = expected - actual;
-  Check(std::abs(error) < 1.5e-7);
-}
-
-template <int tIntegerBits>
-void test_tanh(const std::vector<int32_t>& testvals_int32) {
-  for (auto a : testvals_int32) {
-    FixedPoint<int32_t, tIntegerBits> aq;
-    aq.raw() = a;
-    test_tanh(aq);
-  }
-}
-
-#ifdef GEMMLOWP_NEON
-void test_int32x4(const std::vector<int32_t>& testvals_int32) {
-  size_t n = testvals_int32.size();
-  size_t n4 = n - (n % 4);
-  std::vector<int32_t> results_int32(n4);
-  std::vector<int32_t> results_int32x4(n4);
-
-  for (size_t i = 0; i < n4; i++) {
-    results_int32[i] =
-        tanh(FixedPoint<int32_t, 4>::FromRaw(testvals_int32[i])).raw();
-  }
-  for (size_t i = 0; i < n4; i++) {
-    vst1q_s32(
-        &results_int32x4[i],
-        tanh(FixedPoint<int32x4_t, 4>::FromRaw(vld1q_s32(&testvals_int32[i])))
-            .raw());
-  }
-
-  for (size_t i = 0; i < n4; i++) {
-    Check(results_int32[i] == results_int32x4[i]);
-  }
-}
-#endif  // GEMMLOWP_NEON
-
-int main() {
-  std::vector<int32_t> testvals_int32;
+// Make the list of test values to test each op against.
+std::vector<std::int32_t> MakeTestValsInt32() {
+  std::vector<std::int32_t> testvals_int32;
 
   for (int i = 0; i < 31; i++) {
     testvals_int32.push_back((1 << i) - 2);
@@ -191,23 +373,96 @@
     testvals_int32.push_back(-(1 << i) + 1);
     testvals_int32.push_back(-(1 << i) + 2);
   }
-  testvals_int32.push_back(std::numeric_limits<int32_t>::min());
-  testvals_int32.push_back(std::numeric_limits<int32_t>::min() + 1);
-  testvals_int32.push_back(std::numeric_limits<int32_t>::min() + 2);
-  testvals_int32.push_back(std::numeric_limits<int32_t>::max() - 2);
-  testvals_int32.push_back(std::numeric_limits<int32_t>::max() - 1);
-  testvals_int32.push_back(std::numeric_limits<int32_t>::max());
+  testvals_int32.push_back(std::numeric_limits<std::int32_t>::min());
+  testvals_int32.push_back(std::numeric_limits<std::int32_t>::min() + 1);
+  testvals_int32.push_back(std::numeric_limits<std::int32_t>::min() + 2);
+  testvals_int32.push_back(std::numeric_limits<std::int32_t>::max() - 2);
+  testvals_int32.push_back(std::numeric_limits<std::int32_t>::max() - 1);
+  testvals_int32.push_back(std::numeric_limits<std::int32_t>::max());
 
-  uint32_t random = 1;
+  std::mt19937 random_engine;
+  std::uniform_int_distribution<std::int32_t> uniform_distribution(
+      std::numeric_limits<std::int32_t>::min(),
+      std::numeric_limits<std::int32_t>::max());
   for (int i = 0; i < 1000; i++) {
-    random = random * 1664525 + 1013904223;
-    testvals_int32.push_back(static_cast<int32_t>(random));
+    testvals_int32.push_back(uniform_distribution(random_engine));
+  }
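+
+  // The hand-picked values above cluster around powers of two and the int32
+  // extremes, where rounding and saturation corner cases live; the 1000
+  // uniformly distributed values just added give broader coverage.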
+
+  // SIMD tests will require the length of testvals_int32 to be a multiple
+  // of SIMD vector size.
+  while (testvals_int32.size() % SimdVectorSize) {
+    testvals_int32.push_back(0);
   }
 
   std::sort(testvals_int32.begin(), testvals_int32.end());
+  return testvals_int32;
+}
+
+}  // end anonymous namespace
+
+}  // end namespace gemmlowp
+
+int main() {
+  using namespace gemmlowp;
+
+  const std::vector<std::int32_t> testvals_int32 = MakeTestValsInt32();
+
+  for (int s = 0; s < 32; s++) {
+    TestUnaryOp(RoundingDivideByPOTOp(s), testvals_int32);
+  }
+
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<-31>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<-30>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<-29>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<-17>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<-16>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<-15>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<-4>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<-3>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<-2>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<-1>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<0>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<1>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<2>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<3>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<4>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<15>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<16>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<17>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<29>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<30>(), testvals_int32);
+  TestUnaryOp(SaturatingRoundingMultiplyByPOTOp<31>(), testvals_int32);
+
+  TestUnaryOp(ExpOnIntervalBetweenNegativeOneQuarterAnd0ExclOp(),
+              testvals_int32);
+  TestUnaryOp(ExpOnNegativeValuesOp<0>(), testvals_int32);
+  TestUnaryOp(ExpOnNegativeValuesOp<1>(), testvals_int32);
+  TestUnaryOp(ExpOnNegativeValuesOp<2>(), testvals_int32);
+  TestUnaryOp(ExpOnNegativeValuesOp<3>(), testvals_int32);
+  TestUnaryOp(ExpOnNegativeValuesOp<4>(), testvals_int32);
+  TestUnaryOp(ExpOnNegativeValuesOp<5>(), testvals_int32);
+  TestUnaryOp(ExpOnNegativeValuesOp<6>(), testvals_int32);
+
+  TestUnaryOp(OneMinusXOverOnePlusXForXIn01Op(), testvals_int32);
+  TestUnaryOp(TanhOp<0>(), testvals_int32);
+  TestUnaryOp(TanhOp<1>(), testvals_int32);
+  TestUnaryOp(TanhOp<2>(), testvals_int32);
+  TestUnaryOp(TanhOp<3>(), testvals_int32);
+  TestUnaryOp(TanhOp<4>(), testvals_int32);
+  TestUnaryOp(TanhOp<5>(), testvals_int32);
+  TestUnaryOp(TanhOp<6>(), testvals_int32);
+
+  TestUnaryOp(OneOverOnePlusXForXIn01Op(), testvals_int32);
+  TestUnaryOp(LogisticOp<0>(), testvals_int32);
+  TestUnaryOp(LogisticOp<1>(), testvals_int32);
+  TestUnaryOp(LogisticOp<2>(), testvals_int32);
+  TestUnaryOp(LogisticOp<3>(), testvals_int32);
+  TestUnaryOp(LogisticOp<4>(), testvals_int32);
+  TestUnaryOp(LogisticOp<5>(), testvals_int32);
+  TestUnaryOp(LogisticOp<6>(), testvals_int32);
 
   for (auto a : testvals_int32) {
-    FixedPoint<int32_t, 4> x;
+    FixedPoint<std::int32_t, 4> x;
     x.raw() = a;
     test_convert(x);
   }
@@ -236,27 +491,5 @@
   test_ExactMulByPot<-4, 5>(testvals_int32);
   test_ExactMulByPot<-2, 6>(testvals_int32);
 
-  test_exp_on_interval_between_negative_one_quarter_and_0_excl(testvals_int32);
-
-  test_exp_on_negative_values<1>(testvals_int32);
-  test_exp_on_negative_values<2>(testvals_int32);
-  test_exp_on_negative_values<3>(testvals_int32);
-  test_exp_on_negative_values<4>(testvals_int32);
-  test_exp_on_negative_values<5>(testvals_int32);
-  test_exp_on_negative_values<6>(testvals_int32);
-
-  test_one_minus_x_over_one_plus_x_for_x_in_0_1(testvals_int32);
-
-  test_tanh<1>(testvals_int32);
-  test_tanh<2>(testvals_int32);
-  test_tanh<3>(testvals_int32);
-  test_tanh<4>(testvals_int32);
-  test_tanh<5>(testvals_int32);
-  test_tanh<6>(testvals_int32);
-
-#ifdef GEMMLOWP_NEON
-  test_int32x4(testvals_int32);
-#endif  // GEMMLOWP_NEON
-
   std::cerr << "All tests passed." << std::endl;
 }