The low-precision paradigm in gemmlowp, and how it's implemented
****************************************************************


Introduction
============

"Low-precision" means that the input and output matrix entries are integers
on at most 8 bits. The scalar type is uint8_t.

This isn't the same as just doing plain matrix arithmetic over uint8_t,
because that would overflow. To avoid overflow, we internally accumulate
results on more than 8 bits, and at the end we keep only some significant
8 bits. This relies on the caller providing suitable offset/multiplier/shift
parameters, which effectively govern how we extract 8 significant bits
from our more-than-8-bit temporary accumulators.

Gemmlowp supports further reducing precision below 8 bits. That is not
the subject of this document; for that, refer to doc/less-than-8-bit.txt.


The low-precision paradigm
==========================

gemmlowp is an implementation of the EightBitIntGemm paradigm, where quantized
matrix multiplication takes the following input parameters:
- the lhs matrix of uint8 quantized values
- the rhs matrix of uint8 quantized values
- the following int32 "quantization parameters", which control how the
  uint8 quantized values in the matrices are to be interpreted during the
  matrix computation:
  - lhs_offset
  - rhs_offset
  - result_offset
  - result_mult_int
  - result_shift

The mathematical expression to be computed is the result of the following
steps (a reference implementation of them is sketched after the list):
1. Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
2. Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
3. Compute the int32 matrix product of the resulting lhs times rhs.
4. Add result_offset to each entry of the result.
5. Multiply each entry of the result by the following fraction, and round
   to the nearest integer:

                           result_mult_int
                           ---------------      (1)
                           2^result_shift

6. Clamp the resulting int32 values to the [0..255] range and cast to uint8.
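
As an illustration, here is a minimal, unoptimized reference implementation of
the six steps above, for row-major matrices. The function name and the layout
conventions are only for this sketch; this is not gemmlowp's public API.

  #include <algorithm>
  #include <cstdint>

  // Reference implementation of steps 1-6 above.
  // lhs is rows x depth, rhs is depth x cols, result is rows x cols,
  // all stored row-major.
  void ReferenceEightBitIntGemm(int rows, int depth, int cols,
                                const std::uint8_t* lhs, std::int32_t lhs_offset,
                                const std::uint8_t* rhs, std::int32_t rhs_offset,
                                std::int32_t result_offset,
                                std::int32_t result_mult_int,
                                std::int32_t result_shift,
                                std::uint8_t* result) {
    for (int r = 0; r < rows; r++) {
      for (int c = 0; c < cols; c++) {
        // Steps 1-3: accumulate in int32, adding the offsets to each operand.
        std::int32_t accum = 0;
        for (int d = 0; d < depth; d++) {
          const std::int32_t lhs_val = lhs[r * depth + d] + lhs_offset;
          const std::int32_t rhs_val = rhs[d * cols + c] + rhs_offset;
          accum += lhs_val * rhs_val;
        }
        // Step 4: add result_offset.
        accum += result_offset;
        // Step 5: multiply by result_mult_int / 2^result_shift, rounding
        // to nearest.
        const std::int32_t rounding =
            (result_shift < 1) ? 0 : (1 << (result_shift - 1));
        accum = (accum * result_mult_int + rounding) >> result_shift;
        // Step 6: clamp to [0, 255] and cast to uint8.
        accum = std::min(std::max(accum, std::int32_t(0)), std::int32_t(255));
        result[r * cols + c] = static_cast<std::uint8_t>(accum);
      }
    }
  }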

Thus the caller of this interface is expected to have precomputed suitable
quantization parameters.

The rationale for these parameters is as follows:
- The three offsets may improve quantization accuracy in cases where the
  range of values is limited, and they also conveniently allow reducing all
  eight combinations of operand and result signedness to just the
  unsigned*unsigned->unsigned case. One may at first glance worry that these
  offsets would incur substantial overhead in the GEMM computation, but that
  is actually not the case thanks to a trick described below (see "Efficient
  handling of offsets").
- The result_mult_int and result_shift parameters allow approximating
  arbitrarily closely any real multiplier, as a fraction of the form given
  in (1) above, without using floating-point arithmetic and without using
  a division instruction (only a right shift); a sketch of how a caller
  might choose these values is given after this list.
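
For instance, here is one hypothetical way a caller might derive such a pair
from a desired real multiplier. Choosing these values is entirely up to the
caller; gemmlowp itself does not provide this helper.

  #include <cmath>
  #include <cstdint>

  // Approximate a positive real multiplier as
  //   result_mult_int / 2^result_shift
  // for a fixed, caller-chosen result_shift. A larger shift gives a closer
  // approximation, at the cost of result_mult_int (and the scaled
  // accumulators) growing larger.
  std::int32_t ApproximateAsMultiplier(double real_multiplier,
                                       int result_shift) {
    return static_cast<std::int32_t>(
        std::round(real_multiplier * std::pow(2.0, result_shift)));
  }

  // Example: real_multiplier = 0.00392 (roughly 1/255) with result_shift = 16
  // gives result_mult_int = 257, and 257 / 2^16 is about 0.003922.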


Efficient handling of offsets
=============================

At first glance it may seem like the above-described quantized computation
scheme requires adding the lhs_offset and rhs_offset to each of the lhs and
rhs matrix entries.

Doing that in the GEMM kernel would incur substantial overhead:
- It would mean extra arithmetic work in the GEMM kernel;
- It would require storing the lhs_offset and rhs_offset in registers,
  which would eat into the register space available for the rest of the
  GEMM kernel.

One may then consider adding the lhs_offset and rhs_offset once and for all
to lhs and rhs blocks, in a GEMM implementation operating on one lhs block
and one rhs block at a time. However, doing so would require storing lhs and
rhs blocks in 32 bits (or at least in 16 bits in real-world cases), which
would partially negate the memory bandwidth benefits of low-precision
computation.

Fortunately, there is another way to handle these offsets that has none of
the costs of the approaches described above. The idea is as follows.

Let P denote the matrix shaped like lhs, but filled with 1's.
Let Q denote the matrix shaped like rhs, but filled with 1's.

Adding lhs_offset to each entry of lhs means adding lhs_offset * P to lhs.
Adding rhs_offset to each entry of rhs means adding rhs_offset * Q to rhs.

Thus, as far as handling lhs_offset and rhs_offset goes, the matrix product
to be computed is:

  (lhs + lhs_offset * P) * (rhs + rhs_offset * Q)

Expanding this (using distributivity of matrix multiplication over addition),
we see that the above product is equal to the following sum of 4 terms:

    lhs * rhs                                    (2)
  + lhs_offset * P * rhs
  + lhs * rhs_offset * Q
  + lhs_offset * rhs_offset * P * Q

The first term, lhs * rhs, is just the matrix multiplication ignoring the
offsets, i.e. as if lhs_offset==rhs_offset==0. Our claim here is that this
is all we have to compute in the GEMM kernel.

In the second term, lhs_offset * P * rhs, notice that since P is filled
with 1's, P * rhs has all its rows equal to each other, and equal to the
row-vector of sums of all the entries in each column of rhs.

Thus, we can compute the second term, lhs_offset * P * rhs, by summing
each column of rhs. This produces a single row-vector, and in order to add the
second term, we simply need to add this row-vector (multiplied by lhs_offset)
to each row of the result. This is just a rank one update of the result
(equivalently, the second term is a rank one matrix), and we can efficiently
store it as a single vector.

The third term, lhs * rhs_offset * Q, is entirely similar to the second one,
and can be similarly computed by summing each row of lhs, storing this in a
single column-vector, and later multiplying these sums by rhs_offset.

The fourth term is a single constant, repeated into all the entries of the
matrix. The matrix P * Q is filled with the single constant value 'depth'
(the depth of the matrix product, i.e. the number of columns of the lhs).
Thus the fourth term is simply the rank zero update adding this constant
to each matrix entry:

  lhs_offset * rhs_offset * depth
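
As a concrete sanity check of this decomposition, take lhs = [1 2] (a 1x2
matrix), rhs = [3 4]^T (a 2x1 matrix), lhs_offset = 10, rhs_offset = 20, so
that depth = 2. Computing directly with the offsets added to the entries gives

  (1+10)*(3+20) + (2+10)*(4+20) = 11*23 + 12*24 = 541,

and the four terms of (2) add up to the same value:

  lhs * rhs                           = 1*3 + 2*4  =  11
  lhs_offset * (column sums of rhs)   = 10 * (3+4) =  70
  rhs_offset * (row sums of lhs)      = 20 * (1+2) =  60
  lhs_offset * rhs_offset * depth     = 10 * 20 * 2 = 400
                                                      ---
                                                      541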


Implementation of this technique in gemmlowp
============================================

In gemmlowp, at the packing stage (where we traverse blocks of the lhs and rhs
to prepare them for efficient repeated traversal by the kernel), we compute
the sum of each row of the lhs block and the sum of each column of the rhs
block.

See the following member of the PackedSideBlock class in internal/pack.h:

  // Handle on the additional buffer backing the vector of sums of slices
  // associated with this block. Owned.
  Allocator::Handle sums_of_each_slice_handle_;

sums_of_each_slice_handle_ is the handle to the buffer allocated to store
the vector containing the sums of rows of lhs, or the sums of columns of rhs.
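
A minimal illustration of what computing these sums amounts to, ignoring
gemmlowp's actual packed storage order and blocking, and assuming row-major
lhs and rhs blocks:

  #include <cstdint>

  // Sum each row of a row-major rows x depth lhs block.
  void ComputeLhsSumsOfEachRow(int rows, int depth, const std::uint8_t* lhs,
                               std::int32_t* sums_of_each_row) {
    for (int r = 0; r < rows; r++) {
      std::int32_t sum = 0;
      for (int d = 0; d < depth; d++) sum += lhs[r * depth + d];
      sums_of_each_row[r] = sum;
    }
  }

  // Sum each column of a row-major depth x cols rhs block.
  void ComputeRhsSumsOfEachCol(int depth, int cols, const std::uint8_t* rhs,
                               std::int32_t* sums_of_each_col) {
    for (int c = 0; c < cols; c++) {
      std::int32_t sum = 0;
      for (int d = 0; d < depth; d++) sum += rhs[d * cols + c];
      sums_of_each_col[c] = sum;
    }
  }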

After these rank one updates have been computed at the packing stage, they are
ignored at the compute kernel stage, since that stage is only concerned
with the first of the four terms in (2); they are only used at the unpacking
stage. See the default/reference implementation, UnpackResultImpl, in
internal/unpack.h.
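
To make the division of work concrete, here is a schematic sketch (not
gemmlowp's actual code) of what the unpack stage does with a block of int32
accumulators produced by the kernel, given the sums of each row of lhs and of
each column of rhs saved at packing time. The names and the row-major layout
are illustrative assumptions.

  #include <algorithm>
  #include <cstdint>

  // Schematic unpack of one rows x cols block of int32 accumulators.
  // raw_accum holds lhs * rhs only, i.e. the first term of (2).
  void UnpackBlock(int rows, int cols, int depth,
                   const std::int32_t* raw_accum,
                   const std::int32_t* lhs_sums_of_each_row,
                   const std::int32_t* rhs_sums_of_each_col,
                   std::int32_t lhs_offset, std::int32_t rhs_offset,
                   std::int32_t result_offset, std::int32_t result_mult_int,
                   std::int32_t result_shift, std::uint8_t* result) {
    // Fourth term of (2): a constant added to every entry.
    const std::int32_t constant_term = lhs_offset * rhs_offset * depth;
    const std::int32_t rounding =
        (result_shift < 1) ? 0 : (1 << (result_shift - 1));
    for (int r = 0; r < rows; r++) {
      for (int c = 0; c < cols; c++) {
        std::int32_t term = raw_accum[r * cols + c]                 // term 1
                            + lhs_offset * rhs_sums_of_each_col[c]  // term 2
                            + rhs_offset * lhs_sums_of_each_row[r]  // term 3
                            + constant_term;                        // term 4
        // Requantization: add result_offset, scale, round, clamp, narrow.
        term = ((term + result_offset) * result_mult_int + rounding)
               >> result_shift;
        term = std::min(std::max(term, std::int32_t(0)), std::int32_t(255));
        result[r * cols + c] = static_cast<std::uint8_t>(term);
      }
    }
  }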