Computation with less than 8 bits in gemmlowp
*********************************************


Introduction
============

We assume familiarity with gemmlowp's low-precision uint8 computation
paradigm, which is described in doc/low-precision.txt.

This document is about the possibility of further reducing precision
below 8 bits.

This makes it possible to get higher arithmetic throughput on some
architectures, at the cost of decreased accuracy.

Public interface
================


The BitDepthSetting parameter in the EightBitIntGemm interface
--------------------------------------------------------------

Accessing less-than-8-bit computation via the EightBitIntGemm interface
is very simple: EightBitIntGemm takes a BitDepthSetting enum
which allows choosing among a fixed set of supported bit-depth
combinations.

The BitDepthParams parameter in the public/gemmlowp.h interface
---------------------------------------------------------------

The public/gemmlowp.h interface exposes more extensive control over
quantization, by means of a BitDepthParams template parameter, which
is a type parameter carrying information about:
  1. The LHS and RHS bit depth, which can be set arbitrarily and
     independently;
  2. The 'RoundingStrategy', which is the heuristic used to choose
     a rounding mode, based on the accumulation size (a.k.a. the
     "depth" dimension of the Gemm).
Details can be seen in public/bit_depth.h.

How does BitDepth{Setting,Params} affect input/output uint8 matrix data?
------------------------------------------------------------------------

Input/output matrix data is all uint8's, ranging from 0 to 255, regardless of
the BitDepth{Setting,Params}.

So the BitDepth{Setting,Params} is purely an internal detail: it allows
gemmlowp to use lower precision internally, but the input/output data
format is unaffected.

As far as the API contract goes, the only thing that the
BitDepth{Setting,Params} does is to relax the accuracy requirement.
With standard 8bit/8bit computation, gemmlowp is required to return the exact
result as specified in doc/low-precision.txt. With lower bit depths, gemmlowp
is no longer required to return an exact result.

Implementation
==============

Here we refer to the 3 stages of computation described in doc/design.txt,
namely: packing, computation kernel, unpacking.

The general idea is that at the packing stage, we requantize input (Lhs/Rhs)
data to less-than-8-bit depths by scaling them, thus shrinking the range of
the packed matrix entries; for instance, if the Rhs bit depth is to be 5 bits
then packed Rhs matrix entries will be in the range [0 ... 31]. This then
allows the GEMM kernel to use narrower accumulators without risking overflow,
thus achieving higher arithmetic throughput. Finally, at the unpacking stage,
it only remains to scale the result values to compensate for the scalings
applied earlier.

Let us go into more detail for each of those stages:


Packing stage
-------------

The packing stage is where most of the work specific to the BitDepthParams
takes place.

Here, we have to scale input matrix values from their original range of
[0 ... 255] to the range specified by the BitDepthParams, which is
[0 ... (2^N)-1] where N is the number of bits for the matrix at hand
(Lhs or Rhs). For example, for a bit depth of 5 bits, we need to scale
down to [0 ... 31].

This scaling is what we call "requantization". The pedantic name matches
the fact that this is actually quite nontrivial to do correctly, i.e.
in such a way that the result accuracy will be good enough for real-world
applications. See the section below on requantization details.

Concretely, this work happens in PackingRegisterBlock::Pack(), which calls
Requantize(). This is in internal/pack.h. This code can be overridden for
specific architectures, see internal/pack_neon.h.

This requantization work is costly and makes packing slower. This means
that, at least in our approach, less-than-8-bit computation is only
interesting for large-enough, square-enough GEMMs where packing is only
a small fraction of the overall cost. In cases where packing overhead
is more prevalent (highly rectangular cases), less-than-8-bit is probably
a waste of time as long as we treat it as an internal computation detail.
What might help there would be shrinking the input/output data format
itself, to lower memory bandwidth usage.


Computation kernel stage
------------------------

In principle, the computation kernel stage simply doesn't have to care
about the bit depth at all. In fact, on architectures where we do not have
specific optimized kernels for less-than-8-bit cases, we simply use our
standard kernel there, and that's correct!

However, while the kernel doesn't have to know that the
operands are on less than 8 bits, it can use that information to make
special optimizations that would be incorrect in the general 8-bit case
and become correct here thanks to the more restricted range of inputs.
That's the whole point of this less-than-8-bit computation idea.

With Lhs entries guaranteed to be smaller than 2^N, and Rhs entries
guaranteed to be smaller than 2^M, each product is guaranteed to be
smaller than 2^(M+N). Thus, one may accumulate 2^(16-(M+N)) such products
and still be guaranteed that the accumulator will be smaller than 2^16,
and thus can be stored as a uint16 without risking overflow.

For example, in the L7R5 case, the Lhs entries are on 7 bits (N=7) and the
Rhs entries are on 5 bits (M=5), so each product fits in 12 bits and one can
thus accumulate 16 ( = 2^(16-12)) such products into uint16 accumulators
with no risk of overflow.
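
These bounds are easy to check numerically. The following sketch
(illustrative only, not gemmlowp code) verifies that 16 maximal L7R5
products fit in a uint16, while 17 would overflow:

```cpp
#include <cassert>
#include <cstdint>

// In the L7R5 case, Lhs entries are < 2^7 and Rhs entries are < 2^5,
// so each product is < 2^12 and 2^(16-12) = 16 products fit in a uint16.
inline std::uint32_t MaxAccumulatedL7R5(int num_products) {
  const std::uint32_t max_lhs = (1u << 7) - 1;  // 127
  const std::uint32_t max_rhs = (1u << 5) - 1;  // 31
  return num_products * max_lhs * max_rhs;
}
```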

This means that a computation kernel may use uint16 accumulators for
several loop iterations (16 in the above example), provided that it is
allowed to assume that inputs are in such a restricted range.

After this fixed number of loop iterations, the kernel must accumulate
the local uint16 accumulators back into global uint32 accumulators.

On SIMD architectures with suitable uint16 arithmetic, this in principle
makes it possible to multiply arithmetic throughput by up to 2x, since
twice as many accumulators now fit in each SIMD vector register. This is
partially offset by the cost of accumulating back into global uint32
accumulators every several loop iterations, but our experience on ARM NEON
has been that we still get quite close to a 2x speedup. See
internal/kernel_neon.h, specifically
NEON32Kernel12x4Depth2Assuming12BitProducts.


Unpacking stage
---------------

At the unpacking stage, it only remains to scale the result values
to compensate for the scaling of the inputs. This is easier because
now we are expanding the range instead of shrinking it, so we don't
need to worry about ways to minimize a loss of accuracy. We simply
need to multiply result values by a constant fraction, rounding to nearest.

Since the inputs were scaled by factors of (2^lhs_bits - 1)/255 and
(2^rhs_bits - 1)/255 respectively, the scaling of the outputs needs to be
by the following factor:

               255 * 255
   -----------------------------------
   (2^lhs_bits - 1) * (2^rhs_bits - 1)

This is done by a MultiplyByConstantFraction function; see internal/unpack.h.
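
In code, this compensating scaling could be sketched as follows (a hedged
illustration with a hypothetical helper name; the actual
MultiplyByConstantFraction implementation in internal/unpack.h differs):

```cpp
#include <cassert>
#include <cstdint>

// Scale an accumulator by 255*255 / ((2^lhs_bits - 1) * (2^rhs_bits - 1)),
// rounding to nearest, to undo the scaling applied to the inputs at packing.
inline std::int32_t ScaleResult(std::int32_t acc, int lhs_bits, int rhs_bits) {
  const std::int64_t num = 255 * 255;
  const std::int64_t den = static_cast<std::int64_t>((1 << lhs_bits) - 1) *
                           ((1 << rhs_bits) - 1);
  // Adding den/2 before dividing implements rounding to nearest.
  return static_cast<std::int32_t>((acc * num + den / 2) / den);
}
```

Note that at 8-bit/8-bit depth the fraction is 1, so results pass through
unchanged.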


Requantization details
======================

Here we go into more detail on the Requantize() function used at the packing
stage to requantize input matrix data. See this function in internal/pack.h.

It depends on the bit depth and on a rounding mode, and requantizes an input
value in [0 ... 255] to the range [0 ... (2^N)-1] specified by the bit depth N.


Naive, bad rounding, that's plainly biased
------------------------------------------

Naive and inaccurate ways to achieve this requantization include:
  1. Shifting bits right by (8-N) bits;
  2. Multiplying by ((2^N) - 1) and dividing by 255.

Both of those are biased in some way: 1. has the wrong "derivative", since it
approximates (((2^N) - 1) / 255) by ((2^N) / 256); 2. has bias since it
effectively implements rounding towards 0.
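
As a concrete sketch (illustrative only, with hypothetical helper names),
here are those two naive schemes; the assertions below exhibit the downward
bias of scheme 2 at 5-bit depth:

```cpp
#include <cassert>
#include <cstdint>

// Naive scheme 1: right-shift by (8-N) bits. This scales by (2^N)/256
// instead of the ideal ((2^N)-1)/255, so values land systematically wrong.
inline std::uint8_t RequantizeShift(std::uint8_t src, int bits) {
  return src >> (8 - bits);
}

// Naive scheme 2: multiply by ((2^N)-1) and divide by 255. Integer
// division truncates, i.e. rounds towards 0, biasing all values downwards.
inline std::uint8_t RequantizeTruncate(std::uint8_t src, int bits) {
  const std::uint32_t maxval = (1u << bits) - 1;
  return static_cast<std::uint8_t>((src * maxval) / 255);
}
```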

In practice, both of the above requantization functions give results that are
too inaccurate for the neural network that we tried (GoogLeNet).

Round-to-nearest rounding: unbiased in principle but not in practice
--------------------------------------------------------------------

The simplest fix is to avoid the bias in 2. by rounding to nearest instead
of rounding towards 0. This can be achieved by doing

  dst = (src * maxval + rounding_offset) / 255;

where maxval = ((2^N) - 1) is the highest requantized value, and the
rounding_offset can be set to

  rounding_offset = 127

to achieve rounding-to-nearest (while the above rounding towards 0
corresponded to rounding_offset = 0).
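
A minimal sketch of this round-to-nearest requantization (not the literal
gemmlowp implementation in internal/pack.h):

```cpp
#include <cassert>
#include <cstdint>

// Round-to-nearest requantization from [0 .. 255] to [0 .. (2^N)-1]:
// adding rounding_offset = 127 before the division makes it round to
// nearest rather than towards 0.
inline std::uint8_t RequantizeRoundNearest(std::uint8_t src, int bits) {
  const std::uint32_t maxval = (1u << bits) - 1;
  return static_cast<std::uint8_t>((src * maxval + 127) / 255);
}
```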

In principle, rounding-to-nearest is unbiased and optimal in various ways.

In practice though, our input data is not random real numbers, but
already-quantized 8-bit values. That means that even in the best case, there
would be at most 255 different possible input values; in practice, we generally
see the input values distributed non-uniformly in that range, so that a majority
of input values tend to be in a much smaller range. See test/test_data.cc.

Having a large part of the input values in a very small finite set means that
the corresponding rounding errors are also in a very small finite set, which
can be small enough that the mean of these rounding errors is significantly
different from 0. That rounding-to-nearest is "unbiased" only means that over
a sufficiently large set of input values, the bias would become arbitrarily
close to 0; here, the set of input values is effectively small enough that the
resulting bias is significant.

This leads to biasing the matrix product entries, resulting in an error that
grows linearly with the depth dimension of the GEMM.
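
To make this concrete, here is a toy sketch (not taken from gemmlowp or its
tests, helper name hypothetical) of the signed round-to-nearest error for a
single input value; since the error is deterministic per value, a matrix
full of that value carries exactly this bias no matter how large it is:

```cpp
#include <cassert>
#include <cstdint>

// Signed round-to-nearest requantization error for one input value, in
// units of 1/255 of the requantized scale (so it stays an exact integer).
// Repeating the same value never averages this error away.
inline int RoundNearestErrorTimes255(std::uint8_t value, int bits) {
  const int maxval = (1 << bits) - 1;
  const int dst = (value * maxval + 127) / 255;
  // The ideal real-valued requantization is value * maxval / 255.
  return dst * 255 - value * maxval;
}
```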


Probabilistic rounding: unbiased even on small finite input distributions
-------------------------------------------------------------------------

To address that, we can instead use probabilistic rounding. The idea is that
if, for instance, we have to round the value 3.8 to the nearest integer, we can
round it to 3 with 20% probability and to 4 with 80% probability. If that value
3.8 occurs many times, the mean requantized value will thus tend to 3.8.

This amounts to keeping the above requantization formula,

  dst = (src * maxval + rounding_offset) / 255;

but now the rounding_offset is a random value in [0 .. 254].

This guarantees zero bias no matter how small the distribution of input values
is.
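
A sketch of probabilistic rounding, using a standard-library PRNG rather
than gemmlowp's own generators (helper names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Probabilistic requantization: the rounding offset is drawn uniformly
// from [0 .. 254], so a value rounds up with probability equal to its
// fractional part, making the expected result exactly src * maxval / 255.
inline std::uint8_t RequantizeProbabilistic(std::uint8_t src, int bits,
                                            std::mt19937& rng) {
  const std::uint32_t maxval = (1u << bits) - 1;
  std::uniform_int_distribution<std::uint32_t> offset(0, 254);
  return static_cast<std::uint8_t>((src * maxval + offset(rng)) / 255);
}

// Empirical mean over many trials, with a fixed seed for reproducibility.
inline double EmpiricalMean(std::uint8_t src, int bits, int trials) {
  std::mt19937 rng(12345);
  double sum = 0;
  for (int i = 0; i < trials; ++i) {
    sum += RequantizeProbabilistic(src, bits, rng);
  }
  return sum / trials;
}
```

For src = 8 at 5-bit depth, the ideal requantized value is 248/255, about
0.97; the empirical mean converges to it, whereas round-to-nearest always
returns 1.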

On the other hand, the variance of the error term here is higher than with
rounding-to-nearest; one can check that it is 2x higher.

So the error term coming from the Central Limit Theorem, which grows with
the square root of the accumulator depth, i.e. the GEMM depth,
will be sqrt(2) times higher here (the variance being 2x higher).

Still, for large enough GEMM depth, that is better than rounding-to-nearest,
which has an error term growing linearly with the GEMM depth.
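
The 2x variance ratio can be checked with a toy computation (not gemmlowp
code), averaging the exact expected squared error over all 256 possible
inputs at 5-bit depth:

```cpp
#include <cassert>
#include <cstdint>

// Mean squared requantization error over all src in [0 .. 255] at 5-bit
// depth, in units of 1/255 of the requantized scale, for round-to-nearest.
inline double MseRoundNearest() {
  double sum = 0;
  for (int src = 0; src < 256; ++src) {
    const int dst = (src * 31 + 127) / 255;
    const double err = dst * 255.0 - src * 31.0;
    sum += err * err;
  }
  return sum / 256;
}

// Same quantity for probabilistic rounding, using the exact expected
// squared error per input rather than random sampling.
inline double MseProbabilistic() {
  double sum = 0;
  for (int src = 0; src < 256; ++src) {
    const int ideal_num = src * 31;       // ideal value is ideal_num / 255
    const int floor_dst = ideal_num / 255;
    const int rem = ideal_num % 255;      // rounds up with prob rem / 255
    const double p_up = rem / 255.0;
    const double err_up = (floor_dst + 1) * 255.0 - ideal_num;
    const double err_down = floor_dst * 255.0 - ideal_num;
    sum += p_up * err_up * err_up + (1 - p_up) * err_down * err_down;
  }
  return sum / 256;
}
```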


Switching between rounding-to-nearest and probabilistic rounding
----------------------------------------------------------------

Thus, for fixed input values and bit depths, we expect that probabilistic
rounding will give more accurate results for large enough GEMM depths, while
rounding-to-nearest will be more accurate for smaller GEMM depths.

That is why we switch between these rounding modes based on GEMM depth;
see ChooseRoundingMode in internal/bit_depth_util.h.

It is based on a constant, kProbabilisticRoundingThreshold, defined
in internal/common.h and empirically determined. See the comment there.
It would be nice to better understand the statistics here and come up
with better heuristics for this switching.


Choice of pseudorandom number generator
---------------------------------------
We provide two PRNGs. The first is an 8-bit Xorshift.
It is fast, naturally produces values ranging
over an interval of width 255, which is what we need here (as opposed
to an interval of width 256), and turns out, from empirical tests,
to produce better results than a linear congruential generator (LCG).
That's unfortunate, as an 8-bit LCG runs faster (we confirmed that
on various ARM devices), but we need as perfect un-biased-ness as we can
get.

The second is an "add-mod" sequence generator, which generates
non-random values in the sequence x = (x + 97) % 255. This
generates a low-discrepancy sequence that minimizes the "clumpiness"
of the random offsets (thus, for example, quantizing a 3x3 matrix will
have a maximum additive error of about 200 from the random offsets).
While not random, this sequence performs well empirically for many
quantizations. For information about why 97 is a good value, see
https://en.wikipedia.org/wiki/Low-discrepancy_sequence#Additive_recurrence
and http://mollwollfumble.blogspot.com/2011/03/subrandom-numbers.html :
97/255 = 0.38, close to 0.382, which is the best choice. For discrete
numbers, the choice must be relatively prime to the modulus; 97 is prime,
so it is safely relatively prime to 255. 107 is another near-optimal
choice.
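
The add-mod sequence is trivial to write down; here is a sketch in closed
form (hypothetical helper name), which also exhibits its full period of 255:

```cpp
#include <cassert>
#include <cstdint>

// Closed form of the add-mod sequence x = (x + 97) % 255, starting from 0:
// the offset after (step + 1) increments. Since gcd(97, 255) = 1, the
// sequence visits all 255 residues before repeating.
inline std::uint8_t AddModOffset(int step) {
  return static_cast<std::uint8_t>((97 * (step + 1)) % 255);
}
```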

The low-discrepancy sequence generator is the default.

More details and results are given in a comment on the default
PRNG in internal/pack.h. Interested users can change the
PRNG used by setting DefaultRoundingGenerator in bit_depth_util.h.