Quantization Accuracy Debugging
-------------------------------

This document provides high-level strategies for improving quantization
accuracy. If a quantized model shows error compared to the original model,
we can categorize the error as:

1. **data insensitive error** - caused by intrinsic model quantization error;
   a large portion of the input data has large error
2. **data sensitive error** - caused by outlier input data; a small
   portion of the input data has large error
3. **implementation error** - the quantized kernel does not match the
   reference implementation
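
Before assigning the error to one of these buckets, it helps to quantify it.
The sketch below is a minimal, hypothetical example of one common metric,
SQNR (signal-to-quantized-noise ratio), applied to a single tensor and its
int8 quantize-dequantize round trip; the same helper can be applied to the
outputs of the fp32 and quantized models on representative data.

.. code:: python

    import torch

    def sqnr(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # signal-to-quantized-noise ratio in dB; higher means smaller error
        return 20 * torch.log10(torch.norm(x) / torch.norm(x - y))

    x = torch.randn(128)
    # simulate int8 quantization error on a single tensor
    xq = torch.quantize_per_tensor(x, scale=0.05, zero_point=0, dtype=torch.qint8)
    print(sqnr(x, xq.dequantize()))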

Data insensitive error
~~~~~~~~~~~~~~~~~~~~~~

General tips
^^^^^^^^^^^^

1. For PTQ, ensure that the data you are calibrating with is representative
   of your dataset. For example, for a classification problem a general
   guideline is to have multiple samples in every category, and the overall
   number of samples should be at least 100. There is no penalty for
   calibrating with more data other than increased calibration time.
2. If your model has Conv-BN or Linear-BN patterns, consider fusing them
   (see the sketch after this list). If you are using FX graph mode
   quantization, this is done automatically by the workflow. If you are using
   Eager mode quantization, you can do this manually with the
   ``torch.ao.quantization.fuse_modules`` API.
3. Increase the dtype precision of the problematic ops (the sketch after this
   list shows how to keep a specific module in fp32). Usually, fp32 will have
   the highest accuracy, followed by fp16, followed by dynamically quantized
   int8, followed by statically quantized int8.

   1. Note: this trades off performance for accuracy.
   2. Note: availability of kernels per dtype per op can vary by backend.
   3. Note: dtype conversions add an additional performance cost. For example,
      ``fp32_op -> quant -> int8_op -> dequant -> fp32_op -> quant -> int8_op -> dequant``
      will have a performance penalty compared to
      ``fp32_op -> fp32_op -> quant -> int8_op -> int8_op -> dequant``
      because of the higher number of required dtype conversions.

4. If you are using PTQ, consider using QAT to recover some of the accuracy
   loss from quantization.
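
The sketch below illustrates tips 2 and 3 for Eager mode static quantization:
fusing a Conv-BN-ReLU pattern with ``torch.ao.quantization.fuse_modules`` and
keeping a problematic module in fp32 by not assigning it a qconfig. The toy
model and its module names are placeholders for your own network.

.. code:: python

    import torch
    import torch.nn as nn
    from torch.ao.quantization import (
        DeQuantStub,
        QuantStub,
        convert,
        fuse_modules,
        get_default_qconfig,
        prepare,
    )

    class M(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = QuantStub()
            self.conv = nn.Conv2d(3, 8, 3)
            self.bn = nn.BatchNorm2d(8)
            self.relu = nn.ReLU()
            self.dequant = DeQuantStub()
            self.classifier = nn.Linear(8, 10)

        def forward(self, x):
            x = self.relu(self.bn(self.conv(self.quant(x))))
            x = self.dequant(x).mean(dim=(2, 3))
            return self.classifier(x)

    model = M().eval()

    # tip 2: fuse Conv-BN-ReLU before quantization (Eager mode only; the FX
    # workflow performs this fusion automatically)
    model = fuse_modules(model, [["conv", "bn", "relu"]])

    # tip 3: keep a problematic module in fp32 by not giving it a qconfig
    model.qconfig = get_default_qconfig("fbgemm")
    model.classifier.qconfig = None

    prepared = prepare(model)
    for _ in range(10):                          # calibrate with representative data
        prepared(torch.randn(1, 3, 32, 32))
    quantized = convert(prepared)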

Int8 quantization tips
^^^^^^^^^^^^^^^^^^^^^^

1. If you are using per-tensor weight quantization, consider using per-channel
   weight quantization (see the sketch after this list).
2. If you are doing inference on ``fbgemm``, ensure that you set the
   ``reduce_range`` argument to ``False`` if your CPU is Cooper Lake or newer,
   and to ``True`` otherwise.
3. Audit the input activation distribution variation across different samples.
   If this variation is high, the layer may be suitable for dynamic quantization
   but not static quantization.
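
As an illustration of the first two points, the sketch below builds a custom
``QConfig`` with a per-channel weight observer and an activation observer whose
``reduce_range`` argument is set explicitly. The observer choices here are one
reasonable combination, not the only one.

.. code:: python

    import torch
    from torch.ao.quantization import QConfig
    from torch.ao.quantization.observer import (
        MinMaxObserver,
        PerChannelMinMaxObserver,
    )

    # per-channel symmetric int8 weights, per-tensor affine uint8 activations;
    # set reduce_range=False on Cooper Lake or newer CPUs, True otherwise
    my_qconfig = QConfig(
        activation=MinMaxObserver.with_args(
            dtype=torch.quint8, reduce_range=False
        ),
        weight=PerChannelMinMaxObserver.with_args(
            dtype=torch.qint8, qscheme=torch.per_channel_symmetric
        ),
    )

    # attach to the whole model, or only to specific submodules, before prepare
    # model.qconfig = my_qconfig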

Data sensitive error
~~~~~~~~~~~~~~~~~~~~

If you are using static quantization and a small portion of your input data is
resulting in high quantization error, you can try:

1. Adjust your calibration dataset to make it more representative of your
   inference dataset.
2. Manually inspect (using Numeric Suite, see the sketch after this list) which
   layers have high quantization error. For these layers, consider leaving them
   in floating point or adjusting the observer settings to choose a better
   scale and zero_point.
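
A minimal sketch of the second point, using the Eager mode Numeric Suite
(a prototype API) to compare weights between the two models layer by layer;
``float_model`` and ``quantized_model`` are placeholders for your own fp32
model and its statically quantized counterpart.

.. code:: python

    import torch
    import torch.ao.ns._numeric_suite as ns

    def sqnr(x, y):
        # signal-to-quantized-noise ratio in dB; low values flag problem layers
        return 20 * torch.log10(torch.norm(x) / torch.norm(x - y))

    # float_model and quantized_model are placeholders for your own models
    wt_compare = ns.compare_weights(
        float_model.state_dict(), quantized_model.state_dict()
    )
    for name, tensors in wt_compare.items():
        print(name, sqnr(tensors["float"], tensors["quantized"].dequantize()))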


Implementation error
~~~~~~~~~~~~~~~~~~~~

If you are using PyTorch quantization with your own backend,
you may see differences between the reference implementation of an
operation (such as ``dequant -> op_fp32 -> quant``) and the quantized
implementation (such as ``op_int8``) of the op on the target hardware
(a comparison sketch follows the list below). This could mean one of two things:

1. the differences (usually small) are expected due to specific behavior of
   the target kernel on the target hardware compared to fp32/cpu, for example
   accumulating in an integer dtype. Unless the kernel guarantees bitwise
   equivalence with the reference implementation, this is expected.
2. the kernel on the target hardware has an accuracy issue. In this case, reach
   out to the kernel developer.
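
The sketch below shows one way to make such a comparison for a single op on
CPU, using a toy quantized model: the quantized ``Linear`` kernel is compared
against the reference pattern ``dequant -> fp32 linear -> quant`` computed with
the same quantization parameters. Small integer differences are expected unless
the kernel guarantees bitwise equivalence.

.. code:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.ao.quantization import QuantStub, convert, get_default_qconfig, prepare

    class M(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = QuantStub()
            self.linear = nn.Linear(16, 8)

        def forward(self, x):
            return self.linear(self.quant(x))

    torch.manual_seed(0)
    m = M().eval()
    m.qconfig = get_default_qconfig("fbgemm")
    prepared = prepare(m)
    for _ in range(10):                          # calibrate with representative data
        prepared(torch.randn(4, 16))
    qmodel = convert(prepared)

    x = torch.randn(4, 16)
    xq = qmodel.quant(x)                         # quantized input, as the kernel sees it
    qlinear = qmodel.linear                      # quantized Linear module

    out_int8 = qlinear(xq)                       # quantized kernel

    # reference: dequant -> fp32 linear (with dequantized weight) -> quant,
    # using the same output scale and zero_point as the quantized kernel
    out_fp32 = F.linear(xq.dequantize(), qlinear.weight().dequantize(), qlinear.bias())
    out_ref = torch.quantize_per_tensor(
        out_fp32, qlinear.scale, qlinear.zero_point, torch.quint8
    )

    # maximum difference in the int8 domain; expect it to be small but not
    # necessarily zero
    print((out_int8.int_repr().int() - out_ref.int_repr().int()).abs().max())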

Numerical Debugging Tooling (prototype)
---------------------------------------

.. toctree::
    :hidden:

    torch.ao.ns._numeric_suite
    torch.ao.ns._numeric_suite_fx

.. warning::
    Numerical debugging tooling is an early prototype and subject to change.

* :ref:`torch_ao_ns_numeric_suite`
  Eager mode numeric suite
* :ref:`torch_ao_ns_numeric_suite_fx`
  FX numeric suite
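
As a hedged sketch of the FX Numeric Suite (prototype API, subject to change),
the snippet below extracts and compares weights between an fp32 model and its
FX graph mode quantized counterpart; ``float_model`` and ``quantized_model``
are placeholders for your own models.

.. code:: python

    import torch.ao.ns._numeric_suite_fx as ns_fx
    from torch.ao.ns.fx.utils import compute_sqnr

    # float_model and quantized_model are placeholders for your own fp32 model
    # and its FX graph mode quantized counterpart
    results = ns_fx.extract_weights("fp32", float_model, "int8", quantized_model)

    # attach an SQNR comparison of each layer's weights to the results
    ns_fx.extend_logger_results_with_comparison(
        results, "fp32", "int8", compute_sqnr, "sqnr"
    )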