Quantization Accuracy Debugging
-------------------------------

This document provides high-level strategies for improving quantization
accuracy. If a quantized model shows error compared to the original model,
we can categorize the error into:

1. **data insensitive error** - caused by intrinsic model quantization error;
   a large portion of the input data has large error
2. **data sensitive error** - caused by outlier input data; a small
   portion of the input data has large error
3. **implementation error** - the quantized kernel does not match the reference implementation

Data insensitive error
~~~~~~~~~~~~~~~~~~~~~~

General tips
^^^^^^^^^^^^

1. For PTQ, ensure that the data you are calibrating with is representative
   of your dataset. For example, for a classification problem a general
   guideline is to have multiple samples in every category, and the overall
   number of samples should be at least 100. There is no penalty for
   calibrating with more data other than calibration time.
2. If your model has Conv-BN or Linear-BN patterns, consider fusing them.
   If you are using FX graph mode quantization, this is done automatically
   by the workflow. If you are using Eager mode quantization, you can do
   this manually with the ``torch.ao.quantization.fuse_modules`` API
   (see the fusion sketch after this list).
3. Increase the dtype precision of the problematic ops. Usually, fp32
   will have the highest accuracy, followed by fp16, followed by dynamically
   quantized int8, followed by statically quantized int8.

   1. Note: this trades off performance for accuracy.
   2. Note: availability of kernels per dtype per op can vary by backend.
   3. Note: dtype conversions add an additional performance cost. For example,
      ``fp32_op -> quant -> int8_op -> dequant -> fp32_op -> quant -> int8_op -> dequant``
      will have a performance penalty compared to
      ``fp32_op -> fp32_op -> quant -> int8_op -> int8_op -> dequant``
      because of a higher number of required dtype conversions.

4. If you are using PTQ, consider using QAT to recover some of the accuracy lost
   to quantization (see the QAT sketch after this list).
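
A minimal sketch of Eager mode fusion for tip 2, assuming a hypothetical
``nn.Sequential`` stack with a Conv-BN-ReLU pattern (the submodule names
``"0"``, ``"1"``, ``"2"`` are just the default ``nn.Sequential`` names):

.. code-block:: python

    import torch.nn as nn
    from torch.ao.quantization import fuse_modules

    # hypothetical model containing a Conv-BN-ReLU pattern
    model = nn.Sequential(
        nn.Conv2d(3, 8, 3),
        nn.BatchNorm2d(8),
        nn.ReLU(),
    )
    model.eval()  # Conv-BN fusion for PTQ expects eval mode

    # fuse the Conv-BN-ReLU triple into a single module (returns a fused copy)
    fused_model = fuse_modules(model, [["0", "1", "2"]])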
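
For tip 4, a minimal Eager mode QAT sketch, assuming a hypothetical toy model
with ``QuantStub``/``DeQuantStub`` marking the quantized region and a training
loop you already have:

.. code-block:: python

    import torch
    import torch.nn as nn
    from torch.ao.quantization import (
        QuantStub, DeQuantStub, get_default_qat_qconfig, prepare_qat, convert,
    )

    # hypothetical small model; QuantStub/DeQuantStub delimit the quantized region
    class SmallModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = QuantStub()
            self.conv = nn.Conv2d(3, 8, 3)
            self.relu = nn.ReLU()
            self.dequant = DeQuantStub()

        def forward(self, x):
            return self.dequant(self.relu(self.conv(self.quant(x))))

    model = SmallModel().train()
    model.qconfig = get_default_qat_qconfig("fbgemm")
    prepare_qat(model, inplace=True)  # insert fake-quantize modules

    # ... fine-tune `model` here with your usual training loop ...

    model.eval()
    quantized_model = convert(model)  # swap fake-quant modules for int8 kernels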

Int8 quantization tips
^^^^^^^^^^^^^^^^^^^^^^

1. If you are using per-tensor weight quantization, consider using per-channel
   weight quantization.
2. If you are doing inference on ``fbgemm``, ensure that you set the
   ``reduce_range`` argument to ``False`` if your CPU is Cooper Lake or newer,
   and to ``True`` otherwise (an example qconfig combining tips 1 and 2
   follows this list).
3. Audit the input activation distribution variation across different samples.
   If this variation is high, the layer may be suitable for dynamic quantization
   but not static quantization.
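
A minimal sketch of a qconfig that combines tips 1 and 2, assuming the
``fbgemm`` backend; the observer classes, dtypes, and qschemes below are
common defaults, not the only valid choice:

.. code-block:: python

    import torch
    import torch.nn as nn
    from torch.ao.quantization import QConfig
    from torch.ao.quantization.observer import MinMaxObserver, PerChannelMinMaxObserver

    qconfig = QConfig(
        # per-tensor activation observer; reduce_range depends on your CPU
        activation=MinMaxObserver.with_args(
            dtype=torch.quint8,
            qscheme=torch.per_tensor_affine,
            reduce_range=False,  # use True on CPUs older than Cooper Lake
        ),
        # per-channel weight observer instead of per-tensor
        weight=PerChannelMinMaxObserver.with_args(
            dtype=torch.qint8,
            qscheme=torch.per_channel_symmetric,
        ),
    )

    # hypothetical toy model; assign the qconfig before calling prepare()
    model = nn.Sequential(nn.Linear(8, 8), nn.ReLU())
    model.qconfig = qconfig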

Data sensitive error
~~~~~~~~~~~~~~~~~~~~

If you are using static quantization and a small portion of your input data is
resulting in high quantization error, you can try:

1. Adjusting your calibration dataset to make it more representative of your
   inference dataset.
2. Manually inspecting (using Numeric Suite) which layers have high quantization
   error. For these layers, consider leaving them in floating point or adjusting
   the observer settings to choose a better scale and zero_point (a sketch of
   the Numeric Suite comparison follows this list).
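
A minimal sketch of the Eager mode Numeric Suite comparison, assuming a
hypothetical toy model that is statically quantized in place so the snippet is
self-contained; compute a per-layer SQNR and look for low values:

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.ao.ns._numeric_suite as ns
    from torch.ao.quantization import (
        QuantStub, DeQuantStub, get_default_qconfig, prepare, convert,
    )

    # hypothetical toy model so the comparison below is self-contained
    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = QuantStub()
            self.fc = nn.Linear(8, 8)
            self.dequant = DeQuantStub()

        def forward(self, x):
            return self.dequant(self.fc(self.quant(x)))

    def sqnr(x, y):
        # signal-to-quantization-noise ratio in dB; low values flag problem layers
        return 20 * torch.log10(torch.norm(x) / torch.norm(x - y))

    float_model = Toy().eval()
    float_model.qconfig = get_default_qconfig("fbgemm")
    prepared = prepare(float_model)      # returns an observed copy
    prepared(torch.randn(64, 8))         # calibration pass
    quantized_model = convert(prepared)

    # per-layer weight comparison between the fp32 and int8 models
    wt_cmp = ns.compare_weights(float_model.state_dict(), quantized_model.state_dict())
    for key, val in wt_cmp.items():
        print(key, sqnr(val["float"], val["quantized"].dequantize()))

    # per-layer activation comparison on one (problematic) input
    act_cmp = ns.compare_model_outputs(float_model, quantized_model, torch.randn(1, 8))
    for key, val in act_cmp.items():
        q = val["quantized"][0]
        q = q.dequantize() if q.is_quantized else q
        print(key, sqnr(val["float"][0], q))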


Implementation error
~~~~~~~~~~~~~~~~~~~~

If you are using PyTorch quantization with your own backend,
you may see differences between the reference implementation of an
operation (such as ``dequant -> op_fp32 -> quant``) and the quantized
implementation (such as ``op_int8``) of the op on the target hardware.
This could mean one of two things:

1. the differences (usually small) are expected due to specific behavior of
   the target kernel on the target hardware compared to fp32/cpu. An example of
   this is accumulating in an integer dtype. Unless the kernel guarantees bitwise
   equivalence with the reference implementation, this is expected.
2. the kernel on the target hardware has an accuracy issue. In this case, reach
   out to the kernel developer.
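
To quantify such differences, a minimal sketch of the comparison pattern; the
built-in quantized ``Linear`` kernel stands in for a backend kernel here, and
the scales and zero points are illustrative assumptions, not recommendations:

.. code-block:: python

    import torch
    import torch.ao.nn.quantized as nnq

    torch.manual_seed(0)
    fp32_linear = torch.nn.Linear(8, 8).eval()

    # int8 counterpart of the op, with illustrative (assumed) quantization params
    q_linear = nnq.Linear(8, 8)
    q_linear.set_weight_bias(
        torch.quantize_per_tensor(fp32_linear.weight.detach(), 0.02, 0, torch.qint8),
        fp32_linear.bias.detach(),
    )
    q_linear.scale = 0.05
    q_linear.zero_point = 128

    x = torch.randn(4, 8)
    xq = torch.quantize_per_tensor(x, 0.1, 128, torch.quint8)

    with torch.no_grad():
        # reference path: dequant -> op_fp32 -> quant
        reference = torch.quantize_per_tensor(
            fp32_linear(xq.dequantize()), q_linear.scale, q_linear.zero_point, torch.quint8
        )
        # quantized kernel: op_int8
        actual = q_linear(xq)

    # small differences are expected unless the kernel guarantees bitwise equivalence
    print((actual.dequantize() - reference.dequantize()).abs().max())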

Numerical Debugging Tooling (prototype)
---------------------------------------

.. toctree::
    :hidden:

    torch.ao.ns._numeric_suite
    torch.ao.ns._numeric_suite_fx
.. warning::

    Numerical debugging tooling is an early prototype and subject to change.

* :ref:`torch_ao_ns_numeric_suite`
  Eager mode numeric suite
* :ref:`torch_ao_ns_numeric_suite_fx`
  FX numeric suite