| .. _numerical_accuracy: |
| |
| Numerical accuracy |
| ================== |
| |
In modern computers, floating point numbers are represented using the IEEE 754 standard.
For more details on floating point arithmetic and the IEEE 754 standard, please see
`Floating point arithmetic <https://en.wikipedia.org/wiki/Floating-point_arithmetic>`_.
| In particular, note that floating point provides limited accuracy (about 7 decimal digits |
| for single precision floating point numbers, about 16 decimal digits for double precision |
| floating point numbers) and that floating point addition and multiplication are not |
| associative, so the order of the operations affects the results. |
Because of this, PyTorch is not guaranteed
| to produce bitwise identical results for floating point computations that are |
| mathematically identical. Similarly, bitwise identical results are not guaranteed across |
| PyTorch releases, individual commits, or different platforms. In particular, CPU and GPU |
| results can be different even for bitwise-identical inputs and even after controlling for |
| the sources of randomness. |
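
As a minimal illustration of non-associativity, consider the following sketch (the values are
chosen only to make the effect visible in single precision):

.. code:: python

    import torch

    a = torch.tensor(1e20)   # fp32 by default
    b = torch.tensor(-1e20)
    c = torch.tensor(1.0)

    (a + b) + c  # tensor(1.)
    a + (b + c)  # tensor(0.), because b + c rounds to -1e20 in fp32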
| |
| Batched computations or slice computations |
| ------------------------------------------ |
| |
Many operations in PyTorch support batched computation, where the same operation is performed
for the elements of the batches of inputs. Examples of this are :meth:`torch.mm` and
:meth:`torch.bmm`. While it is possible to implement batched computation as a loop over batch
elements, applying the necessary math operations to each batch element individually, for
efficiency reasons we do not do that, and typically perform the computation for the whole batch.
The mathematical libraries that we call, and PyTorch's internal implementations of operations,
can produce slightly different results in this case, compared to non-batched computations.
In particular,
| let ``A`` and ``B`` be 3D tensors with the dimensions suitable for batched matrix multiplication. |
| Then ``(A@B)[0]`` (the first element of the batched result) is not guaranteed to be bitwise |
| identical to ``A[0]@B[0]`` (the matrix product of the first elements of the input batches) |
| even though mathematically it's an identical computation. |
| |
| Similarly, an operation applied to a tensor slice is not guaranteed to produce results that are |
| identical to the slice of the result of the same operation applied to the full tensor. E.g. let |
``A`` be a 2-dimensional tensor. ``A.sum(-1)[0]`` is not guaranteed to be bitwise equal to
``A[0,:].sum()``.
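
The sketch below makes these caveats concrete; on any particular machine the compared values
may happen to match, but they are not guaranteed to be bitwise equal:

.. code:: python

    import torch

    torch.manual_seed(0)
    A = torch.randn(8, 128, 128)
    B = torch.randn(8, 128, 128)

    # Mathematically identical, but not guaranteed to be bitwise equal:
    print(torch.equal((A @ B)[0], A[0] @ B[0]))     # may print False
    print(torch.allclose((A @ B)[0], A[0] @ B[0]))  # expected to print True

    C = torch.randn(128, 128)
    # The same caveat applies to a reduction over a slice vs. a slice of a reduction:
    print(torch.equal(C.sum(-1)[0], C[0, :].sum()))  # may print False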
| |
| Extremal values |
| --------------- |
| |
When inputs contain large values, such that intermediate results may overflow the range of the
datatype used, the end result may overflow too, even though it is representable in the original
datatype. E.g.:
| |
| .. code:: python |
| |
    import torch

    a = torch.tensor([1e20, 1e20])  # fp32 dtype by default
    a.norm()  # produces tensor(inf): the intermediate sum of squares (2e40) overflows fp32
    a.double().norm()  # produces tensor(1.4142e+20, dtype=torch.float64), representable in fp32
| |
TensorFloat-32 (TF32) on Nvidia Ampere devices
----------------------------------------------
| |
On Ampere Nvidia GPUs, PyTorch by default uses TensorFloat-32 (TF32) to speed up mathematically
intensive operations, in particular matrix multiplications and convolutions. When an operation
is performed using TF32 tensor cores, only the first 10 bits of the input mantissa are read.
This can lead to less accurate results, and to surprising behavior such as multiplying a matrix
by the identity matrix producing results that differ from the input.
Most neural network workloads have the same convergence behavior when using TF32 as they have
with fp32; however, if better accuracy is desired, TF32 can be turned off with
``torch.backends.cuda.matmul.allow_tf32 = False``.
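
The identity-matrix surprise mentioned above can be observed with a sketch like the following
(this requires an Ampere or newer GPU, and the exact differences depend on the hardware):

.. code:: python

    import torch

    if torch.cuda.is_available():
        A = torch.randn(1024, 1024, device="cuda")
        eye = torch.eye(1024, device="cuda")

        torch.backends.cuda.matmul.allow_tf32 = True
        print(torch.equal(A @ eye, A))  # may print False: TF32 truncates the input mantissa

        torch.backends.cuda.matmul.allow_tf32 = False
        print(torch.equal(A @ eye, A))  # expected to print True with full fp32 matmul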
| |
For more information see :ref:`TensorFloat32<tf32_on_ampere>`.
| |
| Reduced Precision Reduction for FP16 GEMMs |
| ------------------------------------------ |

Half-precision GEMM operations are typically done with intermediate accumulations (reduction)
in single-precision for numerical accuracy and improved resilience to overflow. For performance,
certain GPU architectures, especially more recent ones, allow a few truncations of the
intermediate accumulation results to the reduced precision (e.g., half-precision). This change
is often benign from the perspective of model convergence, though it may lead to unexpected
results (e.g., ``inf`` values when the final result should be representable in half-precision).
If reduced-precision reductions are problematic, they can be turned off with
``torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False``.
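
As a sketch of toggling this flag (the effect is architecture-dependent; on GPUs that always
accumulate in full precision the two results below may be bitwise identical):

.. code:: python

    import torch

    if torch.cuda.is_available():
        a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
        b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

        torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True
        reduced = a @ b

        torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
        full = a @ b

        # May print False on architectures that truncate intermediate accumulations:
        print(torch.equal(reduced, full))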
| |
For more information see :ref:`allow_fp16_reduced_precision_reduction<fp16reducedprecision>`.