[CUBLAS][TF32][CUDNN] Update numerical_accuracy.rst (#79537)

CC @mruberry @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79537
Approved by: https://github.com/ngimel, https://github.com/mruberry
diff --git a/docs/source/notes/numerical_accuracy.rst b/docs/source/notes/numerical_accuracy.rst
index c952fb1..b1d05f9 100644
--- a/docs/source/notes/numerical_accuracy.rst
+++ b/docs/source/notes/numerical_accuracy.rst
@@ -54,16 +54,29 @@
 TensorFloat-32 (TF32) on Nvidia Ampere devices
 ----------------------------------------------
 
-On Ampere Nvidia GPUs, PyTorch by default uses TensorFloat32 (TF32) to speed up mathematically
-intensive operations, in particular matrix multiplications and convolutions. When operation is performed
-using TF32 tensor cores, only the first 10 bits of the input mantissa are read. This leads to less accurate
-results, and surprising results such as multiplying a matrix by identity matrix produces
-results that are different from the input.
-Most neural network workloads have the same convergence behavior when using tf32 as they have
-with fp32, however, if better accuracy is desired, TF32 can be turned off with
-``torch.backends.cuda.matmul.allow_tf32 = False``
+On Ampere Nvidia GPUs, PyTorch can use TensorFloat32 (TF32) to speed up mathematically intensive operations, in particular matrix multiplications and convolutions.
+When an operation is performed using TF32 tensor cores, only the first 10 bits of the input mantissa are read.
+This may reduce accuracy and produce surprising results (e.g., multiplying a matrix by the identity matrix may not exactly reproduce the input).
+By default, TF32 tensor cores are disabled for matrix multiplications and enabled for convolutions, although most neural network workloads show the same convergence behavior with TF32 as with fp32.
+We recommend enabling TF32 tensor cores for matrix multiplications with ``torch.backends.cuda.matmul.allow_tf32 = True`` if your network does not need full float32 precision.
+If your network needs full float32 precision for both matrix multiplications and convolutions, then TF32 tensor cores can also be disabled for convolutions with ``torch.backends.cudnn.allow_tf32 = False``.
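+For example, a minimal sketch of observing this behavior (this assumes an Ampere or newer GPU with CUDA available; the tensor sizes are arbitrary)::
+
+    import torch
+
+    a = torch.randn(1024, 1024, device="cuda")
+    identity = torch.eye(1024, device="cuda")
+
+    # With TF32 enabled for matmuls, only 10 bits of the input mantissa
+    # are read, so multiplying by the identity may not reproduce the input.
+    torch.backends.cuda.matmul.allow_tf32 = True
+    print((a @ identity - a).abs().max())  # typically nonzero on Ampere
+
+    # With TF32 disabled, the matmul runs in full float32 precision.
+    torch.backends.cuda.matmul.allow_tf32 = False
+    print((a @ identity - a).abs().max())  # expected to be 0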
 
-For more information see :ref:`TensorFloat32<tf32_on_ampere>`
+For more information see :ref:`TensorFloat32<tf32_on_ampere>`.
 
 Reduced Precision Reduction for FP16 GEMMs
 ------------------------------------------