[CUBLAS][TF32][CUDNN] Update numerical_accuracy.rst (#79537)

CC @mruberry @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79537
Approved by: https://github.com/ngimel, https://github.com/mruberry
diff --git a/docs/source/notes/numerical_accuracy.rst b/docs/source/notes/numerical_accuracy.rst
index c952fb1..b1d05f9 100644
--- a/docs/source/notes/numerical_accuracy.rst
+++ b/docs/source/notes/numerical_accuracy.rst
@@ -54,16 +54,29 @@
 TensorFloat-32 (TF32) on Nvidia Ampere devices
 ----------------------------------------------
 
-On Ampere Nvidia GPUs, PyTorch by default uses TensorFloat32 (TF32) to speed up mathematically
-intensive operations, in particular matrix multiplications and convolutions. When operation is performed
-using TF32 tensor cores, only the first 10 bits of the input mantissa are read. This leads to less accurate
-results, and surprising results such as multiplying a matrix by identity matrix produces
-results that are different from the input.
-Most neural network workloads have the same convergence behavior when using tf32 as they have
-with fp32, however, if better accuracy is desired, TF32 can be turned off with
-``torch.backends.cuda.matmul.allow_tf32 = False``
+On Ampere Nvidia GPUs, PyTorch can use TensorFloat32 (TF32) to speed up mathematically intensive operations, in particular matrix multiplications and convolutions.
+When an operation is performed using TF32 tensor cores, only the first 10 bits of the input mantissa are read.
+This may reduce accuracy and produce surprising results (e.g., multiplying a matrix by the identity matrix may not exactly reproduce the input).
+By default, TF32 tensor cores are disabled for matrix multiplications and enabled for convolutions, although most neural network workloads show the same convergence behavior with TF32 as with fp32.
+We recommend enabling TF32 tensor cores for matrix multiplications with ``torch.backends.cuda.matmul.allow_tf32 = True`` if your network does not need full float32 precision.
+If your network needs full float32 precision for both matrix multiplications and convolutions, then TF32 tensor cores can also be disabled for convolutions with ``torch.backends.cudnn.allow_tf32 = False``.
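+For example, a minimal sketch of observing this behavior (this assumes an Ampere or newer GPU with CUDA available; the tensor sizes are arbitrary)::
+
+    import torch
+
+    a = torch.randn(1024, 1024, device="cuda")
+    identity = torch.eye(1024, device="cuda")
+
+    # With TF32 enabled for matmuls, only 10 bits of the input mantissa
+    # are read, so multiplying by the identity may not reproduce the input.
+    torch.backends.cuda.matmul.allow_tf32 = True
+    print((a @ identity - a).abs().max())  # typically nonzero on Ampere
+
+    # With TF32 disabled, the matmul runs in full float32 precision.
+    torch.backends.cuda.matmul.allow_tf32 = False
+    print((a @ identity - a).abs().max())  # expected to be 0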
 
-For more information see :ref:`TensorFloat32<tf32_on_ampere>`
+For more information see :ref:`TensorFloat32<tf32_on_ampere>`.
 
 Reduced Precision Reduction for FP16 GEMMs
 ------------------------------------------