Fix many doc issues (#37099)

Summary:
Fix https://github.com/pytorch/pytorch/issues/35643 https://github.com/pytorch/pytorch/issues/37063 https://github.com/pytorch/pytorch/issues/36307 https://github.com/pytorch/pytorch/issues/35861 https://github.com/pytorch/pytorch/issues/35299 https://github.com/pytorch/pytorch/issues/23108 https://github.com/pytorch/pytorch/issues/4661

Just a bunch of small updates to the docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37099

Differential Revision: D21185713

Pulled By: albanD

fbshipit-source-id: 4ac06d6709dc0da6109a6ad3daae75667ee5863e
diff --git a/docs/source/notes/extending.rst b/docs/source/notes/extending.rst
index 020c213..b175aa5 100644
--- a/docs/source/notes/extending.rst
+++ b/docs/source/notes/extending.rst
@@ -121,6 +121,11 @@
     track history. So if ``backward`` is implemented with differentiable
     operations, (e.g., invocation of another custom
     :class:`~torch.autograd.function`), higher order derivatives will work.
+    In this case, Tensors saved with ``save_for_backward`` can also be used in
+    ``backward`` and will have gradients flowing back to them, but Tensors saved
+    in the ``ctx`` directly will not have gradients flowing back.
+    If you need gradients to flow back to a Tensor saved in the ``ctx``, you should
+    make it an output of the custom ``Function`` and save it with ``save_for_backward``.
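For illustration, a minimal sketch of a custom ``Function`` whose ``backward`` reuses the Tensor
saved with ``save_for_backward`` and only performs differentiable operations, so second-order
gradients flow through it (the ``Cube`` name and shapes are made up for this example)::

    >>> import torch
    >>> from torch.autograd import Function
    >>> class Cube(Function):
    ...     @staticmethod
    ...     def forward(ctx, x):
    ...         ctx.save_for_backward(x)
    ...         return x ** 3
    ...     @staticmethod
    ...     def backward(ctx, grad_output):
    ...         x, = ctx.saved_tensors
    ...         # only differentiable ops here, so double backward works
    ...         return grad_output * 3 * x ** 2
    ...
    >>> x = torch.randn(3, dtype=torch.double, requires_grad=True)
    >>> gx, = torch.autograd.grad(Cube.apply(x).sum(), x, create_graph=True)
    >>> gx.sum().backward()   # second-order gradients flow through the saved ``x``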
 
 You probably want to check if the backward method you implemented actually
 computes the derivatives of your function. It is possible by comparing with
@@ -136,6 +141,8 @@
     print(test)
 
 See :ref:`grad-check` for more details on finite-difference gradient comparisons.
+If your function is used in higher order derivatives (i.e., the backward pass is itself
+differentiated), you can use the ``gradgradcheck`` function from the same package to check them.
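A minimal sketch of such a check (``torch.sin`` stands in here for your own ``Function.apply``)::

    >>> import torch
    >>> from torch.autograd import gradgradcheck
    >>> inp = torch.randn(4, dtype=torch.double, requires_grad=True)
    >>> gradgradcheck(torch.sin, (inp,), eps=1e-6, atol=1e-4)
    True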
 
 Extending :mod:`torch.nn`
 -------------------------
diff --git a/torch/autograd/__init__.py b/torch/autograd/__init__.py
index c1b078f..a2539e3 100644
--- a/torch/autograd/__init__.py
+++ b/torch/autograd/__init__.py
@@ -57,6 +57,13 @@
     This function accumulates gradients in the leaves - you might need to zero
     them before calling it.
 
+    .. note::
+        Using this method with ``create_graph=True`` will create a reference cycle
+        between the parameter and its gradient, which can cause a memory leak.
+        We recommend using ``autograd.grad`` when creating the graph to avoid this.
+        If you have to use this function, make sure to reset the ``.grad`` fields of your
+        parameters to ``None`` after use to break the cycle and avoid the leak.
+
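A minimal sketch of both patterns described in the note (``w`` is just an illustrative parameter)::

    >>> import torch
    >>> w = torch.randn(3, requires_grad=True)   # an illustrative parameter
    >>> loss = (w ** 2).sum()
    >>> # recommended: build the higher-order graph without touching ``.grad``
    >>> gw, = torch.autograd.grad(loss, w, create_graph=True)
    >>> # if you must call backward(create_graph=True) instead, break the cycle afterwards
    >>> loss.backward(create_graph=True)
    >>> w.grad = None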
     Arguments:
         tensors (sequence of Tensor): Tensors of which the derivative will be
             computed.
diff --git a/torch/distributions/laplace.py b/torch/distributions/laplace.py
index 748ab77..d7ec01c 100644
--- a/torch/distributions/laplace.py
+++ b/torch/distributions/laplace.py
@@ -7,7 +7,7 @@
 
 class Laplace(Distribution):
     r"""
-    Creates a Laplace distribution parameterized by :attr:`loc` and :attr:'scale'.
+    Creates a Laplace distribution parameterized by :attr:`loc` and :attr:`scale`.
 
     Example::
 
diff --git a/torch/functional.py b/torch/functional.py
index 54bca28..c2e3a56 100644
--- a/torch/functional.py
+++ b/torch/functional.py
@@ -238,6 +238,12 @@
            the ellipsis dimensions are at the beginning of the output.
     operands (Tensor): The operands to compute the Einstein sum of.
 
+.. note::
+
+    This function does not optimize the given expression, so a different formula for the same computation may
+    run faster or consume less memory. Projects like opt_einsum (https://optimized-einsum.readthedocs.io/en/stable/)
+    can optimize the formula for you.
+
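For example, the same contraction can be written in two equivalent ways, one of which is
typically cheaper because of how the operands are associated (shapes here are arbitrary)::

    >>> import torch
    >>> A = torch.randn(8, 64)
    >>> B = torch.randn(64, 64)
    >>> v = torch.randn(64)
    >>> # one formula for A @ B @ v ...
    >>> out1 = torch.einsum('ij,jk,k->i', A, B, v)
    >>> # ... and an equivalent re-association that does the matrix-vector product first
    >>> out2 = torch.einsum('ij,j->i', A, torch.einsum('jk,k->j', B, v))
    >>> torch.allclose(out1, out2)
    True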
 Examples::
 
     >>> x = torch.randn(5)
diff --git a/torch/nn/functional.py b/torch/nn/functional.py
index b669886..0261943 100644
--- a/torch/nn/functional.py
+++ b/torch/nn/functional.py
@@ -1628,7 +1628,7 @@
     # type: (Tensor, Tensor, Tensor, Optional[Tensor]) -> Tensor
     r"""
     Applies a bilinear transformation to the incoming data:
-    :math:`y = x_1 A x_2 + b`
+    :math:`y = x_1^T A x_2 + b`
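A sketch of what this formula computes (shapes here are made up; ``A`` plays the role of the
``weight`` argument)::

    >>> import torch
    >>> import torch.nn.functional as F
    >>> x1 = torch.randn(8, 5)          # (N, in1_features)
    >>> x2 = torch.randn(8, 7)          # (N, in2_features)
    >>> A = torch.randn(3, 5, 7)        # (out_features, in1_features, in2_features)
    >>> b = torch.randn(3)
    >>> out = F.bilinear(x1, x2, A, b)  # shape (8, 3)
    >>> expected = torch.einsum('ni,oij,nj->no', x1, A, x2) + b
    >>> torch.allclose(out, expected, atol=1e-6)
    True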
 
     Shape:
 
@@ -3679,7 +3679,7 @@
     tensor.
 
     .. warning::
-        Currently, only 4-D output tensors (batched image-like tensors) are
+        Currently, only 3-D input tensors (unfolded batched image-like tensors) are
         supported.
 
     See :class:`torch.nn.Fold` for details
diff --git a/torch/nn/modules/linear.py b/torch/nn/modules/linear.py
index 8cdc6c0..44f7339 100644
--- a/torch/nn/modules/linear.py
+++ b/torch/nn/modules/linear.py
@@ -94,7 +94,7 @@
 
 class Bilinear(Module):
     r"""Applies a bilinear transformation to the incoming data:
-    :math:`y = x_1 A x_2 + b`
+    :math:`y = x_1^T A x_2 + b`
 
     Args:
         in1_features: size of each first input sample
diff --git a/torch/nn/modules/module.py b/torch/nn/modules/module.py
index 7f9a94b..1e6e121 100644
--- a/torch/nn/modules/module.py
+++ b/torch/nn/modules/module.py
@@ -453,6 +453,15 @@
     def register_backward_hook(self, hook):
         r"""Registers a backward hook on the module.
 
+        .. warning ::
+
+            The current implementation will not have the presented behavior
+            for a complex :class:`Module` that performs many operations.
+            In some failure cases, :attr:`grad_input` and :attr:`grad_output` will only
+            contain the gradients for a subset of the inputs and outputs.
+            For such a :class:`Module`, you should use :func:`torch.Tensor.register_hook`
+            directly on a specific input or output to get the required gradients.
+
         The hook will be called every time the gradients with respect to module
         inputs are computed. The hook should have the following signature::
 
@@ -462,22 +471,14 @@
         module has multiple inputs or outputs. The hook should not modify its
         arguments, but it can optionally return a new gradient with respect to
         input that will be used in place of :attr:`grad_input` in subsequent
-        computations.
+        computations. :attr:`grad_input` will only correspond to the inputs given
+        as positional arguments; keyword arguments are not included.
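As the warning above suggests, hooking a specific Tensor directly is more reliable for modules
that perform many operations; a minimal sketch (the small ``net`` is only illustrative)::

    >>> import torch
    >>> import torch.nn as nn
    >>> net = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
    >>> x = torch.randn(2, 4, requires_grad=True)
    >>> h = x.register_hook(lambda grad: print('grad w.r.t. x:', grad.shape))
    >>> net(x).sum().backward()
    grad w.r.t. x: torch.Size([2, 4])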
 
         Returns:
             :class:`torch.utils.hooks.RemovableHandle`:
                 a handle that can be used to remove the added hook by calling
                 ``handle.remove()``
 
-        .. warning ::
-
-            The current implementation will not have the presented behavior
-            for complex :class:`Module` that perform many operations.
-            In some failure cases, :attr:`grad_input` and :attr:`grad_output` will only
-            contain the gradients for a subset of the inputs and outputs.
-            For such :class:`Module`, you should use :func:`torch.Tensor.register_hook`
-            directly on a specific input or output to get the required gradients.
-
         """
         handle = hooks.RemovableHandle(self._backward_hooks)
         self._backward_hooks[handle.id] = hook
@@ -491,6 +492,8 @@
 
             hook(module, input) -> None or modified input
 
+        The input contains only the positional arguments given to the module.
+        Keyword arguments won't be passed to the hooks, only to the ``forward``.
         The hook can modify the input. User can either return a tuple or a
         single modified value in the hook. We will wrap the value into a tuple
         if a single value is returned(unless that value is already a tuple).
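A minimal sketch of a pre-hook that rewrites the positional input (``lin`` and the doubling are
only illustrative)::

    >>> import torch
    >>> import torch.nn as nn
    >>> lin = nn.Linear(3, 1)
    >>> def pre_hook(module, inputs):
    ...     # ``inputs`` holds only the positional arguments, as a tuple
    ...     return (inputs[0] * 2,)
    ...
    >>> h = lin.register_forward_pre_hook(pre_hook)
    >>> x = torch.ones(1, 3)
    >>> out_hooked = lin(x)   # the hook doubles the input first
    >>> h.remove()
    >>> torch.allclose(out_hooked, lin(x * 2))
    True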
@@ -512,6 +515,8 @@
 
             hook(module, input, output) -> None or modified output
 
+        The input contains only the positional arguments given to the module.
+        Keyword arguments won't be passed to the hooks, only to the ``forward``.
         The hook can modify the output. It can modify the input inplace but
         it will not have effect on forward since this is called after
         :func:`forward` is called.
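A minimal sketch of a forward hook that captures the output (names here are only illustrative)::

    >>> import torch
    >>> import torch.nn as nn
    >>> lin = nn.Linear(3, 2)
    >>> captured = {}
    >>> def fwd_hook(module, inputs, output):
    ...     # ``inputs`` again contains only the positional arguments
    ...     captured['out'] = output.detach()
    ...
    >>> h = lin.register_forward_hook(fwd_hook)
    >>> _ = lin(torch.randn(4, 3))
    >>> captured['out'].shape
    torch.Size([4, 2])
    >>> h.remove()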
diff --git a/torch/nn/modules/pooling.py b/torch/nn/modules/pooling.py
index 7a84222..32442ba 100644
--- a/torch/nn/modules/pooling.py
+++ b/torch/nn/modules/pooling.py
@@ -517,7 +517,7 @@
         padding: implicit zero padding to be added on both sides
         ceil_mode: when True, will use `ceil` instead of `floor` to compute the output shape
         count_include_pad: when True, will include the zero-padding in the averaging calculation
-        divisor_override: if specified, it will be used as divisor, otherwise attr:`kernel_size` will be used
+        divisor_override: if specified, it will be used as divisor, otherwise :attr:`kernel_size` will be used
 
     Shape:
         - Input: :math:`(N, C, H_{in}, W_{in})`
@@ -588,7 +588,7 @@
         padding: implicit zero padding to be added on all three sides
         ceil_mode: when True, will use `ceil` instead of `floor` to compute the output shape
         count_include_pad: when True, will include the zero-padding in the averaging calculation
-        divisor_override: if specified, it will be used as divisor, otherwise attr:`kernel_size` will be used
+        divisor_override: if specified, it will be used as divisor, otherwise :attr:`kernel_size` will be used
 
     Shape:
         - Input: :math:`(N, C, D_{in}, H_{in}, W_{in})`