Fix many doc issues (#37099)
Summary:
Fix https://github.com/pytorch/pytorch/issues/35643 https://github.com/pytorch/pytorch/issues/37063 https://github.com/pytorch/pytorch/issues/36307 https://github.com/pytorch/pytorch/issues/35861 https://github.com/pytorch/pytorch/issues/35299 https://github.com/pytorch/pytorch/issues/23108 https://github.com/pytorch/pytorch/issues/4661
Just a bunch of small updates to the docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37099
Differential Revision: D21185713
Pulled By: albanD
fbshipit-source-id: 4ac06d6709dc0da6109a6ad3daae75667ee5863e
diff --git a/docs/source/notes/extending.rst b/docs/source/notes/extending.rst
index 020c213..b175aa5 100644
--- a/docs/source/notes/extending.rst
+++ b/docs/source/notes/extending.rst
@@ -121,6 +121,11 @@
track history. So if ``backward`` is implemented with differentiable
operations, (e.g., invocation of another custom
:class:`~torch.autograd.function`), higher order derivatives will work.
+ In this case, the Tensors saved with ``save_for_backward`` can also be used
+ in the backward pass and will have gradients flowing back to them, but Tensors
+ saved in the ``ctx`` will not have gradients flowing back.
+ If you need gradients to flow back to a Tensor saved in the ``ctx``, you should
+ instead make it an output of the custom ``Function`` and save it with ``save_for_backward``.
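+ For example, a minimal sketch of a custom ``Function`` whose ``backward`` is
+ itself differentiable (the function below is purely illustrative)::
+
+     >>> class MySquare(torch.autograd.Function):
+     ...     @staticmethod
+     ...     def forward(ctx, x):
+     ...         ctx.save_for_backward(x)
+     ...         return x ** 2
+     ...     @staticmethod
+     ...     def backward(ctx, grad_output):
+     ...         x, = ctx.saved_tensors
+     ...         # x was saved with save_for_backward, so gradients can flow back to it
+     ...         return 2 * x * grad_output
+     >>> x = torch.randn(3, requires_grad=True)
+     >>> gx, = torch.autograd.grad(MySquare.apply(x).sum(), x, create_graph=True)
+     >>> gx.sum().backward()  # second order gradients flow through ``backward``
+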
You probably want to check if the backward method you implemented actually
computes the derivatives of your function. It is possible by comparing with
@@ -136,6 +141,8 @@
print(test)
See :ref:`grad-check` for more details on finite-difference gradient comparisons.
+If your function is used in higher order derivatives (i.e., the backward pass is differentiated), you
+can use the ``gradgradcheck`` function from the same package to check that case as well.
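+For example, a short sketch (double precision inputs are used, as recommended for
+finite-difference checks; ``torch.sin`` stands in here for your custom function)::
+
+    >>> from torch.autograd import gradcheck, gradgradcheck
+    >>> inp = torch.randn(4, dtype=torch.double, requires_grad=True)
+    >>> gradcheck(torch.sin, (inp,))      # checks first order derivatives
+    True
+    >>> gradgradcheck(torch.sin, (inp,))  # checks second order derivatives
+    True
+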
Extending :mod:`torch.nn`
-------------------------
diff --git a/torch/autograd/__init__.py b/torch/autograd/__init__.py
index c1b078f..a2539e3 100644
--- a/torch/autograd/__init__.py
+++ b/torch/autograd/__init__.py
@@ -57,6 +57,13 @@
This function accumulates gradients in the leaves - you might need to zero
them before calling it.
+ .. note::
+ Using this function with ``create_graph=True`` will create a reference cycle
+ between the parameter and its gradient, which can cause a memory leak.
+ We recommend using ``autograd.grad`` when creating the graph to avoid this.
+ If you have to use this function, make sure to reset the ``.grad`` fields of your
+ parameters to ``None`` after use to break the cycle and avoid the leak.
+
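+ For example, an illustrative sketch of the recommended ``autograd.grad`` pattern
+ (the function being differentiated here is arbitrary)::
+
+     >>> x = torch.randn(3, requires_grad=True)
+     >>> y = (x ** 3).sum()
+     >>> gx, = torch.autograd.grad(y, x, create_graph=True)  # does not touch ``x.grad``
+     >>> gx.sum().backward()  # second order gradients are accumulated into ``x.grad``
+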
Arguments:
tensors (sequence of Tensor): Tensors of which the derivative will be
computed.
diff --git a/torch/distributions/laplace.py b/torch/distributions/laplace.py
index 748ab77..d7ec01c 100644
--- a/torch/distributions/laplace.py
+++ b/torch/distributions/laplace.py
@@ -7,7 +7,7 @@
class Laplace(Distribution):
r"""
- Creates a Laplace distribution parameterized by :attr:`loc` and :attr:'scale'.
+ Creates a Laplace distribution parameterized by :attr:`loc` and :attr:`scale`.
Example::
diff --git a/torch/functional.py b/torch/functional.py
index 54bca28..c2e3a56 100644
--- a/torch/functional.py
+++ b/torch/functional.py
@@ -238,6 +238,12 @@
the ellipsis dimensions are at the beginning of the output.
operands (Tensor): The operands to compute the Einstein sum of.
+.. note::
+
+ This function does not optimize the given expression, so a different formula for the same computation may
+ run faster or consume less memory. Projects like opt_einsum (https://optimized-einsum.readthedocs.io/en/stable/)
+ can optimize the formula for you.
+
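+ For example, the two calls below compute the same quantity, but depending on the
+ shapes involved one formulation may be significantly cheaper than the other
+ (the shapes here are only illustrative)::
+
+     >>> a = torch.randn(2, 100)
+     >>> b = torch.randn(100, 100)
+     >>> c = torch.randn(100, 2)
+     >>> r1 = torch.einsum('ij,jk,kl->il', a, b, c)
+     >>> r2 = torch.einsum('ij,jl->il', a, torch.einsum('jk,kl->jl', b, c))
+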
Examples::
>>> x = torch.randn(5)
diff --git a/torch/nn/functional.py b/torch/nn/functional.py
index b669886..0261943 100644
--- a/torch/nn/functional.py
+++ b/torch/nn/functional.py
@@ -1628,7 +1628,7 @@
# type: (Tensor, Tensor, Tensor, Optional[Tensor]) -> Tensor
r"""
Applies a bilinear transformation to the incoming data:
- :math:`y = x_1 A x_2 + b`
+ :math:`y = x_1^T A x_2 + b`
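+ For instance, the formula can be checked against an explicit contraction, where
+ ``weight`` has shape ``(out_features, in1_features, in2_features)`` (the shapes
+ below are arbitrary)::
+
+     >>> x1, x2 = torch.randn(8, 3), torch.randn(8, 4)
+     >>> weight, bias = torch.randn(5, 3, 4), torch.randn(5)
+     >>> out = torch.nn.functional.bilinear(x1, x2, weight, bias)
+     >>> ref = torch.einsum('bi,oij,bj->bo', x1, weight, x2) + bias
+     >>> torch.allclose(out, ref, atol=1e-4)
+     True
+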
Shape:
@@ -3679,7 +3679,7 @@
tensor.
.. warning::
- Currently, only 4-D output tensors (batched image-like tensors) are
+ Currently, only 3-D output tensors (unfolded batched image-like tensors) are
supported.
See :class:`torch.nn.Fold` for details
diff --git a/torch/nn/modules/linear.py b/torch/nn/modules/linear.py
index 8cdc6c0..44f7339 100644
--- a/torch/nn/modules/linear.py
+++ b/torch/nn/modules/linear.py
@@ -94,7 +94,7 @@
class Bilinear(Module):
r"""Applies a bilinear transformation to the incoming data:
- :math:`y = x_1 A x_2 + b`
+ :math:`y = x_1^T A x_2 + b`
Args:
in1_features: size of each first input sample
diff --git a/torch/nn/modules/module.py b/torch/nn/modules/module.py
index 7f9a94b..1e6e121 100644
--- a/torch/nn/modules/module.py
+++ b/torch/nn/modules/module.py
@@ -453,6 +453,15 @@
def register_backward_hook(self, hook):
r"""Registers a backward hook on the module.
+ .. warning ::
+
+ The current implementation will not have the presented behavior
+ for a complex :class:`Module` that performs many operations.
+ In some failure cases, :attr:`grad_input` and :attr:`grad_output` will only
+ contain the gradients for a subset of the inputs and outputs.
+ For such a :class:`Module`, you should use :func:`torch.Tensor.register_hook`
+ directly on a specific input or output to get the required gradients.
+
The hook will be called every time the gradients with respect to module
inputs are computed. The hook should have the following signature::
@@ -462,22 +471,14 @@
module has multiple inputs or outputs. The hook should not modify its
arguments, but it can optionally return a new gradient with respect to
input that will be used in place of :attr:`grad_input` in subsequent
- computations.
+ computations. :attr:`grad_input` will only correspond to the inputs given
+ as positional arguments.
Returns:
:class:`torch.utils.hooks.RemovableHandle`:
a handle that can be used to remove the added hook by calling
``handle.remove()``
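+ For instance, a sketch of the :func:`torch.Tensor.register_hook` alternative
+ mentioned in the warning above (the module and shapes are arbitrary)::
+
+     >>> mod = torch.nn.Linear(3, 2)
+     >>> inp = torch.randn(4, 3, requires_grad=True)
+     >>> out = mod(inp)
+     >>> h = out.register_hook(lambda grad: print(grad.shape))  # gradient w.r.t. ``out``
+     >>> out.sum().backward()
+     torch.Size([4, 2])
+     >>> h.remove()
+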
- .. warning ::
-
- The current implementation will not have the presented behavior
- for complex :class:`Module` that perform many operations.
- In some failure cases, :attr:`grad_input` and :attr:`grad_output` will only
- contain the gradients for a subset of the inputs and outputs.
- For such :class:`Module`, you should use :func:`torch.Tensor.register_hook`
- directly on a specific input or output to get the required gradients.
-
"""
handle = hooks.RemovableHandle(self._backward_hooks)
self._backward_hooks[handle.id] = hook
@@ -491,6 +492,8 @@
hook(module, input) -> None or modified input
+ The input contains only the positional arguments given to the module.
+ Keyword arguments won't be passed to the hooks; they are only passed to the ``forward``.
The hook can modify the input. The user can either return a tuple or a
single modified value in the hook. We will wrap the value into a tuple
if a single value is returned (unless that value is already a tuple).
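+ For example, a small sketch of a pre-hook that rescales the (positional) input
+ (the module and the scaling factor are arbitrary)::
+
+     >>> mod = torch.nn.Linear(2, 2)
+     >>> def pre_hook(module, input):
+     ...     # ``input`` is a tuple containing only the positional arguments
+     ...     return (2 * input[0],)
+     >>> handle = mod.register_forward_pre_hook(pre_hook)
+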
@@ -512,6 +515,8 @@
hook(module, input, output) -> None or modified output
+ The input contains only the positional arguments given to the module.
+ Keyword arguments won't be passed to the hooks; they are only passed to the ``forward``.
The hook can modify the output. It can modify the input inplace but
it will not have effect on forward since this is called after
:func:`forward` is called.
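+ For example, a small sketch of a hook that records outputs (the module and input
+ are arbitrary)::
+
+     >>> outputs = []
+     >>> mod = torch.nn.Linear(2, 2)
+     >>> handle = mod.register_forward_hook(
+     ...     lambda module, input, output: outputs.append(output.detach()))
+     >>> _ = mod(torch.randn(1, 2))
+     >>> len(outputs)
+     1
+     >>> handle.remove()
+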
diff --git a/torch/nn/modules/pooling.py b/torch/nn/modules/pooling.py
index 7a84222..32442ba 100644
--- a/torch/nn/modules/pooling.py
+++ b/torch/nn/modules/pooling.py
@@ -517,7 +517,7 @@
padding: implicit zero padding to be added on both sides
ceil_mode: when True, will use `ceil` instead of `floor` to compute the output shape
count_include_pad: when True, will include the zero-padding in the averaging calculation
- divisor_override: if specified, it will be used as divisor, otherwise attr:`kernel_size` will be used
+ divisor_override: if specified, it will be used as divisor, otherwise :attr:`kernel_size` will be used
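+ For instance, ``divisor_override`` replaces the default divisor used in the
+ averaging (the values below are arbitrary)::
+
+     >>> inp = torch.arange(16.).reshape(1, 1, 4, 4)
+     >>> torch.nn.AvgPool2d(2)(inp)[0, 0, 0, 0]                      # (0 + 1 + 4 + 5) / 4
+     tensor(2.5000)
+     >>> torch.nn.AvgPool2d(2, divisor_override=3)(inp)[0, 0, 0, 0]  # same sum divided by 3
+     tensor(3.3333)
+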
Shape:
- Input: :math:`(N, C, H_{in}, W_{in})`
@@ -588,7 +588,7 @@
padding: implicit zero padding to be added on all three sides
ceil_mode: when True, will use `ceil` instead of `floor` to compute the output shape
count_include_pad: when True, will include the zero-padding in the averaging calculation
- divisor_override: if specified, it will be used as divisor, otherwise attr:`kernel_size` will be used
+ divisor_override: if specified, it will be used as divisor, otherwise :attr:`kernel_size` will be used
Shape:
- Input: :math:`(N, C, D_{in}, H_{in}, W_{in})`