| torch.optim |
| =================================== |
| |
| .. automodule:: torch.optim |
| |
| How to use an optimizer |
| ----------------------- |
| |
To use :mod:`torch.optim` you have to construct an optimizer object that will hold
the current state and will update the parameters based on the computed gradients.
| |
| Constructing it |
| ^^^^^^^^^^^^^^^ |
| |
| To construct an :class:`Optimizer` you have to give it an iterable containing the |
| parameters (all should be :class:`~torch.autograd.Variable` s) to optimize. Then, |
| you can specify optimizer-specific options such as the learning rate, weight decay, etc. |
| |
| .. note:: |
| |
If you need to move a model to GPU via ``.cuda()``, please do so before
constructing optimizers for it. Parameters of a model after ``.cuda()`` will
be different objects from those before the call.
| |
| In general, you should make sure that optimized parameters live in |
| consistent locations when optimizers are constructed and used. |
| |
| Example:: |
| |
| optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9) |
| optimizer = optim.Adam([var1, var2], lr=0.0001) |
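
For GPU training, the note above implies the following ordering (a minimal
sketch; ``Net`` and the hyperparameter values are placeholders)::

    model = Net()
    model.cuda()  # move the parameters to the GPU first
    # construct the optimizer only afterwards, so it references the GPU parameters
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)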
| |
| Per-parameter options |
| ^^^^^^^^^^^^^^^^^^^^^ |
| |
:class:`Optimizer` s also support specifying per-parameter options. To do this, instead
of passing an iterable of :class:`~torch.autograd.Variable` s, pass in an iterable of
:class:`dict` s. Each of them will define a separate parameter group, and should contain
a ``params`` key holding the list of parameters that belong to it. Other keys
should match the keyword arguments accepted by the optimizers, and will be used
as optimization options for this group.
| |
| .. note:: |
| |
You can still pass options as keyword arguments. They will be used as
defaults in the groups that didn't override them. This is useful when you
only want to vary a single option, while keeping all others consistent
between parameter groups.
| |
| |
| For example, this is very useful when one wants to specify per-layer learning rates:: |
| |
| optim.SGD([ |
| {'params': model.base.parameters()}, |
| {'params': model.classifier.parameters(), 'lr': 1e-3} |
| ], lr=1e-2, momentum=0.9) |
| |
| This means that ``model.base``'s parameters will use the default learning rate of ``1e-2``, |
| ``model.classifier``'s parameters will use a learning rate of ``1e-3``, and a momentum of |
| ``0.9`` will be used for all parameters. |
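
If you want to check how the defaults were merged, the resolved options for each
group can be inspected via ``optimizer.param_groups`` (a small sketch, reusing
the groups from the example above)::

    optimizer = optim.SGD([
        {'params': model.base.parameters()},
        {'params': model.classifier.parameters(), 'lr': 1e-3}
    ], lr=1e-2, momentum=0.9)

    # prints 0.01 0.9 for the first group and 0.001 0.9 for the second
    for group in optimizer.param_groups:
        print(group['lr'], group['momentum'])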
| |
| Taking an optimization step |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
All optimizers implement a :func:`~Optimizer.step` method that updates the
parameters. It can be used in two ways:
| |
| ``optimizer.step()`` |
| ~~~~~~~~~~~~~~~~~~~~ |
| |
This is a simplified version supported by most optimizers. The function can be
called once the gradients are computed, e.g. using
:func:`~torch.autograd.Variable.backward`.
| |
| Example:: |
| |
| for input, target in dataset: |
| optimizer.zero_grad() |
| output = model(input) |
| loss = loss_fn(output, target) |
| loss.backward() |
| optimizer.step() |
| |
| ``optimizer.step(closure)`` |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Some optimization algorithms such as Conjugate Gradient and LBFGS need to |
| reevaluate the function multiple times, so you have to pass in a closure that |
| allows them to recompute your model. The closure should clear the gradients, |
| compute the loss, and return it. |
| |
| Example:: |
| |
| for input, target in dataset: |
| def closure(): |
| optimizer.zero_grad() |
| output = model(input) |
| loss = loss_fn(output, target) |
| loss.backward() |
| return loss |
| optimizer.step(closure) |
| |
| Algorithms |
| ---------- |
| |
| .. autoclass:: Optimizer |
| :members: |
| .. autoclass:: Adadelta |
| :members: |
| .. autoclass:: Adagrad |
| :members: |
| .. autoclass:: Adam |
| :members: |
| .. autoclass:: AdamW |
| :members: |
| .. autoclass:: SparseAdam |
| :members: |
| .. autoclass:: Adamax |
| :members: |
| .. autoclass:: ASGD |
| :members: |
| .. autoclass:: LBFGS |
| :members: |
| .. autoclass:: RMSprop |
| :members: |
| .. autoclass:: Rprop |
| :members: |
| .. autoclass:: SGD |
| :members: |
| |
How to adjust the learning rate
-------------------------------
| |
:mod:`torch.optim.lr_scheduler` provides several methods to adjust the learning
rate based on the number of epochs. :class:`torch.optim.lr_scheduler.ReduceLROnPlateau`
allows dynamic learning rate reduction based on some validation measurements.
| |
Learning rate scheduling should be applied after the optimizer's update; e.g., you
should write your code this way:
| |
| >>> scheduler = ... |
| >>> for epoch in range(100): |
| >>> train(...) |
| >>> validate(...) |
| >>> scheduler.step() |
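
For :class:`torch.optim.lr_scheduler.ReduceLROnPlateau`, the monitored metric is
passed to ``scheduler.step()`` (a sketch; ``validate`` is assumed to return the
validation loss):

>>> optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')
>>> for epoch in range(100):
>>>     train(...)
>>>     val_loss = validate(...)
>>>     # step is called after validation, passing the metric being monitored
>>>     scheduler.step(val_loss)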
| |
| .. warning:: |
| Prior to PyTorch 1.1.0, the learning rate scheduler was expected to be called before |
| the optimizer's update; 1.1.0 changed this behavior in a BC-breaking way. If you use |
| the learning rate scheduler (calling ``scheduler.step()``) before the optimizer's update |
| (calling ``optimizer.step()``), this will skip the first value of the learning rate schedule. |
| If you are unable to reproduce results after upgrading to PyTorch 1.1.0, please check |
| if you are calling ``scheduler.step()`` at the wrong time. |
| |
| |
| .. autoclass:: torch.optim.lr_scheduler.LambdaLR |
| :members: |
| .. autoclass:: torch.optim.lr_scheduler.StepLR |
| :members: |
| .. autoclass:: torch.optim.lr_scheduler.MultiStepLR |
| :members: |
| .. autoclass:: torch.optim.lr_scheduler.ExponentialLR |
| :members: |
| .. autoclass:: torch.optim.lr_scheduler.CosineAnnealingLR |
| :members: |
| .. autoclass:: torch.optim.lr_scheduler.ReduceLROnPlateau |
| :members: |
| .. autoclass:: torch.optim.lr_scheduler.CyclicLR |
| :members: |
| .. autoclass:: torch.optim.lr_scheduler.OneCycleLR |
| :members: |
| .. autoclass:: torch.optim.lr_scheduler.CosineAnnealingWarmRestarts |
| :members: |