DDP Communication Hooks
=======================
The DDP communication hook is a generic interface to control how gradients
are communicated across workers, by overriding the vanilla allreduce in
`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`_.
A few built-in communication hooks are provided, and users can easily apply
any of them to optimize communication. The hook interface also supports
user-defined communication strategies for more advanced use cases.

.. warning ::
    DDP communication hook is experimental and subject to change.

.. warning ::
    DDP communication hooks only support the single-process single-device
    mode on the NCCL backend.

How to Use a Communication Hook?
--------------------------------
To use a communication hook, the user just needs to let the DDP model register
the hook before the training loop via
:func:`torch.nn.parallel.DistributedDataParallel.register_comm_hook`,
as in the example below.
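
For example, one of the built-in hooks from
``torch.distributed.algorithms.ddp_comm_hooks.default_hooks`` can be registered
right after the DDP model is constructed. The following is a minimal sketch,
not a complete training script: it assumes one process per GPU launched with
``torchrun`` (so that ``LOCAL_RANK`` is set in the environment), and the toy
``nn.Linear`` model is only a placeholder.

.. code-block:: python

    import os

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
    from torch.nn.parallel import DistributedDataParallel as DDP

    # One process per GPU; DDP communication hooks require the NCCL backend.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # Register the hook before the training loop. For a stateless hook, the
    # state is a process group or None (None means the default group).
    ddp_model.register_comm_hook(state=None, hook=default_hooks.allreduce_hook)

``allreduce_hook`` simply reproduces the vanilla allreduce behavior, so it is
a convenient starting point and a template for writing custom hooks.
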
Default Communication Hooks
---------------------------
Default communication hooks are simple **stateless** hooks, so the input state
in ``register_comm_hook`` is either a process group or ``None``.

.. automodule:: torch.distributed.algorithms.ddp_comm_hooks.default_hooks
    :members:
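
As an illustration of the state argument, the sketch below registers
``fp16_compress_hook``, which casts gradients to half precision before the
allreduce and casts the result back afterwards. It assumes ``ddp_model`` is a
``DistributedDataParallel`` instance constructed as in the earlier example;
the default world group is passed explicitly as the state, but ``None`` would
work as well.

.. code-block:: python

    import torch.distributed as dist
    from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

    # The state of a default hook is a process group (or None for the default
    # group); the allreduce of the compressed gradients runs on that group.
    process_group = dist.group.WORLD
    ddp_model.register_comm_hook(state=process_group,
                                 hook=default_hooks.fp16_compress_hook)
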
PowerSGD Communication Hook
---------------------------
PowerSGD (`Vogels et al., NeurIPS 2019 <https://arxiv.org/abs/1905.13727>`_)
is a gradient compression algorithm, which can provide very high compression
rates and accelerate bandwidth-bound distributed training.
This algorithm needs to maintain both some hyperparameters and internal
state. Therefore, the PowerSGD communication hook is a **stateful** hook,
and the user needs to provide a state object defined as below.

PowerSGD State
^^^^^^^^^^^^^^^^
.. currentmodule:: torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook

.. autoclass:: PowerSGDState
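
For instance, a state object might be created as in the sketch below. Only
``process_group`` and ``matrix_approximation_rank`` are shown; the remaining
constructor arguments (error feedback, warm start, etc.) and their defaults
are documented above and may vary across releases.

.. code-block:: python

    from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

    # None means the default (world) process group; a higher approximation
    # rank trades compression rate for approximation quality.
    state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1)
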
PowerSGD Hooks
^^^^^^^^^^^^^^^^
.. warning ::
    PowerSGD typically requires extra memory of the same size as the model's
    gradients to enable error feedback, which can compensate for biased
    compressed communication and improve accuracy.

.. warning ::
    The current implementation may cause gradient overflow for FP16 input.

.. autofunction:: powerSGD_hook

.. autofunction:: batched_powerSGD_hook
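
Putting the pieces together, a stateful hook is registered the same way as a
stateless one, except that a ``PowerSGDState`` object is passed as the state.
A minimal sketch, again assuming ``ddp_model`` is an already constructed
``DistributedDataParallel`` instance on the NCCL backend:

.. code-block:: python

    from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

    state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1)

    # powerSGD_hook compresses each gradient tensor in a bucket with a low-rank
    # approximation; batched_powerSGD_hook instead flattens the whole bucket
    # and compresses it at once, which is faster but typically less accurate.
    ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)
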
Acknowledgements
----------------
Many thanks to PowerSGD paper author **Thijs Vogels** for his code review of
the PowerSGD communication hook, as well as for the
`comparison experiments <https://observablehq.com/@tvogels/powersgd-benchmark>`_,
which show that the performance of the PowerSGD communication hook is on par
with the implementation in the original `paper <https://arxiv.org/abs/1905.13727>`_.