DDP Communication Hooks
=======================

A DDP communication hook is a generic interface for controlling how gradients
are communicated across workers, by overriding the vanilla allreduce in
`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`_.
A few built-in communication hooks are provided,
and users can easily apply any of them to optimize communication.
In addition, the hook interface supports user-defined communication
strategies for more advanced use cases.

How to Use a Communication Hook?
--------------------------------

To use a communication hook, the user just needs to register the hook on the
DDP model before the training loop, as shown below.

:func:`torch.nn.parallel.DistributedDataParallel.register_comm_hook`
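
For example, here is a minimal sketch of registering the built-in FP16
compression hook, assuming the process group has already been initialized and
``ddp_model`` is a hypothetical model already wrapped in
``DistributedDataParallel``::

    from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

    # ``state=None`` makes the hook fall back to the default (global)
    # process group; see the sections below for stateful hooks.
    ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)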

What Does a Communication Hook Operate On?
------------------------------------------

A communication hook provides a flexible way to allreduce gradients.
Therefore, it mainly operates on the gradients on each replica before allreduce,
which are bucketized to increase the overlap between communication and computation.
In particular, :class:`torch.distributed.GradBucket` represents a bucket of gradient tensors to be allreduced;
a sketch of a hook that uses its methods follows the list below.

.. autoclass:: torch.distributed.GradBucket

.. autofunction:: torch.distributed.GradBucket.index
.. autofunction:: torch.distributed.GradBucket.buffer
.. autofunction:: torch.distributed.GradBucket.gradients
.. autofunction:: torch.distributed.GradBucket.is_last
.. autofunction:: torch.distributed.GradBucket.set_buffer
.. autofunction:: torch.distributed.GradBucket.parameters
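
As a sketch, the hypothetical hook below logs bucket metadata via the
:class:`torch.distributed.GradBucket` methods above and then performs the
vanilla allreduce on the bucket's flattened gradients (it mirrors the built-in
allreduce behavior rather than introducing a new algorithm)::

    import torch
    import torch.distributed as dist

    def debug_allreduce_hook(
        process_group: dist.ProcessGroup, bucket: dist.GradBucket
    ) -> torch.futures.Future[torch.Tensor]:
        group = process_group if process_group is not None else dist.group.WORLD
        # Log which bucket this is and how many per-parameter gradients it holds.
        print(
            f"bucket {bucket.index()}: {len(bucket.gradients())} gradients, "
            f"last={bucket.is_last()}"
        )
        # Average the flattened gradients, then allreduce them asynchronously.
        tensor = bucket.buffer().div_(group.size())
        fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
        # The future's value is a list holding the single reduced tensor.
        return fut.then(lambda f: f.value()[0])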

Default Communication Hooks
---------------------------

Default communication hooks are simple **stateless** hooks, so the input state
in ``register_comm_hook`` is either a process group or ``None``.
The input ``bucket`` is a :class:`torch.distributed.GradBucket` object.

.. currentmodule:: torch.distributed.algorithms.ddp_comm_hooks.default_hooks
.. autofunction:: allreduce_hook
.. autofunction:: fp16_compress_hook
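
For example, here is a hypothetical sketch that passes a process group as the
hook state to restrict the allreduce to a subgroup (``ddp_model`` is assumed
to be a model already wrapped in ``DistributedDataParallel``)::

    import torch.distributed as dist
    from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

    # Passing ``None`` instead would select the default (global) process group.
    subgroup = dist.new_group(ranks=[0, 1])
    ddp_model.register_comm_hook(state=subgroup, hook=default_hooks.allreduce_hook)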

Additionally, :meth:`~fp16_compress_wrapper` is provided, which wraps another
communication hook so that FP16 compression can be combined with it.

.. autofunction:: fp16_compress_wrapper
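
For example, here is a minimal sketch that combines FP16 compression with the
PowerSGD hook described below (``ddp_model`` and the ``state`` construction
are illustrative)::

    from torch.distributed.algorithms.ddp_comm_hooks import (
        default_hooks,
        powerSGD_hook as powerSGD,
    )

    state = powerSGD.PowerSGDState(process_group=None)
    # The wrapper casts gradients to FP16 before the wrapped hook runs.
    ddp_model.register_comm_hook(
        state, default_hooks.fp16_compress_wrapper(powerSGD.powerSGD_hook)
    )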

PowerSGD Communication Hook
---------------------------

PowerSGD (`Vogels et al., NeurIPS 2019 <https://arxiv.org/abs/1905.13727>`_)
is a gradient compression algorithm, which can provide very high compression
rates and accelerate bandwidth-bound distributed training.
This algorithm needs to maintain both hyperparameters and internal state.
Therefore, the PowerSGD communication hook is a **stateful** hook,
and the user needs to provide a state object defined as below.

PowerSGD State
^^^^^^^^^^^^^^

.. currentmodule:: torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook
.. autoclass:: PowerSGDState
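
A minimal sketch of constructing a state object; the hyperparameter values
here are illustrative, not tuned recommendations::

    from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

    state = powerSGD.PowerSGDState(
        process_group=None,           # use the default (global) process group
        matrix_approximation_rank=1,  # rank of the low-rank approximation
        start_powerSGD_iter=1_000,    # run vanilla allreduce until this step
    )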

PowerSGD Hooks
^^^^^^^^^^^^^^

.. warning::
    PowerSGD typically requires extra memory of the same size as the model's
    gradients to enable error feedback, which can compensate for biased
    compressed communication and improve accuracy.

.. warning::
    PowerSGD hooks may conflict with the `Apex automatic mixed precision package <https://github.com/NVIDIA/apex>`_.
    Please use the PyTorch `native automatic mixed precision package <https://pytorch.org/docs/stable/amp.html>`_
    instead.

.. autofunction:: powerSGD_hook
.. autofunction:: batched_powerSGD_hook
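
For example, here is a sketch of registering :meth:`~powerSGD_hook` with the
``state`` object constructed above on a hypothetical ``ddp_model``::

    ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)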

Acknowledgements
----------------

Many thanks to PowerSGD paper author **Thijs Vogels** for the code review on
the PowerSGD communication hook, as well as the
`comparison experiments <https://observablehq.com/@tvogels/powersgd-benchmark>`_,
which show that the performance of the PowerSGD communication hook is on par with
the implementation in the original `paper <https://arxiv.org/abs/1905.13727>`_.