Multiprocessing best practices
==============================

:mod:`torch.multiprocessing` is a drop-in replacement for Python's
:mod:`python:multiprocessing` module. It supports the exact same operations,
but extends it, so that all tensors sent through a
:class:`python:multiprocessing.Queue` have their data moved into shared
memory, and only a handle is sent to the other process.
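
In practice this means that a tensor put into a queue by one process can be
modified in place by another, and the change is visible to the sender, since
both processes refer to the same shared storage. A minimal sketch (with
made-up tensor shapes and function names) of this behaviour::

    import torch
    import torch.multiprocessing as mp

    def worker(queue):
        # The received tensor is backed by the same shared-memory storage as
        # the one in the parent process - no data was copied.
        tensor = queue.get()
        tensor.add_(1)  # in-place update, visible to the parent

    if __name__ == '__main__':
        queue = mp.Queue()
        tensor = torch.zeros(3)  # moved into shared memory when put()
        process = mp.Process(target=worker, args=(queue,))
        process.start()
        queue.put(tensor)
        process.join()
        print(tensor)  # reflects the child's in-place update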

.. note::

    When a :class:`~torch.autograd.Variable` is sent to another process, both
    the :attr:`Variable.data` and :attr:`Variable.grad.data` are going to be
    shared.

This allows implementing various training methods, like Hogwild, A3C, or any
others that require asynchronous operation.

Sharing CUDA tensors
--------------------

Sharing CUDA tensors between processes is supported only in Python 3, using
the ``spawn`` or ``forkserver`` start methods. :mod:`python:multiprocessing` in
Python 2 can only create subprocesses using ``fork``, which is not supported
by the CUDA runtime.

.. warning::

    The CUDA API requires that allocations exported to other processes remain
    valid for as long as those processes use them. You should take care that
    shared CUDA tensors don't go out of scope while the receiving process
    still needs them. This shouldn't be a problem for sharing model
    parameters, but passing other kinds of data should be done with care.
    Note that this restriction doesn't apply to shared CPU memory.
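
A minimal sketch of sharing a CUDA tensor, assuming a machine with at least
one GPU and using the ``spawn`` start method (the function and variable names
below are made up)::

    import torch
    import torch.multiprocessing as mp

    def consumer(queue):
        # Receives a handle to the producer's CUDA allocation, not a copy.
        cuda_tensor = queue.get()
        print(cuda_tensor.sum())

    if __name__ == '__main__':
        # ``spawn`` (or ``forkserver``) is required for sharing CUDA tensors.
        mp.set_start_method('spawn')
        queue = mp.Queue()
        process = mp.Process(target=consumer, args=(queue,))
        process.start()
        tensor = torch.ones(4).cuda()
        queue.put(tensor)
        # Keep ``tensor`` referenced until the consumer is done with it.
        process.join()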


Best practices and tips
-----------------------

Avoiding and fighting deadlocks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are a lot of things that can go wrong when a new process is spawned, with
the most common cause of deadlocks being background threads. If there's any
thread that holds a lock or imports a module, and ``fork`` is called, it's very
likely that the subprocess will be in a corrupted state and will deadlock or
fail in a different way. Note that even if your own code doesn't start any
threads, Python's built-in libraries do - no need to look further than
:mod:`python:multiprocessing`. :class:`python:multiprocessing.Queue` is
actually a very complex class that spawns multiple threads used to serialize,
send and receive objects, and they can cause the aforementioned problems too.
If you find yourself in such a situation, try using a
:class:`~python:multiprocessing.queues.SimpleQueue`, which doesn't use any
additional threads.
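
As a rough sketch, swapping the queue type is usually a one-line change; the
worker function and tensor shape below are made up for illustration::

    import torch
    import torch.multiprocessing as mp

    def worker(queue):
        # SimpleQueue has no background feeder thread, so there is less
        # hidden state that could be left corrupted in a forked child.
        queue.put(torch.zeros(2, 2))

    if __name__ == '__main__':
        queue = mp.SimpleQueue()
        process = mp.Process(target=worker, args=(queue,))
        process.start()
        print(queue.get())
        process.join()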

We're trying our best to make it easy for you and ensure these deadlocks don't
happen, but some things are out of our control. If you have any issues you
can't cope with for a while, try reaching out on the forums, and we'll see if
it's an issue we can fix.

Reuse buffers passed through a Queue
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Remember that each time you put a :class:`~torch.Tensor` into a
:class:`python:multiprocessing.Queue`, it has to be moved into shared memory.
If it's already shared, this is a no-op; otherwise it will incur an additional
memory copy that can slow down the whole process. Even if you have a pool of
processes sending data to a single one, make it send the buffers back - this
is nearly free and will let you avoid a copy when sending the next batch.
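
A rough sketch of this pattern, with made-up queue names, tensor shapes and
iteration counts, could look as follows::

    import torch
    import torch.multiprocessing as mp

    def producer(data_queue, free_queue, num_batches):
        for _ in range(num_batches):
            # Reuse a buffer returned by the consumer if one is available;
            # it already lives in shared memory, so re-sending it is cheap.
            if free_queue.empty():
                buffer = torch.zeros(64, 64)
            else:
                buffer = free_queue.get()
            buffer.normal_()            # write the next batch in place
            data_queue.put(buffer)

    def consumer(data_queue, free_queue, num_batches):
        for _ in range(num_batches):
            batch = data_queue.get()
            batch.sum()                 # stand-in for real work
            free_queue.put(batch)       # hand the shared buffer back

    if __name__ == '__main__':
        data_queue, free_queue = mp.Queue(), mp.Queue()
        args = (data_queue, free_queue, 10)
        workers = [mp.Process(target=producer, args=args),
                   mp.Process(target=consumer, args=args)]
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()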

Asynchronous multiprocess training (e.g. Hogwild)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using :mod:`torch.multiprocessing`, it is possible to train a model
asynchronously, with parameters either shared all the time, or periodically
synchronized. In the first case, we recommend sending over the whole model
object, while in the latter, we advise sending only the
:meth:`~torch.nn.Module.state_dict`.
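
For the periodically synchronized case, a rough sketch, reusing the
hypothetical ``MyModel`` from the Hogwild example below, could send the
weights like this::

    import torch.multiprocessing as mp
    from model import MyModel

    def worker(queue):
        model = MyModel()
        # Periodic synchronization: load a plain state_dict received over
        # the queue instead of sharing the parameters' memory directly.
        model.load_state_dict(queue.get())
        # ... train locally, possibly sending updated weights back later ...

    if __name__ == '__main__':
        queue = mp.Queue()
        process = mp.Process(target=worker, args=(queue,))
        process.start()
        queue.put(MyModel().state_dict())
        process.join()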

We recommend using :class:`python:multiprocessing.Queue` for passing all kinds
of PyTorch objects between processes. It is possible to e.g. inherit tensors
and storages that are already in shared memory, when using the ``fork`` start
method; however, this is very bug-prone and should be used with care, and only
by advanced users. Queues, even though they're sometimes a less elegant
solution, will work properly in all cases.

.. warning::

    You should be careful about global statements that are not guarded by an
    ``if __name__ == '__main__'`` check. If a start method other than ``fork``
    is used, they will be executed in all subprocesses.
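
As an illustration, the difference looks like this (the tensor below is just a
placeholder)::

    import torch
    import torch.multiprocessing as mp

    # Unguarded: runs again in every subprocess started with ``spawn`` or
    # ``forkserver``, because the main module is re-imported there.
    big_tensor = torch.randn(1000, 1000)

    def work():
        pass

    if __name__ == '__main__':
        # Guarded: runs only in the main process, regardless of start method.
        process = mp.Process(target=work)
        process.start()
        process.join()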

Hogwild
~~~~~~~

A concrete Hogwild implementation can be found in the `examples repository`__,
but to showcase the overall structure of the code, there's also a minimal
example below::

    import torch.multiprocessing as mp
    from model import MyModel

    def train(model):
        # This for loop will break sharing of gradient buffers. It's not
        # necessary, but it reduces the contention, and has a small memory cost
        # (equal to the total size of parameters).
        for param in model.parameters():
            # Gradients may not be allocated yet, hence the guard.
            if param.grad is not None:
                param.grad.data = param.grad.data.clone()
        # Construct data_loader, optimizer, etc.
        for data, labels in data_loader:
            optimizer.zero_grad()
            loss_fn(model(data), labels).backward()
            optimizer.step()  # This will update the shared parameters

    if __name__ == '__main__':
        num_processes = 4
        model = MyModel()
        # NOTE: this is required for the ``fork`` method to work
        model.share_memory()
        processes = []
        for rank in range(num_processes):
            p = mp.Process(target=train, args=(model,))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()

.. __: https://github.com/pytorch/examples/tree/master/mnist_hogwild