|  | r""" | 
|  | ``torch.distributed.launch`` is a module that spawns up multiple distributed | 
|  | training processes on each of the training nodes. | 
|  |  | 
|  | .. warning:: | 
|  |  | 
|  | This module is going to be deprecated in favor of :ref:`torchrun <launcher-api>`. | 
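For reference, the single-node launch command shown below maps roughly onto
``torchrun`` as follows (a sketch, assuming your training script reads the
local rank from the ``LOCAL_RANK`` environment variable rather than from a
``--local_rank`` argument):

::

    torchrun --nproc_per_node=NUM_GPUS_YOU_HAVE
             YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
             arguments of your training script)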

The utility can be used for single-node distributed training, in which one or
more processes per node will be spawned. It can be used for either CPU or GPU
training. When used for GPU training, each distributed process operates on a
single GPU, which can significantly improve single-node training performance.
It can also be used for multi-node distributed training, where spawning
multiple processes on each node likewise improves performance. This is
especially beneficial for systems with multiple InfiniBand interfaces that
have direct-GPU support, since all of them can be utilized for aggregated
communication bandwidth.

For both single-node and multi-node distributed training, this utility will
launch the given number of processes per node (``--nproc_per_node``). If used
for GPU training, this number needs to be less than or equal to the number of
GPUs on the current system, and each process will operate on a single GPU,
from *GPU 0* to *GPU (nproc_per_node - 1)*.

**How to use this module:**

1. Single-Node multi-process distributed training

::

    >>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
               YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
               arguments of your training script)

2. Multi-Node multi-process distributed training: (e.g. two nodes)


Node 1: *(IP: 192.168.1.1, and has a free port: 1234)*

::

    >>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
               --nnodes=2 --node_rank=0 --master_addr="192.168.1.1"
               --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
               and all other arguments of your training script)

Node 2:

::

    >>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
               --nnodes=2 --node_rank=1 --master_addr="192.168.1.1"
               --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
               and all other arguments of your training script)

3. To look up what optional arguments this module offers:

::

    >>> python -m torch.distributed.launch --help


**Important Notices:**

1. This utility and multi-process distributed (single-node or
multi-node) GPU training currently only achieves the best performance using
the NCCL distributed backend. Thus the NCCL backend is the recommended backend
to use for GPU training.

2. In your training program, you must parse the command-line argument:
``--local_rank=LOCAL_PROCESS_RANK``, which will be provided by this module.
If your training program uses GPUs, you should ensure that your code only
runs on the GPU device of LOCAL_PROCESS_RANK. This can be done by:

Parsing the local_rank argument

::

    >>> import argparse
    >>> parser = argparse.ArgumentParser()
    >>> parser.add_argument("--local_rank", type=int)
    >>> args = parser.parse_args()

Set your device to the local rank using either

::

    >>> torch.cuda.set_device(args.local_rank)  # before your code runs

or

::

    >>> with torch.cuda.device(args.local_rank):
    >>>     # your code to run
    >>>     ...
3. In your training program, you are supposed to call the following function
at the beginning to start the distributed backend. It is strongly recommended
that you use ``init_method='env://'``. Other init methods (e.g. ``tcp://``)
may work, but ``env://`` is the one that is officially supported by this
module.

::

    torch.distributed.init_process_group(backend='YOUR BACKEND',
                                         init_method='env://')

4. In your training program, you can either use regular distributed functions
or use the :class:`torch.nn.parallel.DistributedDataParallel` module. If your
training program uses GPUs for training and you would like to use
:class:`torch.nn.parallel.DistributedDataParallel`, here is how to
configure it.

::

    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[args.local_rank],
                                                      output_device=args.local_rank)

Please ensure that the ``device_ids`` argument is set to the only GPU device
id that your code will be operating on. This is generally the local rank of
the process. In other words, ``device_ids`` needs to be ``[args.local_rank]``,
and ``output_device`` needs to be ``args.local_rank`` in order to use this
utility.
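
Putting items 2-4 together, a minimal training-script skeleton might look
like the following (a sketch only; ``MyModel`` is a placeholder for your own
model, and the NCCL backend is assumed per item 1):

::

    >>> import argparse
    >>> import torch
    >>> import torch.distributed as dist
    >>>
    >>> parser = argparse.ArgumentParser()
    >>> parser.add_argument("--local_rank", type=int)
    >>> args = parser.parse_args()
    >>>
    >>> torch.cuda.set_device(args.local_rank)
    >>> dist.init_process_group(backend="nccl", init_method="env://")
    >>>
    >>> model = MyModel().cuda(args.local_rank)
    >>> model = torch.nn.parallel.DistributedDataParallel(
    >>>     model, device_ids=[args.local_rank], output_device=args.local_rank
    >>> )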

5. Another way to pass ``local_rank`` to the subprocesses is via the
``LOCAL_RANK`` environment variable. This behavior is enabled when you launch
the script with ``--use_env=True``. You must adjust the subprocess example
above to replace ``args.local_rank`` with ``os.environ['LOCAL_RANK']``; the
launcher will not pass ``--local_rank`` when you specify this flag.
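
For example (a sketch, adapting the snippet from item 2 above to read the
rank from the environment instead of from ``--local_rank``):

::

    >>> import os
    >>> local_rank = int(os.environ["LOCAL_RANK"])
    >>> torch.cuda.set_device(local_rank)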

.. warning::

    ``local_rank`` is NOT globally unique: it is only unique among the
    processes running on a single machine. Thus, don't use it to decide
    whether you should, e.g., write to a networked filesystem. See
    https://github.com/pytorch/pytorch/issues/12042 for an example of
    how things can go wrong if you don't handle this correctly.
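
For instance, a common pattern is to gate shared-filesystem writes on the
*global* rank instead (a minimal sketch; it assumes the process group has
already been initialized and ``model`` is your DDP-wrapped model):

::

    >>> import torch.distributed as dist
    >>> if dist.get_rank() == 0:
    >>>     # only the globally first process writes the checkpoint
    >>>     torch.save(model.state_dict(), "checkpoint.pt")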


"""

import logging
import warnings

from torch.distributed.run import get_args_parser, run


logger = logging.getLogger(__name__)


def parse_args(args):
    parser = get_args_parser()
    parser.add_argument(
        "--use_env",
        default=False,
        action="store_true",
        help="Use environment variable to pass "
        "'local rank'. For legacy reasons, the default value is False. "
        "If set to True, the script will not pass "
        "--local_rank as argument, and will instead set LOCAL_RANK.",
    )
    return parser.parse_args(args)


def launch(args):
    if args.no_python and not args.use_env:
        raise ValueError(
            "When using the '--no_python' flag,"
            " you must also set the '--use_env' flag."
        )
    run(args)


def main(args=None):
    warnings.warn(
        "The module torch.distributed.launch is deprecated\n"
        "and will be removed in a future release. Use torchrun.\n"
        "Note that --use_env is set by default in torchrun.\n"
        "If your script expects the `--local_rank` argument to be set, please\n"
        "change it to read from `os.environ['LOCAL_RANK']` instead. See\n"
        "https://pytorch.org/docs/stable/distributed.html#launch-utility for\n"
        "further instructions.\n",
        FutureWarning,
    )
    args = parse_args(args)
    launch(args)
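

# Example (a sketch; the script name and argument values below are
# hypothetical): the launcher can also be invoked programmatically by passing
# the command line as a list, mirroring
# ``python -m torch.distributed.launch --nproc_per_node=2 --use_env trainer.py``:
#
#   from torch.distributed.launch import main
#   main(["--nproc_per_node=2", "--use_env", "trainer.py"])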


if __name__ == "__main__":
    main()