Quickstart
===========

To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

    torchrun
        --nnodes=NUM_NODES
        --nproc_per_node=TRAINERS_PER_NODE
        --max_restarts=NUM_ALLOWED_FAILURES
        --rdzv_id=JOB_ID
        --rdzv_backend=c10d
        --rdzv_endpoint=HOST_NODE_ADDR
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

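For instance, a hypothetical fault-tolerant run on 4 nodes with 8 trainers
each, tolerating up to 3 restarts (all values below, including the job id,
endpoint, script name, and script arguments, are placeholders), could look
like:

.. code-block:: bash

    torchrun \
        --nnodes=4 \
        --nproc_per_node=8 \
        --max_restarts=3 \
        --rdzv_id=456 \
        --rdzv_backend=c10d \
        --rdzv_endpoint=node1.example.com:29400 \
        train.py --batch-size 32
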
To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    torchrun
        --nnodes=MIN_SIZE:MAX_SIZE
        --nproc_per_node=TRAINERS_PER_NODE
        --max_restarts=NUM_ALLOWED_FAILURES_OR_MEMBERSHIP_CHANGES
        --rdzv_id=JOB_ID
        --rdzv_backend=c10d
        --rdzv_endpoint=HOST_NODE_ADDR
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

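The invocation differs from the fault-tolerant example only in ``--nnodes``:
a hypothetical job that can shrink to 1 node and grow to 4 (values are again
placeholders) could look like:

.. code-block:: bash

    torchrun \
        --nnodes=1:4 \
        --nproc_per_node=8 \
        --max_restarts=3 \
        --rdzv_id=456 \
        --rdzv_backend=c10d \
        --rdzv_endpoint=node1.example.com:29400 \
        train.py --batch-size 32
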
.. note::
   TorchElastic models failures as membership changes. When a node fails,
   this is treated as a "scale down" event. When the failed node is replaced
   by the scheduler, it is a "scale up" event. Hence, for both fault-tolerant
   and elastic jobs, ``--max_restarts`` is used to control the total number
   of restarts before giving up, regardless of whether the restart was caused
   by a failure or by a scaling event.

``HOST_NODE_ADDR``, in the form ``<host>[:<port>]`` (e.g. ``node1.example.com:29400``),
specifies the node and the port on which the C10d rendezvous backend should be
instantiated and hosted. It can be any node in your training cluster, but
ideally you should pick a node with high bandwidth.

.. note::
   If no port number is specified, ``HOST_NODE_ADDR`` defaults to port 29400.

.. note::
   The ``--standalone`` option can be passed to launch a single node job with
   a sidecar rendezvous backend. You don't have to pass ``--rdzv_id``,
   ``--rdzv_endpoint``, and ``--rdzv_backend`` when the ``--standalone``
   option is used.

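For example, a hypothetical single-node run with 4 workers (the script name
is a placeholder) reduces to:

.. code-block:: bash

    torchrun \
        --standalone \
        --nnodes=1 \
        --nproc_per_node=4 \
        train.py
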
.. note::
   Learn more about writing your distributed training script
   `here <train_script.html>`_.
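
As a rough sketch of what such a script can look like, the following minimal
program (the file name and the choice of backend are illustrative
assumptions) initializes the default process group from the environment
variables that ``torchrun`` sets for every worker:

.. code-block:: python

    # minimal_train.py -- a minimal sketch of a torchrun-launched script
    import os

    import torch.distributed as dist


    def main():
        # torchrun exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and
        # LOCAL_RANK, so the default env:// initialization needs no extra
        # arguments beyond the backend.
        dist.init_process_group(backend="gloo")  # "nccl" for GPU training
        rank = dist.get_rank()
        local_rank = int(os.environ["LOCAL_RANK"])
        print(f"worker {rank} (local rank {local_rank}) is up")
        # ... build the model, wrap it in DistributedDataParallel,
        # run the training loop, checkpoint periodically ...
        dist.destroy_process_group()


    if __name__ == "__main__":
        main()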

If ``torchrun`` does not meet your requirements, you may use our APIs directly
for more powerful customization. Start by taking a look at the
`elastic agent <agent.html>`_ API.
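
As a sketch of what direct use of the launcher APIs can look like, the
following assumes the ``LaunchConfig`` and ``elastic_launch`` helpers from
``torch.distributed.launcher.api``; the configuration values are placeholders,
and the exact fields should be checked against the API reference:

.. code-block:: python

    # sketch: launching an elastic job programmatically instead of via torchrun
    from torch.distributed.launcher.api import LaunchConfig, elastic_launch


    def trainer(msg):
        # per-worker entrypoint; its return value is collected per local rank
        return f"worker done: {msg}"


    if __name__ == "__main__":
        config = LaunchConfig(
            min_nodes=1,
            max_nodes=1,
            nproc_per_node=4,
            run_id="my_job_id",               # analogous to --rdzv_id
            rdzv_backend="c10d",
            rdzv_endpoint="localhost:29400",  # analogous to --rdzv_endpoint
            max_restarts=3,
        )
        # Calling the returned launcher starts the local workers and yields a
        # dict mapping each local rank to its entrypoint's return value.
        results = elastic_launch(config, trainer)("hello")
        print(results)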