Run on an on-prem cluster (intermediate)
Run with TorchDistributed
Once the script is set up as described in :ref:`Training Script Setup <training_script_setup>`, you can run the command below across your nodes to start multi-node training.
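For reference, a multi-node-ready training script might look like the sketch below. This is a minimal illustration only: PyTorch Lightning is assumed to be installed, and the model, random dataset, and exact Trainer arguments (which vary across Lightning versions) are hypothetical placeholders rather than part of the setup described above.

    # train.py -- minimal sketch of a script ready for multi-node DDP.
    # The model, dataset, and Trainer arguments are illustrative assumptions,
    # not prescribed by this guide.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl


    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return nn.functional.cross_entropy(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    if __name__ == "__main__":
        dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
        # num_nodes must match the --nnodes value passed to torch.distributed.run;
        # devices is the number of processes (typically one per GPU) on each node.
        trainer = pl.Trainer(num_nodes=2, accelerator="gpu", devices=8, strategy="ddp")
        trainer.fit(LitModel(), DataLoader(dataset, batch_size=32))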
As with any custom cluster, you have to ensure that there is network connectivity between the nodes, with firewall rules that allow traffic on the specified MASTER_PORT.
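A quick way to sanity-check that connectivity (assuming netcat is available on your nodes; the port 29500 below is only an example) is to listen on the prospective MASTER_PORT on the main node and probe it from every other node:

    # On the main node: listen on the port you intend to use as MASTER_PORT
    # (some netcat variants need "nc -l -p 29500" instead)
    nc -l 29500

    # On every other node: verify the port is reachable
    nc -zv <MASTER_ADDR> 29500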
Finally, you’ll need to decide which node you’d like to be the main node (MASTER_ADDR) and the rank of each node (NODE_RANK): NODE_RANK is 0 for the first node, 1 for the second, and so on.
Run the following command on each node, with the appropriate variables set:
    # --nnodes is the number of nodes you'd like to run with
    python -m torch.distributed.run \
        --nnodes=2 \
        --master_addr <MASTER_ADDR> \
        --master_port <MASTER_PORT> \
        --node_rank <NODE_RANK> \
        train.py (--arg1 ... train script args...)
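For concreteness, here is what the launch could look like on a hypothetical two-node cluster whose main node is reachable at 10.10.10.1, using the illustrative port 29500 (both values are made up for this example):

    # On the main node (NODE_RANK 0):
    python -m torch.distributed.run --nnodes=2 --master_addr 10.10.10.1 \
        --master_port 29500 --node_rank 0 train.py

    # On the second node (NODE_RANK 1):
    python -m torch.distributed.run --nnodes=2 --master_addr 10.10.10.1 \
        --master_port 29500 --node_rank 1 train.py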
torch.distributed.run assumes that you’d like to spawn a process per GPU if GPU devices are found on the node. This can be adjusted with the --nproc_per_node flag.
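For example, to launch exactly four processes per node instead of one per GPU (all other values as before):

    python -m torch.distributed.run --nnodes=2 --nproc_per_node=4 \
        --master_addr <MASTER_ADDR> --master_port <MASTER_PORT> \
        --node_rank <NODE_RANK> train.py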
Setting up a cluster for distributed training is not trivial. Lightning offers lightning-grid, which allows you to configure a cluster easily and run experiments via the CLI and web UI.
Try it out for free today.