Shortcuts

Computing cluster

With Lightning it is easy to run your training script on a computing cluster without almost any modifications to the script. This guide shows how to run a training job on a general purpose cluster.

Also, check Accelerators as a new and more general approach to a cluster setup.


Cluster setup

To setup a multi-node computing cluster you need:

  1. Multiple computers with PyTorch Lightning installed

  2. A network connectivity between them with firewall rules that allow traffic flow on a specified MASTER_PORT.

  3. Defined environment variables on each node required for the PyTorch Lightning multi-node distributed training

PyTorch Lightning follows the design of PyTorch distributed communication package. and requires the following environment variables to be defined on each node:

  • MASTER_PORT - required; has to be a free port on machine with NODE_RANK 0

  • MASTER_ADDR - required (except for NODE_RANK 0); address of NODE_RANK 0 node

  • WORLD_SIZE - required; how many nodes are in the cluster

  • NODE_RANK - required; id of the node in the cluster

Training script design

To train a model using multiple nodes, do the following:

  1. Design your LightningModule (no need to add anything specific here).

  2. Enable DDP in the trainer

    # train on 32 GPUs across 4 nodes
    trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp')
    

Submit a job to the cluster

To submit a training job to the cluster you need to run the same training script on each node of the cluster. This means that you need to:

  1. Copy all third-party libraries to each node (usually means - distribute requirements.txt file and install it).

  2. Copy all your import dependencies and the script itself to each node.

  3. Run the script on each node.

Read the Docs v: stable
Versions
latest
stable
1.2.2
1.2.1
1.2.0
1.1.8
1.1.7
1.1.6
1.1.5
1.1.4
1.1.3
1.1.2
1.1.1
1.1.0
1.0.8
1.0.7
1.0.6
1.0.5
1.0.4
1.0.3
1.0.2
1.0.1
1.0.0
0.10.0
0.9.0
0.8.5
0.8.4
0.8.3
0.8.2
0.8.1
0.8.0
0.7.6
0.7.5
0.7.4
0.7.3
0.7.2
0.7.1
0.7.0
0.6.0
0.5.3.2
0.5.3
0.4.9
release-1.2-dev
release-1.0.x
Downloads
pdf
html
On Read the Docs
Project Home
Builds

Free document hosting provided by Read the Docs.