Computing cluster (SLURM)

Lightning automates the details behind training on a SLURM-powered cluster.

Multi-node training

To train a model using multiple nodes, do the following:

  1. Design your LightningModule.

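     For reference, a LightningModule can be as small as the sketch below (the model, data, and hyperparameters here are placeholders, not part of the original example):

    import torch
    from torch.nn import functional as F
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl


    class LitClassifier(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # single linear layer as a stand-in for a real model
            self.layer = torch.nn.Linear(28 * 28, 10)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            return {'loss': loss}

        def train_dataloader(self):
            # random tensors as a stand-in for a real dataset
            dataset = TensorDataset(torch.randn(1000, 28 * 28),
                                    torch.randint(0, 10, (1000,)))
            return DataLoader(dataset, batch_size=32)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)
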
  2. Enable DDP in the Trainer

    from pytorch_lightning import Trainer

    # train on 32 GPUs across 4 nodes (8 GPUs per node)
    trainer = Trainer(gpus=8, num_nodes=4, distributed_backend='ddp')
    
  3. It’s a good idea to structure your train.py file like this:

    # train.py
    import os
    from argparse import ArgumentParser

    import pytorch_lightning as pl

    # import your LightningModule here; this example uses the
    # LightningTemplateModel from the Lightning examples


    def main(hparams):
        model = LightningTemplateModel(hparams)

        trainer = pl.Trainer(
            gpus=8,
            num_nodes=4,
            distributed_backend='ddp'
        )

        trainer.fit(model)


    if __name__ == '__main__':
        root_dir = os.path.dirname(os.path.realpath(__file__))

        # add your model/trainer arguments to the parser here
        parent_parser = ArgumentParser(add_help=False)
        hyperparams = parent_parser.parse_args()

        # TRAIN
        main(hyperparams)
    
  4. Create the appropriate SLURM job

    #!/bin/bash -l
    # (submit.sh)

    # SLURM SUBMIT SCRIPT
    # --nodes must match Trainer(num_nodes=...), --gres=gpu must match
    # Trainer(gpus=...), and --ntasks-per-node launches one task per GPU
    #SBATCH --nodes=4
    #SBATCH --gres=gpu:8
    #SBATCH --ntasks-per-node=8
    #SBATCH --mem=0
    #SBATCH --time=0-02:00:00
    
    # activate conda env; the environment name is passed as the first
    # argument to submit.sh
    source activate $1
    
    # -------------------------
    # debugging flags (optional)
    export NCCL_DEBUG=INFO
    export PYTHONFAULTHANDLER=1
    
    # on your cluster you might need these:
    # set the network interface
    # export NCCL_SOCKET_IFNAME=^docker0,lo
    
    # might need the latest cuda
    # module load NCCL/2.4.7-1-cuda.10.0
    # -------------------------
    
    # run script from above
    srun python3 train.py
    
  5. If you want auto-resubmission (see the walltime auto-resubmit section below), add this line to the submit.sh script:

    #SBATCH --signal=SIGUSR1@90
    
  6. Submit the SLURM job, passing the conda environment name that submit.sh activates

    sbatch submit.sh my_conda_env
    

Note

Using a DistributedSampler is already handled by Lightning; you do not need to add one to your DataLoaders yourself.
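
For context, this is roughly the sampler boilerplate that Lightning injects into your DataLoaders when running DDP; the snippet below only illustrates what is being handled for you, not code you need to write (the toy dataset and the explicit num_replicas/rank are placeholders so the snippet runs standalone; in a real DDP run they come from the process group):

import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# toy dataset standing in for your real data
dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 10, (1000,)))

# plain PyTorch DDP would require sharding the data manually like this;
# num_replicas/rank are hard-coded only so the snippet runs outside DDP
sampler = DistributedSampler(dataset, num_replicas=32, rank=0)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)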

Walltime auto-resubmit

When you run Lightning on a SLURM cluster, it automatically detects when the job is about to hit the walltime limit and does the following:

  1. Saves a temporary checkpoint.

  2. Requeues the job.

  3. When the job starts, it loads the temporary checkpoint.

To get this behavior, make sure to add the correct signal to your SLURM script:

# ask SLURM to send SIGUSR1 90 seconds before the walltime is reached
#SBATCH --signal=SIGUSR1@90
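
For illustration, the mechanism boils down to handling the SIGUSR1 signal that submit.sh asks SLURM to send. The sketch below shows the idea in plain Python; it is not Lightning's internal implementation, and the checkpoint step is only indicated in a comment:

import os
import signal
import subprocess


def handle_walltime_signal(signum, frame):
    # 1. save a temporary checkpoint here,
    #    e.g. trainer.save_checkpoint('walltime_temp.ckpt')
    # 2. ask SLURM to requeue this job so it restarts and resumes from it
    job_id = os.environ.get('SLURM_JOB_ID')
    if job_id is not None:
        subprocess.call(['scontrol', 'requeue', job_id])


# register the handler for the signal configured via --signal=SIGUSR1@90
signal.signal(signal.SIGUSR1, handle_walltime_signal)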