Computing cluster (SLURM)

Lightning automates the details behind training on a SLURM-powered cluster.

Multi-node training

To train a model using multiple nodes, do the following:

  1. Design your LightningModule.

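     For reference, a LightningModule can be as small as the sketch below (the model, data, and hyperparameters here are placeholders, not part of the original example):

    import torch
    from torch.nn import functional as F
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl


    class LitClassifier(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # single linear layer as a stand-in for a real model
            self.layer = torch.nn.Linear(28 * 28, 10)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            return {'loss': loss}

        def train_dataloader(self):
            # random tensors as a stand-in for a real dataset
            dataset = TensorDataset(torch.randn(1000, 28 * 28),
                                    torch.randint(0, 10, (1000,)))
            return DataLoader(dataset, batch_size=32)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)
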
  2. Enable DDP in the Trainer

    from pytorch_lightning import Trainer

    # train on 32 GPUs across 4 nodes (8 GPUs per node)
    trainer = Trainer(gpus=8, num_nodes=4, distributed_backend='ddp')
    
  3. It’s a good idea to structure your train.py file like this:

    # train.py
    import os
    from argparse import ArgumentParser

    import pytorch_lightning as pl

    # import your LightningModule here; this example uses the
    # LightningTemplateModel from the Lightning examples


    def main(hparams):
        model = LightningTemplateModel(hparams)

        trainer = pl.Trainer(
            gpus=8,
            num_nodes=4,
            distributed_backend='ddp'
        )

        trainer.fit(model)


    if __name__ == '__main__':
        root_dir = os.path.dirname(os.path.realpath(__file__))

        # add your model/trainer arguments to the parser here
        parent_parser = ArgumentParser(add_help=False)
        hyperparams = parent_parser.parse_args()

        # TRAIN
        main(hyperparams)
    
  4. Create the appropriate SLURM job

    #!/bin/bash -l
    # (submit.sh)

    # SLURM SUBMIT SCRIPT
    # --nodes must match Trainer(num_nodes=...), --gres=gpu must match
    # Trainer(gpus=...), and --ntasks-per-node launches one task per GPU
    #SBATCH --nodes=4
    #SBATCH --gres=gpu:8
    #SBATCH --ntasks-per-node=8
    #SBATCH --mem=0
    #SBATCH --time=0-02:00:00
    
    # activate conda env; the environment name is passed as the first
    # argument to submit.sh
    source activate $1
    
    # -------------------------
    # debugging flags (optional)
    export NCCL_DEBUG=INFO
    export PYTHONFAULTHANDLER=1
    
    # on your cluster you might need these:
    # set the network interface
    # export NCCL_SOCKET_IFNAME=^docker0,lo
    
    # might need the latest cuda
    # module load NCCL/2.4.7-1-cuda.10.0
    # -------------------------
    
    # run script from above
    srun python3 train.py
    
  5. If you want auto-resubmission (see the walltime auto-resubmit section below), add this line to the submit.sh script:

    #SBATCH --signal=SIGUSR1@90
    
  6. Submit the SLURM job, passing the conda environment name that submit.sh activates

    sbatch submit.sh my_conda_env
    

Note

Using a DistributedSampler is already handled by Lightning; you do not need to add one to your DataLoaders yourself.
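
For context, this is roughly the sampler boilerplate that Lightning injects into your DataLoaders when running DDP; the snippet below only illustrates what is being handled for you, not code you need to write (the toy dataset and the explicit num_replicas/rank are placeholders so the snippet runs standalone; in a real DDP run they come from the process group):

import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# toy dataset standing in for your real data
dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 10, (1000,)))

# plain PyTorch DDP would require sharding the data manually like this;
# num_replicas/rank are hard-coded only so the snippet runs outside DDP
sampler = DistributedSampler(dataset, num_replicas=32, rank=0)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)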

Walltime auto-resubmit

When you run Lightning on a SLURM cluster, it automatically detects when the job is about to hit the walltime limit and does the following:

  1. Saves a temporary checkpoint.

  2. Requeues the job.

  3. When the job starts, it loads the temporary checkpoint.

To get this behavior, make sure to add the correct signal to your SLURM script:

# ask SLURM to send SIGUSR1 90 seconds before the walltime is reached
#SBATCH --signal=SIGUSR1@90
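
For illustration, the mechanism boils down to handling the SIGUSR1 signal that submit.sh asks SLURM to send. The sketch below shows the idea in plain Python; it is not Lightning's internal implementation, and the checkpoint step is only indicated in a comment:

import os
import signal
import subprocess


def handle_walltime_signal(signum, frame):
    # 1. save a temporary checkpoint here,
    #    e.g. trainer.save_checkpoint('walltime_temp.ckpt')
    # 2. ask SLURM to requeue this job so it restarts and resumes from it
    job_id = os.environ.get('SLURM_JOB_ID')
    if job_id is not None:
        subprocess.call(['scontrol', 'requeue', job_id])


# register the handler for the signal configured via --signal=SIGUSR1@90
signal.signal(signal.SIGUSR1, handle_walltime_signal)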