pytorch_lightning.trainer.distrib_data_parallel module¶
Lightning supports model training on a cluster managed by SLURM in the following cases:
Training on a single CPU or a single GPU.
Training on multiple GPUs on the same node using DataParallel or DistributedDataParallel.
Training across multiple GPUs on multiple nodes using DistributedDataParallel.
Note
A node means a machine with multiple GPUs.
Running grid search on a cluster¶
To use Lightning to run a hyperparameter search (grid search or random search) on a cluster, do 4 things:
(1). Define the parameters for the grid search
from test_tube import HyperOptArgumentParser
# subclass of argparse
parser = HyperOptArgumentParser(strategy='random_search')
parser.add_argument('--learning_rate', default=0.002, type=float, help='the learning rate')
# let's enable optimizing over the number of layers in the network
parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])
hparams = parser.parse_args()
Note
You must set tunable=True for an argument to be considered in the permutation set. Otherwise test-tube will use the default value. This flag is useful when you don't want to search over an argument and want to use the default instead.
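For illustration, a minimal sketch of an argument that is excluded from the search because tunable=False (the --dropout option here is a hypothetical addition, not part of the example above):

# dropout is NOT searched over: tunable=False, so every trial uses the default of 0.5
parser.opt_list('--dropout', default=0.5, type=float, tunable=False, options=[0.3, 0.5, 0.7])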
(2). Define the cluster options in the SlurmCluster object (over 5 nodes and 8 GPUs)
from test_tube.hpc import SlurmCluster

# hyperparameters is a test-tube hyper params object
# see https://williamfalcon.github.io/test-tube/hyperparameter_optimization/HyperOptArgumentParser/
hyperparams = parser.parse_args()

# init cluster
cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path='/path/to/log/results/to',
    python_cmd='python3'
)

# let the cluster know where to email for a change in job status (ie: complete, fail, etc...)
cluster.notify_job_status(email='some@email.com', on_done=True, on_fail=True)

# set the job options. In this instance, we'll run 20 different models,
# each with its own set of hyperparameters, and give each one 5 nodes with 8 GPUs per node
cluster.per_experiment_nb_gpus = 8
cluster.per_experiment_nb_nodes = 5

# we'll request 10GB of memory per node
cluster.memory_mb_per_node = 10000

# set a walltime of 10 minutes
cluster.job_time = '10:00'
(3). Make a main function with your model and trainer. Each job will call this function with a particular hparams configuration:
from pytorch_lightning import Trainer

def train_fx(trial_hparams, cluster_manager, _):
    # trial_hparams holds a specific set of hyperparameters for this job
    my_model = MyLightningModel()

    trainer = Trainer()
    trainer.fit(my_model)
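Each job receives its own trial_hparams, so in practice you would pass it into the model so the sampled values (e.g. nb_layers) are actually used. A minimal sketch, assuming MyLightningModel accepts a hyperparameter namespace in its constructor (a placeholder, not a Lightning API requirement):

def train_fx(trial_hparams, cluster_manager, _):
    # build the model from this trial's sampled hyperparameters
    my_model = MyLightningModel(trial_hparams)

    trainer = Trainer()
    trainer.fit(my_model)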
(4). Start the grid/random search:
# run the models on the cluster
cluster.optimize_parallel_cluster_gpu(
train_fx,
nb_trials=20,
job_name='my_grid_search_exp_name',
job_display_name='my_exp')
Note
nb_trials specifies how many of the possible permutations to use. With grid_search the permutations are taken in depth-first order; with random_search the first k shuffled options are used. FYI, random search has been shown to be just as good as Bayesian optimization methods when using a reasonable number of samples (60); see this paper for more information.
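To sanity-check the permutations locally before submitting, a small sketch using test-tube's trials() helper on the parsed hyperparameter object (see the test-tube docs linked above; treat the exact helper name as an assumption for your version):

# enumerate a few candidate configurations without touching the cluster
for trial in hyperparams.trials(3):
    print(trial.nb_layers, trial.learning_rate)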
Walltime auto-resubmit¶
Lightning automatically resubmits jobs when they reach the walltime. Make sure to request the SIGUSR1 signal in your SLURM script:
# 90 seconds before training ends
#SBATCH --signal=SIGUSR1@90
When Lightning receives the SIGUSR1 signal it will:
1. Save a checkpoint with 'hpc_ckpt' in the name.
2. Resubmit the job using the SLURM_JOB_ID.

When the script starts again, Lightning will:
1. Search for an 'hpc_ckpt' checkpoint.
2. Restore the model, optimizers, schedulers, epoch, etc.
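Putting it together, a minimal submission script might look like the sketch below (the job name, resource lines, and train.py entry point are placeholders; only the --signal line is needed for auto-resubmission):

#!/bin/bash
#SBATCH --job-name=lightning_job   # placeholder name
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --time=10:00
# ask SLURM to send SIGUSR1 90 seconds before the walltime is reached
#SBATCH --signal=SIGUSR1@90

# launch training as a job step so the signal reaches the python process
srun python3 train.py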
class pytorch_lightning.trainer.distrib_data_parallel.TrainerDDPMixin[source]¶
Bases: abc.ABC

check_horovod()[source]¶
Raises a MisconfigurationException if the Trainer is not configured correctly for Horovod.

abstract copy_trainer_model_properties(*args)[source]¶
Warning: this is just an empty shell for code implemented in another class.

ddp_train(process_idx, model)[source]¶
Entry point into a DDP thread.
Parameters: gpu_idx, model, cluster_obj

abstract init_optimizers(*args)[source]¶
Warning: this is just an empty shell for code implemented in another class.

load_spawn_weights(original_model)[source]¶
Load the temporary weights saved by the DDP process in order to recover the trained model.
Parameters: model

abstract run_pretrain_routine(*args)[source]¶
Warning: this is just an empty shell for code implemented in another class.

save_spawn_weights(model)[source]¶
Dump a temporary checkpoint after DDP ends so the weights can be recovered outside the process.
Parameters: model