pytorch_lightning.trainer.distrib_data_parallel module¶
Lightning supports model training on a cluster managed by SLURM in the following cases:
Training on a single CPU or a single GPU.
Training on multiple GPUs on the same node using DataParallel or DistributedDataParallel.
Training across multiple GPUs on multiple nodes using DistributedDataParallel.
Note
A node means a machine with multiple GPUs.
Running grid search on a cluster¶
To use Lightning to run a hyperparameter search (grid search or random search) on a cluster, do 4 things:
(1). Define the parameters for the grid search
from test_tube import HyperOptArgumentParser
# subclass of argparse
parser = HyperOptArgumentParser(strategy='random_search')
parser.add_argument('--learning_rate', default=0.002, type=float, help='the learning rate')
# let's enable optimizing over the number of layers in the network
parser.opt_list('--nb_layers', default=2, type=int, tunable=True, options=[2, 4, 8])
hparams = parser.parse_args()
Note
You must set tunable=True for an argument to be considered in the permutation set. Otherwise test-tube will use the default value. This flag is useful when you don't want to search over an argument and want to use the default instead.
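For illustration, a minimal sketch of an argument that is excluded from the search because tunable=False (the --dropout option here is a hypothetical addition, not part of the example above):

# dropout is NOT searched over: tunable=False, so every trial uses the default of 0.5
parser.opt_list('--dropout', default=0.5, type=float, tunable=False, options=[0.3, 0.5, 0.7])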
(2). Define the cluster options in the SlurmCluster object (over 5 nodes and 8 GPUs)
from test_tube.hpc import SlurmCluster

# hyperparameters is a test-tube hyper params object
# see https://williamfalcon.github.io/test-tube/hyperparameter_optimization/HyperOptArgumentParser/
hyperparams = parser.parse_args()

# init cluster
cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path='/path/to/log/results/to',
    python_cmd='python3'
)

# let the cluster know where to email for a change in job status (ie: complete, fail, etc...)
cluster.notify_job_status(email='some@email.com', on_done=True, on_fail=True)

# set the job options. In this instance, we'll run 20 different models,
# each with its own set of hyperparameters, and give each one 5 nodes with 8 GPUs per node
cluster.per_experiment_nb_gpus = 8
cluster.per_experiment_nb_nodes = 5

# we'll request 10GB of memory per node
cluster.memory_mb_per_node = 10000

# set a walltime of 10 minutes
cluster.job_time = '10:00'
(3). Make a main function with your model and trainer. Each job will call this function with a particular hparams configuration:
from pytorch_lightning import Trainer

def train_fx(trial_hparams, cluster_manager, _):
    # trial_hparams holds a specific set of hyperparameters for this job
    my_model = MyLightningModel()

    trainer = Trainer()
    trainer.fit(my_model)
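Each job receives its own trial_hparams, so in practice you would pass it into the model so the sampled values (e.g. nb_layers) are actually used. A minimal sketch, assuming MyLightningModel accepts a hyperparameter namespace in its constructor (a placeholder, not a Lightning API requirement):

def train_fx(trial_hparams, cluster_manager, _):
    # build the model from this trial's sampled hyperparameters
    my_model = MyLightningModel(trial_hparams)

    trainer = Trainer()
    trainer.fit(my_model)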
(4). Start the grid/random search:
# run the models on the cluster
cluster.optimize_parallel_cluster_gpu(
train_fx,
nb_trials=20,
job_name='my_grid_search_exp_name',
job_display_name='my_exp')
Note
nb_trials specifies how many of the possible permutations to use. With grid_search the permutations are taken in depth-first order; with random_search the first k shuffled options are used. FYI, random search has been shown to be just as good as Bayesian optimization methods when using a reasonable number of samples (60); see this paper for more information.
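To sanity-check the permutations locally before submitting, a small sketch using test-tube's trials() helper on the parsed hyperparameter object (see the test-tube docs linked above; treat the exact helper name as an assumption for your version):

# enumerate a few candidate configurations without touching the cluster
for trial in hyperparams.trials(3):
    print(trial.nb_layers, trial.learning_rate)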
Walltime auto-resubmit¶
Lightning automatically resubmits jobs when they reach the walltime. Make sure to request the SIGUSR1 signal in your SLURM script:
# 90 seconds before training ends
#SBATCH --signal=SIGUSR1@90
When Lightning receives the SIGUSR1 signal it will:
1. Save a checkpoint with 'hpc_ckpt' in the name.
2. Resubmit the job using the SLURM_JOB_ID.

When the script starts again, Lightning will:
1. Search for an 'hpc_ckpt' checkpoint.
2. Restore the model, optimizers, schedulers, epoch, etc.
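Putting it together, a minimal submission script might look like the sketch below (the job name, resource lines, and train.py entry point are placeholders; only the --signal line is needed for auto-resubmission):

#!/bin/bash
#SBATCH --job-name=lightning_job   # placeholder name
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --time=10:00
# ask SLURM to send SIGUSR1 90 seconds before the walltime is reached
#SBATCH --signal=SIGUSR1@90

# launch training as a job step so the signal reaches the python process
srun python3 train.py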
class pytorch_lightning.trainer.distrib_data_parallel.TrainerDDPMixin[source]¶
Bases: abc.ABC

check_horovod()[source]¶
Raises a MisconfigurationException if the Trainer is not configured correctly for Horovod.

abstract copy_trainer_model_properties(*args)[source]¶
Warning: this is just an empty shell for code implemented in another class.

ddp_train(process_idx, model)[source]¶
Entry point into a DDP thread.
Parameters: gpu_idx, model, cluster_obj

abstract init_optimizers(*args)[source]¶
Warning: this is just an empty shell for code implemented in another class.

load_spawn_weights(original_model)[source]¶
Load the temporary weights saved by the DDP process in order to recover the trained model.
Parameters: model

abstract run_pretrain_routine(*args)[source]¶
Warning: this is just an empty shell for code implemented in another class.

save_spawn_weights(model)[source]¶
Dump a temporary checkpoint after DDP ends so the weights can be recovered outside the process.
Parameters: model