pytorch_lightning.trainer.distrib_parts module¶
Lightning makes multi-GPU training and 16-bit training trivial.
Note
None of the flags below require changing anything about your LightningModule definition.
Choosing a backend¶
- Lightning supports two backends: DataParallel and DistributedDataParallel.
Both can be used for single-node multi-GPU training. For multi-node training you must use DistributedDataParallel.
DataParallel (dp)¶
Splits a batch across multiple GPUs on the same node. Cannot be used for multi-node training.
DistributedDataParallel (ddp)¶
Trains a copy of the model on each GPU and only syncs gradients. If used with DistributedSampler, each GPU trains on a subset of the full dataset.
DistributedDataParallel-2 (ddp2)¶
- Works like DDP, except each node trains a single copy of the model using ALL GPUs on that node.
Very useful when dealing with negative samples, etc…
You can toggle between each mode by setting this flag.
# DEFAULT (when using single GPU or no GPUs)
trainer = Trainer(distributed_backend=None)
# change to DataParallel (gpus > 1)
trainer = Trainer(distributed_backend='dp')
# change to DistributedDataParallel (gpus > 1)
trainer = Trainer(distributed_backend='ddp')
# change to DistributedDataParallel-2 (gpus > 1, all GPUs on a node train one model copy)
trainer = Trainer(distributed_backend='ddp2')
- If you request multiple nodes, the backend will auto-switch to ddp (see the sketch after this list).
We recommend you use DistributedDataParallel even for single-node multi-GPU training. It is MUCH faster than DP but may have configuration issues depending on your cluster.
- For a deeper understanding of what Lightning is doing, feel free to read this.
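Requesting more than one node triggers that auto-switch to ddp; a minimal sketch:
# more than one node requested: Lightning switches the backend to ddp
trainer = Trainer(gpus=8, num_nodes=2)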
Distributed and 16-bit precision¶
- Due to an issue with apex and DataParallel (PyTorch and NVIDIA issue), Lightning does
not allow 16-bit and DP training. We tried to get this to work, but it’s an issue on their end.
Below are the possible configurations we support.
| 1 GPU | 1+ GPUs | DP | DDP | 16-bit | command |
|---|---|---|---|---|---|
| Y |  |  |  |  | Trainer(gpus=1) |
| Y |  |  |  | Y | Trainer(gpus=1, use_amp=True) |
|  | Y | Y |  |  | Trainer(gpus=k, distributed_backend='dp') |
|  | Y |  | Y |  | Trainer(gpus=k, distributed_backend='ddp') |
|  | Y |  | Y | Y | Trainer(gpus=k, distributed_backend='ddp', use_amp=True) |
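For example, the last row of the table (multi-GPU DDP with 16-bit precision) corresponds to a Trainer configured roughly like this, where 4 stands in for your GPU count k:
# multi-GPU DDP training with 16-bit precision
trainer = Trainer(gpus=4, distributed_backend='ddp', use_amp=True)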
You also have the option of specifying which GPUs to use by passing a list:
# DEFAULT (int) specifies how many GPUs to use.
Trainer(gpus=k)
# Above is equivalent to
Trainer(gpus=list(range(k)))
# You specify which GPUs (don't use if running on cluster)
Trainer(gpus=[0, 1])
# can also be a string
Trainer(gpus='0, 1')
# can also be -1 or '-1', this uses all available GPUs
# this is equivalent to list(range(torch.cuda.device_count()))
Trainer(gpus=-1)
CUDA flags¶
- CUDA flags make certain GPUs visible to your script.
Lightning sets these for you automatically; there’s NO NEED to do this yourself.
# lightning will set according to what you give the trainer
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
- However, when using a cluster, Lightning will NOT set these flags (and you should not either).
SLURM will set these for you.
16-bit mixed precision¶
- 16-bit precision can cut your memory footprint in half. If you are using Volta-architecture GPUs,
it can also give a dramatic training speed-up. First, install apex (if the install fails, look here):
$ git clone https://github.com/NVIDIA/apex
$ cd apex

# ------------------------
# OPTIONAL: on your cluster you might need to load cuda 10 or 9
# depending on how you installed PyTorch

# see available modules
module avail

# load correct cuda before install
module load cuda-10.0
# ------------------------

# make sure you've loaded a GCC version > 4.0 and < 7.0
module load gcc-6.1.0

$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
Then set use_amp to True in the Trainer. The default is:
# DEFAULT
trainer = Trainer(amp_level='O2', use_amp=False)
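To actually turn mixed precision on, flip the flag (a minimal sketch keeping the default amp_level='O2'):
# enable 16-bit mixed precision
trainer = Trainer(amp_level='O2', use_amp=True)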
Multi-gpu¶
- Make sure you’re on a GPU machine. You can request as many GPUs as you want.
In the first example below, the model runs on all 8 GPUs at once, using DataParallel under the hood.
# to use DataParallel
trainer = Trainer(gpus=8, distributed_backend='dp')
# RECOMMENDED use DistributedDataParallel
trainer = Trainer(gpus=8, distributed_backend='ddp')
Custom device selection¶
The number of GPUs can also be selected with a list of indices or a string containing a comma-separated list of GPU ids. The table below lists examples of possible input formats and how they are interpreted by Lightning. Note in particular the difference between gpus=0, gpus=[0] and gpus="0".
| gpus | Type | Parsed | Meaning |
|---|---|---|---|
| None | NoneType | None | CPU |
| 0 | int | None | CPU |
| 3 | int | [0, 1, 2] | first 3 GPUs |
| -1 | int | [0, 1, 2, …] | all available GPUs |
| [0] | list | [0] | GPU 0 |
| [1, 3] | list | [1, 3] | GPUs 1 and 3 |
| "0" | str | [0] | GPU 0 |
| "3" | str | [3] | GPU 3 |
| "1, 3" | str | [1, 3] | GPUs 1 and 3 |
| "-1" | str | [0, 1, 2, …] | all available GPUs |
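For instance, GPUs 1 and 3 can be selected with either the list or the string form from the table above:
# both of these select GPUs 1 and 3
trainer = Trainer(gpus=[1, 3])
trainer = Trainer(gpus='1, 3')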
Multi-node¶
Multi-node training is easily done by specifying these flags.
# train on 12*8 GPUs
trainer = Trainer(gpus=8, num_nodes=12, distributed_backend='ddp')
- You must configure your job submission script correctly for the trainer to work.
Here is an example script for the above trainer configuration.
#!/bin/bash -l
# SLURM SUBMIT SCRIPT
#SBATCH --nodes=12
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --mem=0
#SBATCH --time=0-02:00:00
# activate conda env
conda activate my_env
# -------------------------
# OPTIONAL
# -------------------------
# debugging flags (optional)
# export NCCL_DEBUG=INFO
# export PYTHONFAULTHANDLER=1
# PyTorch comes with prebuilt NCCL support... but if you have issues with it
# you might need to load the latest version from your modules
# module load NCCL/2.4.7-1-cuda.10.0
# on your cluster you might need these:
# set the network interface
# export NCCL_SOCKET_IFNAME=^docker0,lo
# -------------------------
# random port between 12000 and 20000
export MASTER_PORT=$((12000 + RANDOM % 8000))
# run script from above
python my_main_file.py
Note
When running in DDP mode, any errors in your code will show up as an NCCL issue. Set the NCCL_DEBUG=INFO flag to see the ACTUAL error.
Normally you would need to add a distributed sampler to your dataset; Lightning automates this for you. If you do set your own sampler, Lightning will not interfere with it or override it.
Here’s an example of how to add your own sampler (again, not needed with Lightning).
from torch.utils.data import DataLoader

# i.e. this:
dataset = MyDataset()
dataloader = DataLoader(dataset)
# becomes:
dataset = MyDataset()
dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=dist_sampler)
Auto-slurm-job-submission¶
Instead of manually building SLURM scripts, you can use the SlurmCluster object to do this for you. The SlurmCluster can also run a grid search if you pass in a HyperOptArgumentParser.
Here is an example where you run a grid search of 9 combinations of hyperparams. The full examples are here.
# grid search 3 values of learning rate and 3 values of number of layers for your net
# this generates 9 experiments (lr=1e-3, layers=16), (lr=1e-3, layers=32),
# (lr=1e-3, layers=64), ... (lr=1e-1, layers=64)
parser = HyperOptArgumentParser(strategy='grid_search', add_help=False)
parser.opt_list('--learning_rate', default=0.001, type=float,
options=[1e-3, 1e-2, 1e-1], tunable=True)
parser.opt_list('--layers', default=1, type=int, options=[16, 32, 64], tunable=True)
hyperparams = parser.parse_args()
# Slurm cluster submits 9 jobs, each with a set of hyperparams
cluster = SlurmCluster(
hyperparam_optimizer=hyperparams,
log_path='/some/path/to/save',
)
# OPTIONAL FLAGS WHICH MAY BE CLUSTER DEPENDENT
# which interface your nodes use for communication
cluster.add_command('export NCCL_SOCKET_IFNAME=^docker0,lo')
# see output of the NCCL connection process
# NCCL is how the nodes talk to each other
cluster.add_command('export NCCL_DEBUG=INFO')
# setting a master port here is a good idea.
cluster.add_command('export MASTER_PORT=%r' % PORT)
# ************** DON'T FORGET THIS ***************
# MUST load the latest NCCL version
cluster.load_modules(['NCCL/2.4.7-1-cuda.10.0'])
# configure cluster
cluster.per_experiment_nb_nodes = 12
cluster.per_experiment_nb_gpus = 8
cluster.add_slurm_cmd(cmd='ntasks-per-node', value=8, comment='1 task per gpu')
# submit a script with 9 combinations of hyper params
# (lr=1e-3, layers=16), (lr=1e-3, layers=32), (lr=1e-3, layers=64), ... (lr=1e-1, layers=64)
cluster.optimize_parallel_cluster_gpu(
main,
nb_trials=9, # how many permutations of the grid search to run
job_name='name_for_squeue'
)
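For reference, the main function handed to optimize_parallel_cluster_gpu receives the hyperparameters for one trial. A minimal sketch, where MyModel is a hypothetical LightningModule and the extra cluster argument is an assumption about how test_tube invokes the function:
# hypothetical entry point for a single grid-search trial
def main(hparams, cluster=None):
    # build the model from this trial's hyperparameters (hypothetical model class)
    model = MyModel(hparams)
    # same multi-node settings as the SLURM configuration above
    trainer = Trainer(gpus=8, num_nodes=12, distributed_backend='ddp')
    trainer.fit(model)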
Alternatively, you can generate the scripts yourself via a bash command or use another library.
Self-balancing architecture¶
Here Lightning distributes parts of your module across available GPUs to optimize for speed and memory.
class pytorch_lightning.trainer.distrib_parts.TrainerDPMixin[source]¶
Bases: abc.ABC

abstract init_optimizers(*args)[source]¶
Warning: this is just an empty shell for code implemented in another class.

abstract run_pretrain_routine(*args)[source]¶
Warning: this is just an empty shell for code implemented in another class.
pytorch_lightning.trainer.distrib_parts.check_gpus_data_type(gpus)[source]¶
- Parameters
gpus – the gpus parameter as passed to the Trainer. Checks that it is one of: None, int, string or list; throws otherwise.
- Returns
the unmodified gpus variable
pytorch_lightning.trainer.distrib_parts.determine_root_gpu_device(gpus)[source]¶
- Parameters
gpus – a non-empty list of ints representing which GPUs to use
- Returns
the designated root GPU device
pytorch_lightning.trainer.distrib_parts.get_all_available_gpus()[source]¶
- Returns
a list of all available GPUs
pytorch_lightning.trainer.distrib_parts.parse_gpu_ids(gpus)[source]¶
- Parameters
gpus – an int, string or list. An int -1 or the string '-1' indicates that all available GPUs should be used. A list of ints, or a string containing a comma-separated list of integers, indicates specific GPUs to use. An int 0 means that no GPUs should be used. Any int N > 0 indicates that GPUs [0..N) should be used.
- Returns
the list of GPUs to be used
If no GPUs are available but the gpus variable requests GPUs, a MisconfigurationException is raised.
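A minimal sketch of how these parsing rules behave, assuming a machine with at least four visible GPUs (outputs follow the table in the Custom device selection section above):
from pytorch_lightning.trainer.distrib_parts import parse_gpu_ids

parse_gpu_ids(None)     # -> None (run on CPU)
parse_gpu_ids(0)        # -> None (run on CPU)
parse_gpu_ids(3)        # -> [0, 1, 2] (first 3 GPUs)
parse_gpu_ids([1, 3])   # -> [1, 3]
parse_gpu_ids('1, 3')   # -> [1, 3]
parse_gpu_ids(-1)       # -> [0, 1, 2, ...] (all available GPUs)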