
pytorch_lightning.trainer.distrib_parts module

Lightning makes multi-GPU training and 16-bit training trivial.

Note

None of the flags below require changing anything about your LightningModule definition.

Choosing a backend

Lightning supports two backends, DataParallel and DistributedDataParallel, plus the DistributedDataParallel-2 variant described below.

All of them can be used for single-node multi-GPU training. For multi-node training you must use DistributedDataParallel (or its ddp2 variant).

DataParallel (dp)

Splits a batch across multiple GPUs on the same node. Cannot be used for multi-node training.

DistributedDataParallel (ddp)

Trains a copy of the model on each GPU and only syncs gradients. When used with a DistributedSampler, each GPU trains on a subset of the full dataset (see the sketch below).
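
For instance, with a DistributedSampler over a dataset of N samples split across W processes, each GPU sees roughly N/W samples per epoch. A standalone sketch of that sharding (the toy dataset and explicit rank are illustrative only; Lightning wires this up for you):

import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(1000))

# pretend this process is rank 0 of a 4-GPU job
sampler = DistributedSampler(dataset, num_replicas=4, rank=0)

print(len(sampler))  # -> 250: this process trains on a quarter of the data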

DistributedDataParallel-2 (ddp2)

Works like DDP, except each node trains a single copy of the model using ALL GPUs on that node.

Very useful when a loss needs to see an entire node-level batch, e.g. when mining in-batch negative samples (see the sketch below).
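
For example, losses built on in-batch negatives improve as the batch the loss can see grows, and under ddp2 each training step sees the node's full batch. A minimal sketch of such a loss (the function name and pairing convention are illustrative, not part of Lightning's API):

import torch
import torch.nn.functional as F

def in_batch_negative_loss(queries, candidates):
    # queries, candidates: [batch, dim] embeddings; row i of each forms a positive pair.
    # Every other row of candidates serves as a negative for query i,
    # so a larger visible batch yields more (and harder) negatives.
    logits = queries @ candidates.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)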

You can toggle between each mode by setting this flag.

# DEFAULT (when using single GPU or no GPUs)
trainer = Trainer(distributed_backend=None)

# Change to DataParallel (gpus > 1)
trainer = Trainer(distributed_backend='dp')

# change to distributed data parallel (gpus > 1)
trainer = Trainer(distributed_backend='ddp')

# change to distributed data parallel 2 (gpus > 1, DP within each node, DDP across nodes)
trainer = Trainer(distributed_backend='ddp2')

If you request multiple nodes, the backend will auto-switch to ddp.

We recommend you use DistributedDataParallel even for single-node multi-GPU training. It is MUCH faster than DP but may have configuration issues depending on your cluster.

For a deeper understanding of what Lightning is doing, feel free to read this guide.

Distributed and 16-bit precision

Due to an issue with apex and DistributedDataParallel (a PyTorch and NVIDIA issue), Lightning does not allow 16-bit and DP training. We tried to get this to work, but it's an issue on their end.

Below are the possible configurations we support.

| 1 GPU | 1+ GPUs | DP | DDP | 16-bit | command                                                   |
|-------|---------|----|-----|--------|-----------------------------------------------------------|
| Y     |         |    |     |        | Trainer(gpus=1)                                           |
| Y     |         |    |     | Y      | Trainer(gpus=1, use_amp=True)                             |
|       | Y       | Y  |     |        | Trainer(gpus=k, distributed_backend='dp')                 |
|       | Y       |    | Y   |        | Trainer(gpus=k, distributed_backend='ddp')                |
|       | Y       |    | Y   | Y      | Trainer(gpus=k, distributed_backend='ddp', use_amp=True)  |

You also have the option of specifying which GPUs to use by passing a list:

# DEFAULT (int) specifies how many GPUs to use.
Trainer(gpus=k)

# Above is equivalent to
Trainer(gpus=list(range(k)))

# You specify which GPUs (don't use if running on cluster)
Trainer(gpus=[0, 1])

# can also be a string
Trainer(gpus='0, 1')

# can also be -1 or '-1' to use all available GPUs
# equivalent to list(range(torch.cuda.device_count()))
Trainer(gpus=-1)

CUDA flags

CUDA flags make certain GPUs visible to your script.

Lightning sets these for you automatically; there's NO NEED to do this yourself.

# lightning will set according to what you give the trainer
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

However, when using a cluster, Lightning will NOT set these flags (and you should not either).

SLURM will set these for you.
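
If you want to verify which devices a process can see, here's a quick standalone check (purely illustrative; with Lightning you never need to do this):

import os
import torch

# restrict visibility BEFORE anything initializes CUDA
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# only GPU 0 is visible to this process now
print(torch.cuda.device_count())  # -> 1, even on a multi-GPU machine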

16-bit mixed precision

16-bit precision can cut your memory footprint in half. If you're using Volta-architecture GPUs, it can also give a dramatic training speed-up. First, install apex (if the install fails, look here):

$ git clone https://github.com/NVIDIA/apex
$ cd apex

# ------------------------
# OPTIONAL: on your cluster you might need to load cuda 10 or 9
# depending on how you installed PyTorch

# see available modules
module avail

# load correct cuda before install
module load cuda-10.0
# ------------------------

# make sure you've loaded a gcc version > 4.0 and < 7.0
module load gcc-6.1.0

$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Then set use_amp to True:

# DEFAULT
trainer = Trainer(amp_level='O2', use_amp=False)
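
To turn it on, flip the flag (assuming apex is installed as above):

# enable 16-bit
trainer = Trainer(amp_level='O2', use_amp=True)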

Single-gpu

Make sure you're on a GPU machine:

# DEFAULT
trainer = Trainer(gpus=1)

Multi-gpu

Make sure you're on a GPU machine. You can request as many GPUs as you want.

In the example below, DataParallel splits each batch across all 8 GPUs of the node under the hood.

# to use DataParallel
trainer = Trainer(gpus=8, distributed_backend='dp')

# RECOMMENDED use DistributedDataParallel
trainer = Trainer(gpus=8, distributed_backend='ddp')

Custom device selection

The number of GPUs can also be selected with a list of indices or a string containing a comma-separated list of GPU ids. The table below lists examples of possible input formats and how they are interpreted by Lightning. Note in particular the difference between gpus=0, gpus=[0] and gpus="0".

| gpus   | Type     | Parsed         | Meaning            |
|--------|----------|----------------|--------------------|
| None   | NoneType | None           | CPU                |
| 0      | int      | None           | CPU                |
| 3      | int      | [0, 1, 2]      | first 3 GPUs       |
| -1     | int      | [0, 1, 2, ...] | all available GPUs |
| [0]    | list     | [0]            | GPU 0              |
| [1, 3] | list     | [1, 3]         | GPUs 1 and 3       |
| "0"    | str      | [0]            | GPU 0              |
| "3"    | str      | [3]            | GPU 3              |
| "1, 3" | str      | [1, 3]         | GPUs 1 and 3       |
| "-1"   | str      | [0, 1, 2, ...] | all available GPUs |
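
This parsing is done by parse_gpu_ids (documented at the end of this page). A rough sketch of the equivalent logic, for illustration only (the real function also validates that the requested GPUs are available):

import torch

def parse_gpus_sketch(gpus):
    # None or 0 -> run on CPU
    if gpus is None or gpus == 0:
        return None
    # -1 (int or str) -> all available GPUs
    if gpus == -1 or gpus == '-1':
        return list(range(torch.cuda.device_count()))
    # positive int N -> first N GPUs
    if isinstance(gpus, int):
        return list(range(gpus))
    # str like '1, 3' -> specific GPU indices
    if isinstance(gpus, str):
        return [int(x) for x in gpus.split(',')]
    # a list of ints is used as-is
    return list(gpus)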

Multi-node

Multi-node training is easily done by specifying these flags.

# train on 12*8 GPUs
trainer = Trainer(gpus=8, num_nodes=12, distributed_backend='ddp')

You must configure your job submission script correctly for the trainer to work.

Here is an example script for the above trainer configuration.

#!/bin/bash -l

# SLURM SUBMIT SCRIPT
#SBATCH --nodes=12
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --mem=0
#SBATCH --time=0-02:00:00

# activate conda env
conda activate my_env

# -------------------------
# OPTIONAL
# -------------------------
# debugging flags (optional)
# export NCCL_DEBUG=INFO
# export PYTHONFAULTHANDLER=1

# PyTorch comes with prebuilt NCCL support... but if you have issues with it
# you might need to load the latest version from your modules
# module load NCCL/2.4.7-1-cuda.10.0

# on your cluster you might need these:
# set the network interface
# export NCCL_SOCKET_IFNAME=^docker0,lo
# -------------------------

# random port between 12k and 20k
export MASTER_PORT=$((12000 + RANDOM % 8000))

# run script from above
python my_main_file.py

Note

When running in DDP mode, any errors in your code will show up as an NCCL issue. Set the NCCL_DEBUG=INFO flag to see the ACTUAL error.

Normally you would now need to add a DistributedSampler to your dataloader; however, Lightning automates this for you. If you do set your own sampler, Lightning will not interfere with it nor automate it.

Here's an example of how to add your own sampler (again, not needed with Lightning).

import torch
from torch.utils.data import DataLoader

# i.e. this:
dataset = MyDataset()
dataloader = DataLoader(dataset)

# becomes:
dataset = MyDataset()
dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=dist_sampler)

Auto-slurm-job-submission

Instead of manually building SLURM scripts, you can use the SlurmCluster object to do this for you. The SlurmCluster can also run a grid search if you pass in a HyperOptArgumentParser.

Here is an example where you run a grid search of 9 combinations of hyperparams. The full examples are here.

# grid search 3 values of learning rate and 3 values of number of layers for your net
# this generates 9 experiments (lr=1e-3, layers=16), (lr=1e-3, layers=32),
#  (lr=1e-3, layers=64), ... (lr=1e-1, layers=64)
parser = HyperOptArgumentParser(strategy='grid_search', add_help=False)
parser.opt_list('--learning_rate', default=0.001, type=float,
                options=[1e-3, 1e-2, 1e-1], tunable=True)
parser.opt_list('--layers', default=1, type=int, options=[16, 32, 64], tunable=True)
hyperparams = parser.parse_args()

# Slurm cluster submits 9 jobs, each with a set of hyperparams
cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path='/some/path/to/save',
)

# OPTIONAL FLAGS WHICH MAY BE CLUSTER DEPENDENT
# which interface your nodes use for communication
cluster.add_command('export NCCL_SOCKET_IFNAME=^docker0,lo')

# see output of the NCCL connection process
# NCCL is how the nodes talk to each other
cluster.add_command('export NCCL_DEBUG=INFO')

# setting a master port here is a good idea.
cluster.add_command('export MASTER_PORT=%r' % PORT)

# ************** DON'T FORGET THIS ***************
# MUST load the latest NCCL version
cluster.load_modules(['NCCL/2.4.7-1-cuda.10.0'])

# configure cluster
cluster.per_experiment_nb_nodes = 12
cluster.per_experiment_nb_gpus = 8

cluster.add_slurm_cmd(cmd='ntasks-per-node', value=8, comment='1 task per gpu')

# submit a script with 9 combinations of hyper params
# (lr=1e-3, layers=16), (lr=1e-3, layers=32), (lr=1e-3, layers=64), ... (lr=1e-1, layers=64)
cluster.optimize_parallel_cluster_gpu(
    main,
    nb_trials=9, # how many permutations of the grid search to run
    job_name='name_for_squeue'
)

Alternatively, you can generate the scripts yourself via bash or use another library.

Self-balancing architecture

Here Lightning distributes parts of your module across available GPUs to optimize for speed and memory.

class pytorch_lightning.trainer.distrib_parts.TrainerDPMixin[source]

Bases: abc.ABC

_TrainerDPMixin__transfer_data_to_device(batch, device, gpu_id=None)[source]
copy_trainer_model_properties(model)[source]
dp_train(model)[source]
horovod_train(model)[source]
abstract init_optimizers(*args)[source]

Warning: this is just an empty shell for code implemented in another class.

abstract run_pretrain_routine(*args)[source]

Warning: this is just an empty shell for code implemented in another class.

single_gpu_train(model)[source]
tpu_train(tpu_core_idx, model)[source]
transfer_batch_to_gpu(batch, gpu_id)[source]
transfer_batch_to_tpu(batch)[source]
amp_level: str = None[source]
current_tpu_idx: ... = None[source]
data_parallel_device_ids: ... = None[source]
logger: Union[LightningLoggerBase, bool] = None[source]
on_gpu: bool = None[source]
precision: ... = None[source]
proc_rank: int = None[source]
progress_bar_callback: ... = None[source]
root_gpu: ... = None[source]
single_gpu: bool = None[source]
testing: bool = None[source]
tpu_global_core_rank: int = None[source]
tpu_local_core_rank: int = None[source]
abstract property use_amp[source]

Warning: this is just an empty shell for code implemented in another class.

Return type: bool

use_ddp: bool = None[source]
use_ddp2: bool = None[source]
use_dp: bool = None[source]
use_native_amp: bool = None[source]
use_tpu: bool = None[source]
pytorch_lightning.trainer.distrib_parts.check_gpus_data_type(gpus)[source]
Parameters

gpus – the gpus parameter as passed to the Trainer. The function checks that it is one of: None, int, str, or list; it throws otherwise.

Returns

the unmodified gpus variable

pytorch_lightning.trainer.distrib_parts.determine_root_gpu_device(gpus)[source]
Parameters

gpus – a non-empty list of ints representing which GPUs to use

Returns

designated root GPU device

pytorch_lightning.trainer.distrib_parts.get_all_available_gpus()[source]
Returns

a list of all available GPUs

pytorch_lightning.trainer.distrib_parts.normalize_parse_gpu_input_to_list(gpus)[source]
pytorch_lightning.trainer.distrib_parts.normalize_parse_gpu_string_input(s)[source]
pytorch_lightning.trainer.distrib_parts.parse_gpu_ids(gpus)[source]
Parameters

gpus – an int, string, or list. An int -1 or the string '-1' indicates that all available GPUs should be used. A list of ints, or a string containing a comma-separated list of integers, indicates specific GPUs to use. An int 0 means that no GPUs should be used. Any int N > 0 indicates that GPUs [0..N) should be used.

Returns

List of gpus to be used

If no GPUs are available but the gpus variable requests them, a misconfiguration exception is raised.
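
For instance, on a machine with four visible GPUs you could expect the following (illustrative values, assuming all four GPUs are actually available):

from pytorch_lightning.trainer.distrib_parts import parse_gpu_ids

parse_gpu_ids(None)    # None -> run on CPU
parse_gpu_ids(0)       # None -> run on CPU
parse_gpu_ids(3)       # [0, 1, 2]
parse_gpu_ids('1, 3')  # [1, 3]
parse_gpu_ids(-1)      # [0, 1, 2, 3]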

pytorch_lightning.trainer.distrib_parts.pick_multiple_gpus(n)[source]
pytorch_lightning.trainer.distrib_parts.pick_single_gpu(exclude_gpus=[])[source]
pytorch_lightning.trainer.distrib_parts.retry_jittered_backoff(f, num_retries=5)[source]
pytorch_lightning.trainer.distrib_parts.sanitize_gpu_ids(gpus)[source]
Parameters

gpus – a list of ints corresponding to GPU indices. Checks that each of the GPUs in the list is actually available; throws if any of them is not.

Returns

the unmodified gpus variable