RPCSequentialPlugin

class pytorch_lightning.plugins.training_type.RPCSequentialPlugin(balance=None, microbatches=8, checkpoint='except_last', balance_mode='balance_by_size', pipelined_backward=True, rpc_timeout_sec=torch.distributed.rpc.constants.DEFAULT_RPC_TIMEOUT_SEC, **kwargs)[source]

Bases: pytorch_lightning.plugins.training_type.rpc.RPCPlugin

Provides sequential model parallelism for an nn.Sequential module. If the module requires lots of memory, Pipe can be used to reduce the peak memory per device by partitioning the module across multiple GPUs.

Pipeline parallelism comes with checkpointing to reduce the peak memory required to train while minimizing device under-utilization. This is turned on by default and can be turned off via the checkpoint argument.

You should determine the balance when defining the plugin, or you can pass an example input array via the LightningModule so a balance can be inferred. The module will be partitioned across multiple devices according to the given balance. You may also rely on your own heuristics to find an optimal configuration.
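For instance, a balance can be inferred from an example input. The sketch below is illustrative, not part of this API reference: the attribute name sequential_module follows the convention used in the Lightning sequential-parallelism examples, and the layer sizes are arbitrary.

    import torch
    from torch import nn
    import pytorch_lightning as pl


    class MySequentialModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # The nn.Sequential that the plugin partitions across GPUs.
            self.sequential_module = nn.Sequential(
                nn.Linear(32, 32),
                nn.ReLU(),
                nn.Linear(32, 2),
            )
            # With no explicit balance, the plugin can use this example
            # input to infer one via the selected balance_mode heuristic.
            self.example_input_array = torch.randn(1, 32)

        def forward(self, x):
            return self.sequential_module(x)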

Parameters
  • balance (Optional[List[int]]) – The balance of the model, i.e. [2, 2] (two layers on each GPU). If not provided, the user is assumed to provide an input example array so a balance can be found across all GPUs (see the sketch after this list).

  • microbatches (int) – Allows for parallelization to reduce device under-utilization by splitting the batch into smaller micro-batches.

  • checkpoint (str) – Enables gradient checkpointing. ['always', 'except_last', 'never']

  • balance_mode (str) –

    Type of balance heuristic to use if the balance is to be inferred.

    • 'balance_by_size': checks memory usage of each layer and determines balance

    • 'balance_by_time': checks time of each layer and determines balance

  • pipelined_backward (Optional[bool]) – If True, call torch.autograd.backward once per microbatch on the backward pass (instead of once for the whole batch). This works around a potential deadlock in PyTorch when using tensor parallelism at the same time. Defaults to True if get_model_parallel_world_size() > 1.
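As referenced in the balance description above, a minimal construction sketch, reusing the MySequentialModel sketch from earlier. The GPU count and Trainer arguments are illustrative and assume the Trainer API of the Lightning releases that shipped this plugin:

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins.training_type import RPCSequentialPlugin

    # Explicit balance: two layers on each of two GPUs.
    plugin = RPCSequentialPlugin(balance=[2, 2], microbatches=8)

    trainer = Trainer(gpus=2, accelerator="ddp", plugins=[plugin])
    trainer.fit(MySequentialModel())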

barrier(name=None)[source]

Forces all possibly joined processes to wait for each other

Return type

None

post_optimizer_step(optimizer, optimizer_idx, **kwargs)[source]

Hook to do something after each optimizer step.

Return type

None

pre_backward(closure_loss, should_accumulate, optimizer, opt_idx)[source]

Run before the precision plugin executes backward.

rpc_save_model(trainer, save_model_fn, filepath)[source]

Override to save the model to disk. This is required as the main process must handle aggregating the model states from the RPC processes.

Parameters
  • trainer – The trainer object.

  • save_model_fn (Callable) – The saving function used to save the final model.

  • filepath (str) – The filepath to save the model to.

Return type

None
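A minimal override sketch; the subclass name and the log line are hypothetical, and the base implementation is assumed to perform the aggregation described above before writing:

    from pytorch_lightning.plugins.training_type import RPCSequentialPlugin


    class VerboseRPCSequentialPlugin(RPCSequentialPlugin):
        # Hypothetical subclass: adds a log line around the base save
        # logic, which gathers model states from the RPC processes.
        def rpc_save_model(self, trainer, save_model_fn, filepath) -> None:
            print(f"Saving pipeline-parallel model to {filepath}")
            super().rpc_save_model(trainer, save_model_fn, filepath)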