RPCSequentialPlugin

class pytorch_lightning.plugins.training_type.RPCSequentialPlugin(balance=None, microbatches=8, checkpoint='except_last', balance_mode='balance_by_size', pipelined_backward=True, rpc_timeout_sec=torch.distributed.rpc.constants.DEFAULT_RPC_TIMEOUT_SEC, **kwargs)[source]

Bases: pytorch_lightning.plugins.training_type.rpc.RPCPlugin

Provides sequential model parallelism for an nn.Sequential module. If the module requires lots of memory, Pipe can be used to reduce the peak memory per device by partitioning the module across multiple GPUs.

Pipeline parallelism comes with checkpointing to reduce the peak memory required to train while minimizing device under-utilization. This is turned on by default and can be turned off via the checkpoint argument.

You should determine the balance when defining the plugin, or you can pass an example input array via the LightningModule so a balance can be inferred. The module will be partitioned across multiple devices according to the given balance. You may also rely on your own heuristics to find an optimal configuration.
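For instance, a balance can be inferred from an example input. The sketch below is illustrative, not part of this API reference: the attribute name sequential_module follows the convention used in the Lightning sequential-parallelism examples, and the layer sizes are arbitrary.

    import torch
    from torch import nn
    import pytorch_lightning as pl


    class MySequentialModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # The nn.Sequential that the plugin partitions across GPUs.
            self.sequential_module = nn.Sequential(
                nn.Linear(32, 32),
                nn.ReLU(),
                nn.Linear(32, 2),
            )
            # With no explicit balance, the plugin can use this example
            # input to infer one via the selected balance_mode heuristic.
            self.example_input_array = torch.randn(1, 32)

        def forward(self, x):
            return self.sequential_module(x)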

Parameters
  • balance (Optional[List[int]]) – The balance of the model, i.e. [2, 2] (two layers on each GPU). If not provided, the user is assumed to provide an input example array so a balance can be found across all GPUs (see the sketch after this list).

  • microbatches (int) – Allows for parallelization to reduce device under-utilization by splitting the batch into smaller micro-batches.

  • checkpoint (str) – Enables gradient checkpointing. ['always', 'except_last', 'never']

  • balance_mode (str) –

    Type of balance heuristic to use if the balance is to be inferred.

    • 'balance_by_size': checks memory usage of each layer and determines balance

    • 'balance_by_time': checks time of each layer and determines balance

  • pipelined_backward (Optional[bool]) – If True, call torch.autograd.backward once per microbatch on the backward pass (instead of once for the whole batch). This works around a potential deadlock in PyTorch when using tensor parallelism at the same time. Defaults to True if get_model_parallel_world_size() > 1.
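As referenced in the balance description above, a minimal construction sketch, reusing the MySequentialModel sketch from earlier. The GPU count and Trainer arguments are illustrative and assume the Trainer API of the Lightning releases that shipped this plugin:

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins.training_type import RPCSequentialPlugin

    # Explicit balance: two layers on each of two GPUs.
    plugin = RPCSequentialPlugin(balance=[2, 2], microbatches=8)

    trainer = Trainer(gpus=2, accelerator="ddp", plugins=[plugin])
    trainer.fit(MySequentialModel())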

barrier(name=None)[source]

Forces all possibly joined processes to wait for each other

Return type

None

post_optimizer_step(optimizer, optimizer_idx, **kwargs)[source]

Hook to do something after each optimizer step.

Return type

None

pre_backward(closure_loss, should_accumulate, optimizer, opt_idx)[source]

Run before the precision plugin executes backward.

rpc_save_model(trainer, save_model_fn, filepath)[source]

Override to save the model to disk. This is required as the main process must handle aggregating the model states from the RPC processes.

Parameters
  • trainer – The trainer object.

  • save_model_fn (Callable) – The saving function used to save the final model.

  • filepath (str) – The filepath to save the model to.

Return type

None
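A minimal override sketch; the subclass name and the log line are hypothetical, and the base implementation is assumed to perform the aggregation described above before writing:

    from pytorch_lightning.plugins.training_type import RPCSequentialPlugin


    class VerboseRPCSequentialPlugin(RPCSequentialPlugin):
        # Hypothetical subclass: adds a log line around the base save
        # logic, which gathers model states from the RPC processes.
        def rpc_save_model(self, trainer, save_model_fn, filepath) -> None:
            print(f"Saving pipeline-parallel model to {filepath}")
            super().rpc_save_model(trainer, save_model_fn, filepath)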