RPCPlugin(rpc_timeout_sec=torch.distributed.rpc.constants.DEFAULT_RPC_TIMEOUT_SEC, parallel_devices=None, num_nodes=None, cluster_environment=None, sync_batchnorm=None, **kwargs)
Backbone for RPC plugins built on top of DDP. RPC introduces different communication behaviour than DDP: unlike DDP, processes are not necessarily required to run the same code as the main process. This leads to edge cases where logic needs to be redefined. This class contains the special cases that must be handled when building custom RPC plugins.
rpc_save_model(trainer, save_model_fn, filepath)
Override to save the model to disk. This is required because the main process is responsible for aggregating model states from the RPC processes.
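As a rough illustration of what such an override might look like, here is a minimal sketch in plain Python. It does not import Lightning: `RPCPluginStandIn`, `AggregatingPlugin`, `worker_states`, and `fake_save` are hypothetical names introduced only for this example. It also assumes that `save_model_fn` can be called as `save_model_fn(trainer, filepath)`, which the text above does not state.

```python
from typing import Any, Callable, Dict, List


class RPCPluginStandIn:
    """Hypothetical stand-in for RPCPlugin; NOT the real Lightning class."""

    def rpc_save_model(self, trainer: Any, save_model_fn: Callable, filepath: str) -> None:
        raise NotImplementedError


class AggregatingPlugin(RPCPluginStandIn):
    """Sketch of a custom plugin that merges per-worker state before saving."""

    def __init__(self, worker_states: List[Dict[str, float]]) -> None:
        # Assumption: each RPC process exposes its shard of the model state
        # as a plain dict. A real plugin would fetch these over RPC.
        self.worker_states = worker_states
        self.merged_state: Dict[str, float] = {}

    def rpc_save_model(self, trainer: Any, save_model_fn: Callable, filepath: str) -> None:
        # Step 1: the main process aggregates the model state from all workers.
        self.merged_state = {}
        for shard in self.worker_states:
            self.merged_state.update(shard)
        # Step 2: delegate the actual write to the supplied save function.
        # Assumption: the callable takes (trainer, filepath).
        save_model_fn(trainer, filepath)


# Usage with a fake save function that only records the requested path.
saved: List[str] = []


def fake_save(trainer: Any, filepath: str) -> None:
    saved.append(filepath)


plugin = AggregatingPlugin([{"layer0.w": 1.0}, {"layer1.w": 2.0}])
plugin.rpc_save_model(trainer=None, save_model_fn=fake_save, filepath="model.ckpt")
```

The point of the sketch is the ordering: the aggregation happens on the main process first, and only then is the save function invoked, so the file on disk reflects the state of every RPC process rather than only the main one.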