HivemindStrategy

class pytorch_lightning.strategies.HivemindStrategy(target_batch_size, run_id='lightning_run', batch_size=None, delay_state_averaging=False, delay_optimizer_step=None, delay_grad_averaging=False, offload_optimizer=None, reuse_grad_buffers=False, scheduler_fn=None, matchmaking_time=5.0, averaging_timeout=30.0, verbose=False, averager_opts=None, host_maddrs=None, initial_peers=None, **optimizer_kwargs)

Bases: pytorch_lightning.strategies.strategy.Strategy

Provides capabilities to train with the Hivemind library, which enables collaborative training across the internet on unreliable machines. For more information, refer to the docs.

Warning

HivemindStrategy is experimental and subject to change.
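As a minimal usage sketch, the strategy is constructed and handed to the Trainer. This assumes a pytorch_lightning release that still ships HivemindStrategy and that the hivemind package is installed. The helper name build_trainer and the value 8192 are illustrative choices, not part of the API.

```python
def build_trainer(target_batch_size: int = 8192):
    """Create a Trainer that uses HivemindStrategy.

    The imports are done lazily so this module can be loaded even when
    pytorch_lightning or hivemind is not installed (assumption: both are
    available when the function is actually called).
    """
    import pytorch_lightning as pl
    from pytorch_lightning.strategies import HivemindStrategy

    # target_batch_size is the only required argument of the strategy.
    strategy = HivemindStrategy(target_batch_size=target_batch_size)

    # One device per process; each peer joins the run as its own process.
    return pl.Trainer(strategy=strategy, accelerator="auto", devices=1)
```

Calling build_trainer() and then trainer.fit(model) on each participating machine would start or join a collaborative run; further peers need initial_peers, described below.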

Parameters

target_batch_size (int) – When training, the batch size to accumulate to before running a step. The larger this batch size, the more work can be done asynchronously without communication.
run_id (str) – A unique identifier of this training run, used as a common prefix for all DHT keys. See https://learning-at-home.readthedocs.io/en/latest/user/dht.html.
batch_size (Optional[int]) – The local batch size per process. If not provided, it is inferred lazily from the first batch of data passed in during training. This value should not change throughout training.
delay_state_averaging (bool) – If enabled, average parameters and extra tensors in a background thread; if set to False (the default), average parameters synchronously within the corresponding hivemind.Optimizer.step() call.
delay_optimizer_step (Optional[bool]) – Run the optimizer in the background and apply the results in a future .step() call. Requires offload_optimizer.
delay_grad_averaging (bool) – Average gradients in the background. Requires offload_optimizer and delay_optimizer_step.
offload_optimizer (Optional[bool]) – Offload the optimizer to host memory, saving GPU memory for parameters and gradients.
reuse_grad_buffers (bool) – Use the model’s gradient buffers (params.grad) for gradient accumulation, which is more memory efficient. Lightning will automatically disable zero_grad in the LightningModule.
scheduler_fn (Optional[Callable]) – A callable(optimizer) -> PyTorch LRScheduler, or a pre-initialized PyTorch scheduler. When using offload_optimizer, delay_optimizer_step, or delay_state_averaging, scheduler_fn must be passed to the HivemindStrategy, because the optimizer is re-created and the scheduler needs to be re-created as well.
matchmaking_time (float) – When looking for a group, wait for peers to join for up to this many seconds. Increase if you see “averaged gradients with N peers” where N is below 0.9x the actual number of peers on >=25% of epochs. On a low-latency network, decreasing matchmaking_time allows training with smaller batch sizes.
averaging_timeout (float) – If an averaging step hangs for this long, it will be cancelled automatically. Increase averaging_timeout if you see “Proceeding with local gradients” at least 25% of the time. Do not set this timeout too high, as it may cause your optimizer to hang after some types of network errors.
verbose (bool) – Report internal Hivemind events such as accumulating gradients and running background tasks.
averager_opts (Optional[Dict]) – Additional keyword arguments forwarded to both GradientAverager and TrainingStateAverager.
host_maddrs (Optional[List]) – List of multi-addrs to create visible peers for other processes. See https://learning-at-home.readthedocs.io/en/latest/user/dht.html#running-across-the-internet
initial_peers (Union[str, List, None]) – If connecting to a running process, a list of initial peers needs to be passed in. This can also be set via the env variable INITIAL_PEERS.
**optimizer_kwargs – Additional keyword arguments passed to the hivemind.Optimizer class.
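Two points from the parameter list can be sketched in plain Python: the dependencies between the delay flags, and the arithmetic behind target_batch_size. Both helpers below (check_delay_flags, local_batches_per_step) are hypothetical and not part of Lightning or Hivemind. The flag check only rejects combinations where a required flag is explicitly False; how unset (None) values are resolved is left to the library and is not modeled here. The batch estimate assumes all peers contribute samples at the same rate, which real runs on unreliable machines will not guarantee.

```python
import math
from typing import Optional


def check_delay_flags(
    offload_optimizer: Optional[bool] = None,
    delay_optimizer_step: Optional[bool] = None,
    delay_grad_averaging: bool = False,
) -> None:
    """Raise ValueError when a documented flag dependency is explicitly violated."""
    # delay_optimizer_step requires offload_optimizer.
    if delay_optimizer_step and offload_optimizer is False:
        raise ValueError("delay_optimizer_step requires offload_optimizer")
    # delay_grad_averaging requires offload_optimizer and delay_optimizer_step.
    if delay_grad_averaging and (
        offload_optimizer is False or delay_optimizer_step is False
    ):
        raise ValueError(
            "delay_grad_averaging requires offload_optimizer and delay_optimizer_step"
        )


def local_batches_per_step(
    target_batch_size: int, batch_size: int, num_peers: int = 1
) -> int:
    """Estimate local batches per peer before one global step.

    Assumption: every peer processes batches of `batch_size` at the same speed.
    """
    if min(target_batch_size, batch_size, num_peers) <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(target_batch_size / (batch_size * num_peers))
```

For instance, with target_batch_size=8192, a local batch_size of 32, and 4 equally fast peers, each peer would accumulate roughly 64 local batches between global steps.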

all_gather(tensor, group=None, sync_grads=False)

Perform an all_gather on all processes.

Parameters

tensor (Tensor) – the tensor to all_gather
group (Optional[Any]) – the process group to gather results from
sync_grads (bool) – flag that allows users to synchronize gradients for the all_gather op

Return type

Tensor

barrier(*args, **kwargs)

Synchronizes all processes, blocking each one until the whole group enters this function.

Parameters: name – an optional name to pass into barrier.
Return type: None

broadcast(obj, src=0)

Broadcasts an object to all processes.

Parameters

obj (TypeVar(TBroadcast)) – the object to broadcast
src (int) – source rank

Return type

TypeVar(TBroadcast)

model_to_device()

Moves the model to the correct device.

Return type: None

on_train_batch_start(batch, batch_idx, dataloader_idx=0)

Called in the training loop before anything happens for that batch.

Return type: None

reduce(tensor, *args, **kwargs)

Reduces the given tensor (e.g. across GPUs/processes).

Parameters

tensor (Union[Any, Tensor]) – the tensor to sync and reduce
group – the process group to reduce
reduce_op – the reduction operation. Defaults to ‘mean’. Can also be the string ‘sum’ or a ReduceOp.

Return type

Union[Any, Tensor]
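To make the ‘mean’ and ‘sum’ options concrete, the following toy sketch shows what such a reduction computes in general when each process contributes one value. It is a plain-Python illustration with a hypothetical helper name (reduce_values), not HivemindStrategy's implementation, and it does not model process groups or ReduceOp objects.

```python
from typing import Sequence


def reduce_values(values: Sequence[float], reduce_op: str = "mean") -> float:
    """Combine one value per process into a single result.

    `values` stands in for the per-process tensors; only the string
    operations 'mean' and 'sum' are modeled here.
    """
    if not values:
        raise ValueError("need at least one value to reduce")
    total = sum(values)
    if reduce_op == "sum":
        return total
    if reduce_op == "mean":
        return total / len(values)
    raise ValueError(f"unsupported reduce_op: {reduce_op!r}")
```

With a single contributing process, both operations return the input value unchanged.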

setup(trainer)

Sets up plugins for the trainer fit and creates optimizers.

Parameters: trainer (Trainer) – the trainer instance
Return type: None

teardown()

This method is called to tear down the training process.

It is the right place to release memory and free other resources.

Return type: None

property is_global_zero: bool

Whether the current process is the rank zero process not only on the local node, but for all nodes.

Return type: bool
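A common use of this property is to guard side effects such as logging or writing files so that they run only once. The sketch below assumes only that the object passed in exposes an is_global_zero attribute; the helper name run_on_global_zero is hypothetical.

```python
from typing import Any, Callable, Optional


def run_on_global_zero(strategy: Any, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Optional[Any]:
    """Call `fn` only when `strategy.is_global_zero` is true.

    Returns the result of `fn`, or None when the call is skipped.
    """
    if strategy.is_global_zero:
        return fn(*args, **kwargs)
    return None
```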

property root_device: torch.device

Returns the root device.

Return type: torch.device