Callbacks¶
Lightning has a callback system to execute arbitrary code. Callbacks should capture NON-ESSENTIAL
logic that is NOT required for your LightningModule
to run.
An overall Lightning system should have:
Trainer for all engineering
LightningModule for all research code.
Callbacks for non-essential code.
Example:
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import Callback

class MyPrintingCallback(Callback):

    def on_init_start(self, trainer):
        print('Starting to init trainer!')

    def on_init_end(self, trainer):
        print('trainer is init now')

    def on_train_end(self, trainer, pl_module):
        print('do something when training ends')

trainer = Trainer(callbacks=[MyPrintingCallback()])

Starting to init trainer!
trainer is init now
We successfully extended functionality without polluting our super clean LightningModule research code.
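To make the hook mechanism concrete, here is a minimal sketch of how a trainer could dispatch hooks to its callbacks. It does not use Lightning at all: `Callback` below is a stand-in base class, and `MiniTrainer` and `RecordingCallback` are hypothetical names invented for this illustration, not Lightning's implementation.

```python
class Callback:
    """Stand-in base class: every hook is a no-op by default."""

    def on_init_start(self, trainer):
        pass

    def on_init_end(self, trainer):
        pass

    def on_train_end(self, trainer, pl_module):
        pass


class RecordingCallback(Callback):
    """Hypothetical callback that records which hooks fired."""

    def __init__(self):
        self.events = []

    def on_init_start(self, trainer):
        self.events.append('init_start')

    def on_init_end(self, trainer):
        self.events.append('init_end')

    def on_train_end(self, trainer, pl_module):
        self.events.append('train_end')


class MiniTrainer:
    """Hypothetical trainer that only dispatches hooks; it does no training."""

    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])
        self._dispatch('on_init_start')
        self._dispatch('on_init_end')

    def _dispatch(self, hook_name, *args):
        # Call the same hook on every registered callback, in list order.
        for callback in self.callbacks:
            getattr(callback, hook_name)(self, *args)

    def fit(self, pl_module):
        # A real trainer would run the training loop here.
        self._dispatch('on_train_end', pl_module)


recorder = RecordingCallback()
trainer = MiniTrainer(callbacks=[recorder])
trainer.fit(pl_module=object())
print(recorder.events)  # ['init_start', 'init_end', 'train_end']
```

Because the trainer owns the dispatch loop, the callback never needs to call its own hooks, which is the reason for the "do not manually call methods" advice.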
Best Practices¶
The following are best practices when using/designing callbacks.
Callbacks should be isolated in their functionality.
Your callback should not rely on the behavior of other callbacks in order to work properly.
Do not manually call methods from the callback.
Directly calling methods (e.g. on_validation_end) is strongly discouraged.
Whenever possible, your callbacks should not depend on the order in which they are executed.
Callback Base¶
Abstract base class used to build new callbacks.
-
class
pytorch_lightning.callbacks.base.
Callback
[source] Bases:
abc.ABC
Abstract base class used to build new callbacks.
-
on_batch_end
(trainer, pl_module)[source] Called when the training batch ends.
-
on_batch_start
(trainer, pl_module)[source] Called when the training batch begins.
-
on_epoch_end
(trainer, pl_module)[source] Called when the epoch ends.
-
on_epoch_start
(trainer, pl_module)[source] Called when the epoch begins.
-
on_fit_end
(trainer)[source] Called when fit ends
-
on_fit_start
(trainer)[source] Called when fit begins
-
on_init_end
(trainer)[source] Called when the trainer initialization ends, model has not yet been set.
-
on_init_start
(trainer)[source] Called when the trainer initialization begins, model has not yet been set.
-
on_keyboard_interrupt
(trainer, pl_module)[source] Called when the training is interrupted by KeyboardInterrupt.
-
on_sanity_check_end
(trainer, pl_module)[source] Called when the validation sanity check ends.
-
on_sanity_check_start
(trainer, pl_module)[source] Called when the validation sanity check starts.
-
on_test_batch_end
(trainer, pl_module)[source] Called when the test batch ends.
-
on_test_batch_start
(trainer, pl_module)[source] Called when the test batch begins.
-
on_test_end
(trainer, pl_module)[source] Called when the test ends.
-
on_test_start
(trainer, pl_module)[source] Called when the test begins.
-
on_train_end
(trainer, pl_module)[source] Called when the train ends.
-
on_train_start
(trainer, pl_module)[source] Called when the train begins.
-
on_validation_batch_end
(trainer, pl_module)[source] Called when the validation batch ends.
-
on_validation_batch_start
(trainer, pl_module)[source] Called when the validation batch begins.
-
on_validation_end
(trainer, pl_module)[source] Called when the validation loop ends.
-
on_validation_start
(trainer, pl_module)[source] Called when the validation loop begins.
-
setup
(trainer, stage)[source] Called when fit or test begins
-
teardown
(trainer, stage)[source] Called when fit or test ends
-
Early Stopping¶
Monitor a validation metric and stop training when it stops improving.
-
class
pytorch_lightning.callbacks.early_stopping.
EarlyStopping
(monitor='val_loss', min_delta=0.0, patience=3, verbose=False, mode='auto', strict=True)[source] Bases:
pytorch_lightning.callbacks.base.Callback
- Parameters
monitor¶ (str) – quantity to be monitored. Default: 'val_loss'.
min_delta¶ (float) – minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than min_delta will count as no improvement. Default: 0.0.
patience¶ (int) – number of validation epochs with no improvement after which training will be stopped. Default: 3.
mode¶ (str) – one of {auto, min, max}. In min mode, training will stop when the quantity monitored has stopped decreasing; in max mode it will stop when the quantity monitored has stopped increasing; in auto mode, the direction is automatically inferred from the name of the monitored quantity. Default: 'auto'.
strict¶ (bool) – whether to crash the training if monitor is not found in the validation metrics. Default: True.
Example:
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import EarlyStopping
>>> early_stopping = EarlyStopping('val_loss')
>>> trainer = Trainer(early_stop_callback=early_stopping)
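The interaction between min_delta, patience and mode can be sketched in plain Python. This is a simplified illustration of the rule described in the parameter list, not the library's code: the function name `epochs_until_stop` is hypothetical, the auto mode is left out, and treating a change exactly equal to min_delta as "no improvement" is an assumption.

```python
def epochs_until_stop(values, min_delta=0.0, patience=3, mode='min'):
    """Return the index of the validation epoch at which training would stop,
    or None if the stopping condition is never met."""
    if mode not in ('min', 'max'):
        raise ValueError("mode must be 'min' or 'max' in this sketch")
    best = None
    wait = 0  # consecutive validation epochs without improvement
    for index, value in enumerate(values):
        if best is None:
            improved = True
        elif mode == 'min':
            improved = value < best - min_delta
        else:
            improved = value > best + min_delta
        if improved:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return index
    return None


val_losses = [1.0, 0.8, 0.79, 0.81, 0.80, 0.9]
print(epochs_until_stop(val_losses, patience=3))                  # 5
print(epochs_until_stop(val_losses, min_delta=0.05, patience=3))  # 4
```

With min_delta=0.05 the drop from 0.8 to 0.79 no longer counts as an improvement, so the patience counter starts one epoch earlier.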
-
_validate_condition_metric
(logs)[source] Checks that the condition metric for early stopping is good
- Parameters
logs¶ – callback metrics from validation output
- Returns
True if specified metric is available
-
on_sanity_check_end
(trainer, pl_module)[source] Called when the validation sanity check ends.
-
on_train_end
(trainer, pl_module)[source] Called when the train ends.
-
on_validation_end
(trainer, pl_module)[source] Called when the validation loop ends.
Gradient Accumulator¶
Change gradient accumulation factor according to scheduling.
-
class
pytorch_lightning.callbacks.gradient_accumulation_scheduler.
GradientAccumulationScheduler
(scheduling)[source] Bases:
pytorch_lightning.callbacks.base.Callback
Change gradient accumulation factor according to scheduling.
Example:
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import GradientAccumulationScheduler

# at epoch 5 start accumulating every 2 batches
>>> accumulator = GradientAccumulationScheduler(scheduling={5: 2})
>>> trainer = Trainer(callbacks=[accumulator])

# alternatively, pass the scheduling dict directly to the Trainer
>>> trainer = Trainer(accumulate_grad_batches={5: 2})
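One way to read a scheduling dict is as a step function from epoch to accumulation factor. The sketch below is a plain-Python illustration under two assumptions that the text above does not state: the factor is 1 before the first scheduled epoch, and each entry stays in effect until a later entry replaces it. The helper name `accumulation_factor` is hypothetical, and whether the library counts epochs from 0 or 1 is not covered here.

```python
def accumulation_factor(scheduling, epoch):
    """Return the accumulation factor in effect at `epoch`."""
    factor = 1  # assumed default before the first scheduled epoch
    for start_epoch in sorted(scheduling):
        if epoch >= start_epoch:
            factor = scheduling[start_epoch]
    return factor


schedule = {5: 2, 10: 4}
factors = [accumulation_factor(schedule, e) for e in (1, 5, 9, 10, 20)]
print(factors)  # [1, 2, 2, 4, 4]
```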
-
on_epoch_start
(trainer, pl_module)[source] Called when the epoch begins.
-
Learning Rate Logger¶
Log learning rate for lr schedulers during training
-
class
pytorch_lightning.callbacks.lr_logger.
LearningRateLogger
[source] Bases:
pytorch_lightning.callbacks.base.Callback
Automatically logs learning rate for learning rate schedulers during training.
Example:
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import LearningRateLogger
>>> lr_logger = LearningRateLogger()
>>> trainer = Trainer(callbacks=[lr_logger])
Logging names are automatically determined based on the optimizer class name. In case of multiple optimizers of the same type, they will be named Adam, Adam-1 etc. If an optimizer has multiple parameter groups, they will be named Adam/pg1, Adam/pg2 etc. To control naming, pass in a name keyword in the construction of the learning rate schedulers.
Example:
def configure_optimizers(self):
    optimizer = torch.optim.Adam(...)
    lr_scheduler = {'scheduler': torch.optim.lr_scheduler.LambdaLR(optimizer, ...),
                    'name': 'my_logging_name'}
    return [optimizer], [lr_scheduler]
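The automatic naming scheme described above can be illustrated without Lightning. The sketch below takes (class name, number of parameter groups) pairs and produces names in the documented style. The helper `lr_log_names` is hypothetical, and how the library combines a duplicate suffix with a parameter-group suffix (for example Adam-1/pg1) is an assumption of this sketch.

```python
from collections import Counter


def lr_log_names(optimizers):
    """`optimizers` is a list of (class_name, num_param_groups) pairs."""
    seen = Counter()
    names = []
    for class_name, num_groups in optimizers:
        count = seen[class_name]
        seen[class_name] += 1
        # The first optimizer of a type keeps the bare name; later ones get -1, -2, ...
        base = class_name if count == 0 else f'{class_name}-{count}'
        if num_groups > 1:
            # One name per parameter group, numbered from 1.
            names.extend(f'{base}/pg{i}' for i in range(1, num_groups + 1))
        else:
            names.append(base)
    return names


print(lr_log_names([('Adam', 1), ('Adam', 1), ('SGD', 2)]))
# ['Adam', 'Adam-1', 'SGD/pg1', 'SGD/pg2']
```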
-
_extract_lr
(trainer, interval)[source] Extracts learning rates for lr schedulers and saves information into dict structure.
-
on_batch_start
(trainer, pl_module)[source] Called when the training batch begins.
-
on_epoch_start
(trainer, pl_module)[source] Called when the epoch begins.
-
on_train_start
(trainer, pl_module)[source] Called before training, determines unique names for all lr schedulers in the case of multiple of the same type or in the case of multiple parameter groups
-
Model Checkpointing¶
Automatically save model checkpoints during training.
-
class
pytorch_lightning.callbacks.model_checkpoint.
ModelCheckpoint
(filepath=None, monitor='val_loss', verbose=False, save_last=False, save_top_k=1, save_weights_only=False, mode='auto', period=1, prefix='')[source] Bases:
pytorch_lightning.callbacks.base.Callback
Save the model after every epoch if it improves.
After training finishes, use best_model_path to retrieve the path to the best checkpoint file and best_model_score to retrieve its score.
- Parameters
filepath¶ – path to save the model file. Can contain named formatting options to be auto-filled.
Example:
# custom path
# saves a file like: my/path/epoch_0.ckpt
>>> checkpoint_callback = ModelCheckpoint('my/path/')

# save any arbitrary metrics like `val_loss`, etc. in name
# saves a file like: my/path/epoch=2-val_loss=0.20-other_metric=0.30.ckpt
>>> checkpoint_callback = ModelCheckpoint(
...     filepath='my/path/{epoch}-{val_loss:.2f}-{other_metric:.2f}'
... )
Can also be set to None, then it will be set to default location during trainer construction.
save_last¶ (bool) – always saves the model at the end of the epoch. Default: False.
save_top_k¶ (int) – if save_top_k == k, the best k models according to the quantity monitored will be saved. If save_top_k == 0, no models are saved. If save_top_k == -1, all models are saved. Please note that the monitors are checked every period epochs. If save_top_k >= 2 and the callback is called multiple times inside an epoch, the name of the saved file will be appended with a version count starting with v0.
mode¶ (str) – one of {auto, min, max}. If save_top_k != 0, the decision to overwrite the current save file is made based on either the maximization or the minimization of the monitored quantity. For val_acc, this should be max, for val_loss this should be min, etc. In auto mode, the direction is automatically inferred from the name of the monitored quantity.
save_weights_only¶ (bool) – if True, then only the model’s weights will be saved (model.save_weights(filepath)), else the full model is saved (model.save(filepath)).
period¶ (int) – Interval (number of epochs) between checkpoints.
Example:
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import ModelCheckpoint

# saves checkpoints to 'my/path/' whenever 'val_loss' has a new min
>>> checkpoint_callback = ModelCheckpoint(filepath='my/path/')
>>> trainer = Trainer(checkpoint_callback=checkpoint_callback)

# save epoch and val_loss in name
# saves a file like: my/path/sample-mnist_epoch=02-val_loss=0.32.ckpt
>>> checkpoint_callback = ModelCheckpoint(
...     filepath='my/path/sample-mnist_{epoch:02d}-{val_loss:.2f}'
... )

# retrieve the best checkpoint after training
checkpoint_callback = ModelCheckpoint(filepath='my/path/')
trainer = Trainer(checkpoint_callback=checkpoint_callback)
model = ...
trainer.fit(model)
checkpoint_callback.best_model_path
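The save_top_k rule can be illustrated with a small bookkeeping function. This is a sketch of the behaviour described in the parameter list, not the library's code: `track_checkpoint` and its return convention are hypothetical, no files are touched, and the period and version-suffix details are left out.

```python
def track_checkpoint(kept, path, score, save_top_k, mode='min'):
    """Decide whether to keep a new checkpoint.

    `kept` maps checkpoint path -> monitored score and is updated in place.
    Returns (saved, evicted_path).
    """
    if save_top_k == 0:
        return False, None  # nothing is ever saved
    if save_top_k == -1 or len(kept) < save_top_k:
        kept[path] = score  # still room, or keep everything
        return True, None
    # Find the worst checkpoint currently kept.
    pick_worst = max if mode == 'min' else min
    worst_path = pick_worst(kept, key=kept.get)
    if mode == 'min':
        is_better = score < kept[worst_path]
    else:
        is_better = score > kept[worst_path]
    if not is_better:
        return False, None
    del kept[worst_path]
    kept[path] = score
    return True, worst_path


kept = {}
results = [
    track_checkpoint(kept, 'a.ckpt', 0.5, save_top_k=2),
    track_checkpoint(kept, 'b.ckpt', 0.4, save_top_k=2),
    track_checkpoint(kept, 'c.ckpt', 0.6, save_top_k=2),
    track_checkpoint(kept, 'd.ckpt', 0.3, save_top_k=2),
]
print(results[-1])  # (True, 'a.ckpt')
print(kept)         # {'b.ckpt': 0.4, 'd.ckpt': 0.3}
```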
-
format_checkpoint_name
(epoch, metrics, ver=None)[source] Generate a filename according to the defined template.
Example:
>>> import os
>>> tmpdir = os.path.dirname(__file__)
>>> ckpt = ModelCheckpoint(os.path.join(tmpdir, '{epoch}'))
>>> os.path.basename(ckpt.format_checkpoint_name(0, {}))
'epoch=0.ckpt'
>>> ckpt = ModelCheckpoint(os.path.join(tmpdir, '{epoch:03d}'))
>>> os.path.basename(ckpt.format_checkpoint_name(5, {}))
'epoch=005.ckpt'
>>> ckpt = ModelCheckpoint(os.path.join(tmpdir, '{epoch}-{val_loss:.2f}'))
>>> os.path.basename(ckpt.format_checkpoint_name(2, dict(val_loss=0.123456)))
'epoch=2-val_loss=0.12.ckpt'
>>> ckpt = ModelCheckpoint(os.path.join(tmpdir, '{missing:d}'))
>>> os.path.basename(ckpt.format_checkpoint_name(0, {}))
'missing=0.ckpt'
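The doctest outputs above suggest that each `{name}` placeholder is expanded to `name=value` and that unknown metrics fall back to 0. The following standalone sketch reproduces those outputs with `re` and `str.format`. The function `format_name` is a hypothetical stand-in written for this illustration, and it ignores the prefix and ver arguments.

```python
import re


def format_name(template, epoch, metrics):
    """Expand '{name}' placeholders to 'name=value' and append '.ckpt'."""
    names = re.findall(r'\{(\w+)', template)
    # Metrics that are referenced but missing default to 0, as in the doctest.
    values = {name: 0 for name in names}
    values.update(metrics)
    values['epoch'] = epoch
    # Turn '{epoch:03d}' into 'epoch={epoch:03d}' so the name appears in the file name.
    labelled = re.sub(r'\{(\w+)', r'\1={\1', template)
    return labelled.format(**values) + '.ckpt'


print(format_name('{epoch}', 0, {}))                               # epoch=0.ckpt
print(format_name('{epoch:03d}', 5, {}))                           # epoch=005.ckpt
print(format_name('{epoch}-{val_loss:.2f}', 2, {'val_loss': 0.123456}))  # epoch=2-val_loss=0.12.ckpt
print(format_name('{missing:d}', 0, {}))                           # missing=0.ckpt
```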
-
on_train_start
(trainer, pl_module)[source] Determine model checkpoint save directory at runtime. References attributes from the Trainer’s logger to determine where to save checkpoints.
-
on_validation_end
(trainer, pl_module)[source] Called when the validation loop ends.
Progress Bars¶
Use or override one of the progress bar callbacks.
-
class
pytorch_lightning.callbacks.progress.
ProgressBar
(refresh_rate=1, process_position=0)[source] Bases:
pytorch_lightning.callbacks.progress.ProgressBarBase
This is the default progress bar used by Lightning. It prints to stdout using the tqdm package and shows up to four different bars:
sanity check progress: the progress during the sanity check run
main progress: shows training + validation progress combined. It also accounts for multiple validation runs during training when val_check_interval is used.
validation progress: only visible during validation; shows total progress over all validation datasets.
test progress: only active when testing; shows total progress over all test datasets.
For infinite datasets, the progress bar never ends.
If you want to customize the default tqdm progress bars used by Lightning, you can override specific methods of the callback class and pass your custom implementation to the Trainer:
Example:
class LitProgressBar(ProgressBar):

    def init_validation_tqdm(self):
        bar = super().init_validation_tqdm()
        bar.set_description('running validation ...')
        return bar

bar = LitProgressBar()
trainer = Trainer(callbacks=[bar])
- Parameters
refresh_rate¶ (int) – Determines at which rate (in number of batches) the progress bars get updated. Set it to 0 to disable the display. By default, the Trainer uses this implementation of the progress bar and sets the refresh rate to the value provided to the progress_bar_refresh_rate argument in the Trainer.
process_position¶ (int) – Set this to a value greater than 0 to offset the progress bars by this many lines. This is useful when you have progress bars defined elsewhere and want to show all of them together. This corresponds to process_position in the Trainer.
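As a rough illustration of what a refresh rate "in number of batches" means, the sketch below decides for each batch index whether a bar would be redrawn. The helper `should_refresh` is hypothetical, and the modulo rule is an assumption consistent with the description above rather than the library's exact condition.

```python
def should_refresh(batch_idx, refresh_rate):
    """Return True if the bar should be redrawn after batch number `batch_idx`."""
    if refresh_rate <= 0:
        return False  # a rate of 0 disables the display
    return batch_idx % refresh_rate == 0


refreshed = [i for i in range(1, 11) if should_refresh(i, 5)]
print(refreshed)  # [5, 10]
```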
-
disable
()[source] You should provide a way to disable the progress bar. The Trainer will call this to disable the output on processes that have a rank different from 0, e.g., in multi-node training.
- Return type
None
-
enable
()[source] You should provide a way to enable the progress bar. The Trainer will call this in e.g. pre-training routines like the learning rate finder to temporarily enable and disable the main progress bar.
- Return type
None
-
init_sanity_tqdm
()[source] Override this to customize the tqdm bar for the validation sanity run.
- Return type
tqdm
-
init_test_tqdm
()[source] Override this to customize the tqdm bar for testing.
- Return type
tqdm
-
init_train_tqdm
()[source] Override this to customize the tqdm bar for training.
- Return type
tqdm
-
init_validation_tqdm
()[source] Override this to customize the tqdm bar for validation.
- Return type
tqdm
-
on_batch_end
(trainer, pl_module)[source] Called when the training batch ends.
-
on_epoch_start
(trainer, pl_module)[source] Called when the epoch begins.
-
on_sanity_check_end
(trainer, pl_module)[source] Called when the validation sanity check ends.
-
on_sanity_check_start
(trainer, pl_module)[source] Called when the validation sanity check starts.
-
on_test_batch_end
(trainer, pl_module)[source] Called when the test batch ends.
-
on_test_end
(trainer, pl_module)[source] Called when the test ends.
-
on_test_start
(trainer, pl_module)[source] Called when the test begins.
-
on_train_end
(trainer, pl_module)[source] Called when the train ends.
-
on_train_start
(trainer, pl_module)[source] Called when the train begins.
-
on_validation_batch_end
(trainer, pl_module)[source] Called when the validation batch ends.
-
on_validation_end
(trainer, pl_module)[source] Called when the validation loop ends.
-
on_validation_start
(trainer, pl_module)[source] Called when the validation loop begins.
-
class
pytorch_lightning.callbacks.progress.
ProgressBarBase
[source] Bases:
pytorch_lightning.callbacks.base.Callback
The base class for progress bars in Lightning. It is a Callback that keeps track of the batch progress in the Trainer. You should implement your highly custom progress bars with this as the base class.
Example:
import sys

class LitProgressBar(ProgressBarBase):

    def __init__(self):
        super().__init__()  # don't forget this :)
        # stored under a different name so the enable() method is not shadowed
        self.enabled = True

    def enable(self):
        self.enabled = True

    def disable(self):
        self.enabled = False

    def on_batch_end(self, trainer, pl_module):
        super().on_batch_end(trainer, pl_module)  # don't forget this :)
        percent = (self.train_batch_idx / self.total_train_batches) * 100
        sys.stdout.write(f'{percent:.01f} percent complete \r')
        sys.stdout.flush()

bar = LitProgressBar()
trainer = Trainer(callbacks=[bar])
-
disable
()[source] You should provide a way to disable the progress bar. The
Trainer
will call this to disable the output on processes that have a rank different from 0, e.g., in multi-node training.
-
enable
()[source] You should provide a way to enable the progress bar. The
Trainer
will call this in e.g. pre-training routines like the learning rate finder to temporarily enable and disable the main progress bar.
-
on_batch_end
(trainer, pl_module)[source] Called when the training batch ends.
-
on_epoch_start
(trainer, pl_module)[source] Called when the epoch begins.
-
on_init_end
(trainer)[source] Called when the trainer initialization ends, model has not yet been set.
-
on_test_batch_end
(trainer, pl_module)[source] Called when the test batch ends.
-
on_test_start
(trainer, pl_module)[source] Called when the test begins.
-
on_train_start
(trainer, pl_module)[source] Called when the train begins.
-
on_validation_batch_end
(trainer, pl_module)[source] Called when the validation batch ends.
-
on_validation_start
(trainer, pl_module)[source] Called when the validation loop begins.
-
property
test_batch_idx
[source] The current batch index being processed during testing. Use this to update your progress bar.
- Return type
-
property
total_test_batches
[source] The total number of test batches during testing, which may change from epoch to epoch. Use this to set the total number of iterations in the progress bar. Can return inf if the test dataloader is of infinite size.
- Return type
-
property
total_train_batches
[source] The total number of training batches during training, which may change from epoch to epoch. Use this to set the total number of iterations in the progress bar. Can return inf if the training dataloader is of infinite size.
- Return type
-
property
total_val_batches
[source] The total number of validation batches during validation, which may change from epoch to epoch. Use this to set the total number of iterations in the progress bar. Can return inf if the validation dataloader is of infinite size.
- Return type
-
property
train_batch_idx
[source] The current batch index being processed during training. Use this to update your progress bar.
- Return type
-
-
pytorch_lightning.callbacks.progress.
convert_inf
(x)[source] tqdm doesn’t support inf values, so they have to be converted to None.
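A minimal sketch of such a conversion, written from the one-line description above rather than from the library source. The name `convert_inf_sketch` is hypothetical, and also mapping negative infinity to None is an assumption of this sketch.

```python
import math


def convert_inf_sketch(x):
    """Map infinite totals to None so they can be passed to tqdm as `total`."""
    if isinstance(x, float) and math.isinf(x):
        return None
    return x


print(convert_inf_sketch(float('inf')))  # None
print(convert_inf_sketch(128))           # 128
```

Passing `total=None` to tqdm gives a bar with no known length, which matches the note that the progress bar never ends for infinite datasets.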