
Result

Lightning has two result objects: TrainResult and EvalResult.

Use these to control:

  • When to log (each step and/or epoch aggregate).

  • Where to log (progress bar or a logger).

  • How to sync across accelerators.
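These three controls map onto keyword arguments of result.log. As a rough schematic (plain Python, not Lightning internals), the flags route a metric like this:

```python
def log_routes(on_step=True, on_epoch=False, logger=True, prog_bar=False):
    """Schematic of where a single result.log call sends a metric.

    Illustration of the documented flags only, not Lightning's actual
    implementation: on_step/on_epoch pick *when* a value is recorded,
    logger/prog_bar pick *where* it is shown.
    """
    whens = [w for w, flag in (("step", on_step), ("epoch", on_epoch)) if flag]
    wheres = [d for d, flag in (("logger", logger), ("progress_bar", prog_bar)) if flag]
    return [(w, d) for w in whens for d in wheres]

# TrainResult defaults: each step, logger only
log_routes()                              # [('step', 'logger')]

# EvalResult defaults: epoch aggregate, logger only
log_routes(on_step=False, on_epoch=True)  # [('epoch', 'logger')]
```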


Training loop example

Return a TrainResult from the training loop.

def training_step(self, batch_subset, batch_idx):
    loss = ...
    result = pl.TrainResult(minimize=loss)
    result.log('train_loss', loss, prog_bar=True)
    return result

If you’d like to do something special with the outputs other than logging, implement training_epoch_end.

def training_step(self, batch, batch_idx):
    result = pl.TrainResult(loss)
    result.some_prediction = some_prediction
    return result

def training_epoch_end(self, training_step_output_result):
    all_train_predictions = training_step_output_result.some_prediction

    training_step_output_result.some_new_prediction = some_new_prediction
    return training_step_output_result

Validation/Test loop example

Return an EvalResult object from the validation/test loop.

def validation_step(self, batch, batch_idx):
    some_metric = ...
    result = pl.EvalResult(checkpoint_on=some_metric)
    result.log('some_metric', some_metric, prog_bar=True)
    return result

If you’d like to do something special with the outputs other than logging, implement validation_epoch_end.

def validation_step(self, batch, batch_idx):
    result = pl.EvalResult(checkpoint_on=some_metric)
    result.a_prediction = some_prediction
    return result

def validation_epoch_end(self, validation_step_output_result):
    all_validation_step_predictions = validation_step_output_result.a_prediction
    # do something with the predictions from all validation_steps

    return validation_step_output_result



TrainResult

Basic usage of the TrainResult:

minimize

def training_step(...):
    return TrainResult(some_metric)

checkpoint/early_stop

If you are only using a training loop (no val), you can also specify what to monitor for checkpointing or early stopping:

def training_step(...):
    return TrainResult(some_metric, checkpoint_on=metric_a, early_stop_on=metric_b)

In the manual loop, checkpointing and early stopping are based only on the returned loss. With the TrainResult you can change the monitored value every batch, or even monitor different metrics for each purpose.

# early stop + checkpoint can only use the `loss` when done manually via dictionaries
def training_step(...):
    return loss

def training_step(...):
    return {'loss': loss}

logging

The main benefit of the TrainResult is automatic logging at whatever level you want.

result = TrainResult(loss)
result.log('train_loss', loss)

# equivalent
result.log('train_loss', loss, on_step=True, on_epoch=False, logger=True, prog_bar=False, reduce_fx=torch.mean)

By default, any log call logs only that step’s metric to the logger. To change when and where to log, update the defaults as needed.

Change where to log:

# to logger only (default)
result.log('train_loss', loss)

# logger + progress bar
result.log('train_loss', loss, prog_bar=True)

# progress bar only
result.log('train_loss', loss, prog_bar=True, logger=False)

Sometimes you may also want to get epoch level statistics:

# loss at this step
result.log('train_loss', loss)

# loss for the epoch
result.log('train_loss', loss, on_step=False, on_epoch=True)

# loss for the epoch AND step
# the logger will show 2 charts: step_train_loss, epoch_train_loss
result.log('train_loss', loss, on_epoch=True)

Finally, you can use your own reduction function instead:

# the total sum for all batches of an epoch
result.log('train_loss', loss, on_epoch=True, reduce_fx=torch.sum)

def my_reduce_fx(all_train_loss):
    # reduce the collected per-step losses however you like
    return all_train_loss.mean()

result.log('train_loss', loss, on_epoch=True, reduce_fx=my_reduce_fx)
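Conceptually, on_epoch=True collects the value logged at every step and applies reduce_fx once at epoch end. A minimal plain-Python sketch of that aggregation (Lightning does this for you; statistics.mean stands in for torch.mean):

```python
import statistics

def epoch_aggregate(step_values, reduce_fx=statistics.mean):
    """Mimic on_epoch=True: gather the per-step values, reduce once.

    reduce_fx defaults to a mean, matching the torch.mean default in
    the real API; pass sum (or your own callable) for a different
    reduction.
    """
    return reduce_fx(step_values)

step_losses = [0.9, 0.7, 0.5, 0.3]
epoch_aggregate(step_losses)        # mean of the epoch
epoch_aggregate(step_losses, sum)   # total for the epoch
```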

Note

Use this only when your loop is simple and only logs.

Finally, you may need logger-specific logging, such as images:

def training_step(...):
    result = TrainResult(some_metric)
    result.log('train_loss', loss)

    # also log images via your logger's native API
    self.logger.experiment.log_figure(...)
    return result

Sync across devices

When training on multiple GPUs/CPUs/TPU cores, calculate the global mean of a logged metric as follows:

result.log('train_loss', loss, sync_dist=True)
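With the default sync_dist_op='mean', each device's value is all-reduced and averaged. A toy illustration of the arithmetic (a plain list stands in for the distributed all-reduce; real code never sees this list):

```python
def synced_mean(per_device_losses):
    """What sync_dist=True with sync_dist_op='mean' computes: the
    average of the metric across devices. Purely illustrative; the
    reduction actually happens inside the distributed backend."""
    return sum(per_device_losses) / len(per_device_losses)

# e.g. the same step's loss on 4 GPUs
synced_mean([0.5, 0.75, 0.25, 0.5])  # -> 0.5
```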

TrainResult API

class pytorch_lightning.core.step_result.TrainResult(minimize=None, early_stop_on=None, checkpoint_on=None, hiddens=None)[source]

Bases: pytorch_lightning.core.step_result.Result

Used in the train loop to auto-log to a logger or progress bar without needing to define a training_step_end or training_epoch_end method

Example:

def training_step(self, batch, batch_idx):
    loss = ...
    result = pl.TrainResult(loss)
    result.log('train_loss', loss)
    return result

# without val/test loop can model checkpoint or early stop
def training_step(self, batch, batch_idx):
    loss = ...
    result = pl.TrainResult(loss, early_stop_on=loss, checkpoint_on=loss)
    result.log('train_loss', loss)
    return result
Parameters
  • minimize (Optional[Tensor]) – metric to minimize (the loss)

  • early_stop_on (Optional[Tensor]) – metric to early stop on

  • checkpoint_on (Optional[Tensor]) – metric to checkpoint on

  • hiddens (Optional[Tensor]) – hiddens to pass to the next step (truncated BPTT)
log(name, value, prog_bar=False, logger=True, on_step=True, on_epoch=False, reduce_fx=torch.mean, tbptt_reduce_fx=torch.mean, tbptt_pad_token=0, enable_graph=False, sync_dist=False, sync_dist_op='mean', sync_dist_group=None)[source]

Log a key/value pair

Example:

result.log('train_loss', loss)

# defaults used
result.log(
    name,
    value,
    on_step=True,
    on_epoch=False,
    logger=True,
    prog_bar=False,
    reduce_fx=torch.mean,
    enable_graph=False
)
Parameters
  • name – key name

  • value – metric value

  • prog_bar (bool) – if True, logs to the progress bar

  • logger (bool) – if True, logs to the logger

  • on_step (bool) – if True, logs the metric at each step

  • on_epoch (bool) – if True, logs the metric aggregated over the epoch

  • reduce_fx (Callable) – torch.mean by default

  • tbptt_reduce_fx (Callable) – function to reduce on truncated back-propagation

  • tbptt_pad_token (int) – token to use for padding

  • enable_graph (bool) – if True, will not auto-detach the graph

  • sync_dist (bool) – if True, reduces the metric across GPUs/TPUs

  • sync_dist_op (Union[Any, str]) – the op to sync with (e.g. 'mean')

  • sync_dist_group (Optional[Any]) – the DDP group

log_dict(dictionary, prog_bar=False, logger=True, on_step=False, on_epoch=True, reduce_fx=torch.mean, tbptt_reduce_fx=torch.mean, tbptt_pad_token=0, enable_graph=False, sync_dist=False, sync_dist_op='mean', sync_dist_group=None)[source]

Log a dictionary of values at once

Example:

values = {'loss': loss, 'acc': acc, ..., 'metric_n': metric_n}
result.log_dict(values)
Parameters
  • dictionary (dict) – key/value pairs (str, tensors)

  • prog_bar (bool) – if True, logs to the progress bar

  • logger (bool) – if True, logs to the logger

  • on_step (bool) – if True, logs the metric at each step

  • on_epoch (bool) – if True, logs the metric aggregated over the epoch

  • reduce_fx (Callable) – torch.mean by default

  • tbptt_reduce_fx (Callable) – function to reduce on truncated back-propagation

  • tbptt_pad_token (int) – token to use for padding

  • enable_graph (bool) – if True, will not auto-detach the graph

  • sync_dist (bool) – if True, reduces the metric across GPUs/TPUs

  • sync_dist_op (Union[Any, str]) – the op to sync with (e.g. 'mean')

  • sync_dist_group (Optional[Any]) – the DDP group


EvalResult

The EvalResult object has the same usage as the TrainResult object.

def validation_step(...):
    return EvalResult()

def test_step(...):
    return EvalResult()

However, there are some differences:

Eval minimize

  • There is no minimize argument (since we don’t learn during validation)

Eval checkpoint/early_stopping

If monitors are defined in both the TrainResult and the EvalResult, the ones in the EvalResult take precedence.

def training_step(...):
    return TrainResult(loss, checkpoint_on=metric, early_stop_on=metric)

# metric_a and metric_b will be used for the callbacks and NOT metric
def validation_step(...):
    return EvalResult(checkpoint_on=metric_a, early_stop_on=metric_b)
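The precedence rule can be read as: the eval monitor wins whenever it is set. Schematically (a hypothetical helper, not Lightning code):

```python
def resolve_monitor(train_monitor, eval_monitor):
    """Pick the value the checkpoint/early-stop callbacks watch when
    both results define one: the EvalResult monitor takes precedence.
    Purely illustrative; Lightning resolves this internally."""
    return eval_monitor if eval_monitor is not None else train_monitor

resolve_monitor(train_monitor=0.8, eval_monitor=0.3)   # -> 0.3
resolve_monitor(train_monitor=0.8, eval_monitor=None)  # -> 0.8
```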

Eval logging

Logging has the same behavior as TrainResult but the logging defaults are different:

# TrainResult logs by default at each step only
TrainResult().log('val', val, on_step=True, on_epoch=False, logger=True, prog_bar=False, reduce_fx=torch.mean)

# EvalResult logs by default at the end of an epoch only
EvalResult().log('val', val, on_step=False, on_epoch=True, logger=True, prog_bar=False, reduce_fx=torch.mean)

Val/Test loop

EvalResult can be used in both test_step and validation_step.
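Because both loops accept the same object, a common pattern is one shared evaluation routine called from both steps. A plain-Python sketch of the shape (dicts stand in for EvalResult so the snippet runs anywhere):

```python
class EvalSketch:
    """Illustrative only: validation_step and test_step delegating to
    one shared routine, with a dict standing in for pl.EvalResult."""

    def _shared_eval(self, batch, prefix):
        loss = sum(batch) / len(batch)   # stand-in for a real loss
        return {f"{prefix}_loss": loss}

    def validation_step(self, batch, batch_idx):
        return self._shared_eval(batch, "val")

    def test_step(self, batch, batch_idx):
        return self._shared_eval(batch, "test")

model = EvalSketch()
model.validation_step([1.0, 3.0], 0)   # {'val_loss': 2.0}
model.test_step([1.0, 3.0], 0)         # {'test_loss': 2.0}
```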

Sync across devices

When training on multiple GPUs/CPUs/TPU cores, calculate the global mean of a logged metric as follows:

result.log('val_loss', loss, sync_dist=True)

EvalResult API

class pytorch_lightning.core.step_result.EvalResult(early_stop_on=None, checkpoint_on=None, hiddens=None)[source]

Bases: pytorch_lightning.core.step_result.Result

Used in the val/test loop to auto-log to a logger or progress bar without needing to define a _step_end or _epoch_end method

Example:

def validation_step(self, batch, batch_idx):
    loss = ...
    result = EvalResult()
    result.log('val_loss', loss)
    return result

def test_step(self, batch, batch_idx):
    loss = ...
    result = EvalResult()
    result.log('test_loss', loss)
    return result
Parameters
  • early_stop_on (Optional[Tensor]) – metric to early stop on

  • checkpoint_on (Optional[Tensor]) – metric to checkpoint on

  • hiddens (Optional[Tensor]) – hiddens to pass to the next step
log(name, value, prog_bar=False, logger=True, on_step=False, on_epoch=True, reduce_fx=torch.mean, tbptt_reduce_fx=torch.mean, tbptt_pad_token=0, enable_graph=False, sync_dist=False, sync_dist_op='mean', sync_dist_group=None)[source]

Log a key/value pair

Example:

result.log('val_loss', loss)

# defaults used
result.log(
    name,
    value,
    on_step=False,
    on_epoch=True,
    logger=True,
    prog_bar=False,
    reduce_fx=torch.mean
)
Parameters
  • name – key name

  • value – metric value

  • prog_bar (bool) – if True, logs to the progress bar

  • logger (bool) – if True, logs to the logger

  • on_step (bool) – if True, logs the metric at each step

  • on_epoch (bool) – if True, logs the metric aggregated over the epoch

  • reduce_fx (Callable) – torch.mean by default

  • tbptt_reduce_fx (Callable) – function to reduce on truncated back-propagation

  • tbptt_pad_token (int) – token to use for padding

  • enable_graph (bool) – if True, will not auto-detach the graph

  • sync_dist (bool) – if True, reduces the metric across GPUs/TPUs

  • sync_dist_op (Union[Any, str]) – the op to sync with (e.g. 'mean')

  • sync_dist_group (Optional[Any]) – the DDP group

log_dict(dictionary, prog_bar=False, logger=True, on_step=False, on_epoch=True, reduce_fx=torch.mean, tbptt_reduce_fx=torch.mean, tbptt_pad_token=0, enable_graph=False, sync_dist=False, sync_dist_op='mean', sync_dist_group=None)[source]

Log a dictionary of values at once

Example:

values = {'loss': loss, 'acc': acc, ..., 'metric_n': metric_n}
result.log_dict(values)
Parameters
  • dictionary (dict) – key/value pairs (str, tensors)

  • prog_bar (bool) – if True, logs to the progress bar

  • logger (bool) – if True, logs to the logger

  • on_step (bool) – if True, logs the metric at each step

  • on_epoch (bool) – if True, logs the metric aggregated over the epoch

  • reduce_fx (Callable) – torch.mean by default

  • tbptt_reduce_fx (Callable) – function to reduce on truncated back-propagation

  • tbptt_pad_token (int) – token to use for padding

  • enable_graph (bool) – if True, will not auto-detach the graph

  • sync_dist (bool) – if True, reduces the metric across GPUs/TPUs

  • sync_dist_op (Union[Any, str]) – the op to sync with (e.g. 'mean')

  • sync_dist_group (Optional[Any]) – the DDP group

write(name, values, filename='predictions.pt')[source]

Add a feature name and value pair to the collection of predictions that will be written to disk on validation_end or test_end. If running on multiple GPUs, you will get n_gpu separate prediction files with the rank prepended onto the filename.

Example:

result = pl.EvalResult()
result.write('ids', [0, 1, 2])
result.write('preds', ['cat', 'dog', 'dog'])
Parameters
  • name (str) – Feature name that will turn into column header of predictions file

  • values (Union[Tensor, list]) – Flat tensor or list of row values for given feature column ‘name’.

  • filename (str) – Filepath where your predictions will be saved. Defaults to ‘predictions.pt’.

write_dict(predictions_dict, filename='predictions.pt')[source]

Calls EvalResult.write() for each key-value pair in predictions_dict.

It is recommended to use this function instead of .write() when you need to store more than one column of predictions in your output file.

Example:

predictions_to_write = {'preds': ['cat', 'dog'], 'ids': tensor([0, 1])}
result.write_dict(predictions_to_write)
Parameters
  • predictions_dict (dict) – dict of predictions to store and then write to filename at eval end

  • filename (str, optional) – file where your predictions will be stored. Defaults to ‘predictions.pt’.
