datamodule

Functions

track_data_hook_calls

A decorator that checks if prepare_data/setup have been called.

Classes

LightningDataModule

A DataModule standardizes the training, val, test splits, data preparation and transforms.

LightningDataModule for loading DataLoaders with ease.

class pytorch_lightning.core.datamodule.LightningDataModule(*args, **kwargs)[source]

Bases: pytorch_lightning.core.hooks.CheckpointHooks, pytorch_lightning.core.hooks.DataHooks

A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.

Example:

from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningDataModule

class MyDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()

    def prepare_data(self):
        # download, split, etc...
        # only called on 1 GPU/TPU in distributed
        ...

    def setup(self, stage=None):
        # make assignments here (val/train/test split)
        # called on every process in DDP
        ...

    def train_dataloader(self):
        train_split = Dataset(...)
        return DataLoader(train_split)

    def val_dataloader(self):
        val_split = Dataset(...)
        return DataLoader(val_split)

    def test_dataloader(self):
        test_split = Dataset(...)
        return DataLoader(test_split)

A DataModule implements 5 key methods:

  • prepare_data (things to do on 1 GPU/TPU not on every GPU/TPU in distributed mode).

  • setup (things to do on every accelerator in distributed mode).

  • train_dataloader the training dataloader.

  • val_dataloader the val dataloader(s).

  • test_dataloader the test dataloader(s).

This allows you to share a full dataset without explaining how to download, split, transform, and process the data.

classmethod add_argparse_args(parent_parser)[source]

Extends an existing argparse parser with the default LightningDataModule attributes.

Return type

ArgumentParser

classmethod from_argparse_args(args, **kwargs)[source]

Create an instance from CLI arguments.

Parameters
  • args (Union[Namespace, ArgumentParser]) – The parser or namespace to take arguments from. Only known arguments will be parsed and passed to the LightningDataModule.

  • **kwargs – Additional keyword arguments that may override ones in the parser or namespace. These must be valid DataModule arguments.
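
The filtering and override behavior described above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: ToyDataModule and build_from_args are hypothetical names, and matching namespace entries against the constructor signature is an assumption about how such filtering might be done. The sketch handles only a Namespace, not an ArgumentParser.

```python
import inspect
from argparse import Namespace


class ToyDataModule:
    """Hypothetical stand-in for a LightningDataModule subclass."""

    def __init__(self, batch_size=1, data_dir="."):
        self.batch_size = batch_size
        self.data_dir = data_dir


def build_from_args(cls, args, **kwargs):
    """Hypothetical helper: keep only arguments that cls.__init__ accepts."""
    accepted = inspect.signature(cls.__init__).parameters
    init_kwargs = {k: v for k, v in vars(args).items() if k in accepted}
    init_kwargs.update(kwargs)  # explicit keyword arguments take precedence
    return cls(**init_kwargs)


# max_epochs is not a constructor argument, so it is silently dropped.
args = Namespace(batch_size=8, data_dir="/tmp/data", max_epochs=3)
dm = build_from_args(ToyDataModule, args, batch_size=16)
print(dm.batch_size, dm.data_dir)
```

Here the keyword argument batch_size=16 wins over the namespace value, while data_dir is taken from the namespace unchanged.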

Example:

from argparse import ArgumentParser

parser = ArgumentParser(add_help=False)
parser = LightningDataModule.add_argparse_args(parser)
args = parser.parse_args()
module = LightningDataModule.from_argparse_args(args)

classmethod from_datasets(train_dataset=None, val_dataset=None, test_dataset=None, batch_size=1, num_workers=0)[source]

Create an instance from torch.utils.data.Dataset.

Parameters
  • train_dataset (Union[Dataset, Sequence[Dataset], Mapping[str, Dataset], None]) – (optional) Dataset to be used for train_dataloader()

  • val_dataset (Union[Dataset, Sequence[Dataset], None]) – (optional) Dataset or list of Dataset to be used for val_dataloader()

  • test_dataset (Union[Dataset, Sequence[Dataset], None]) – (optional) Dataset or list of Dataset to be used for test_dataloader()

  • batch_size (int) – Batch size to use for each dataloader. Default is 1.

  • num_workers (int) – Number of subprocesses to use for data loading. 0 means that the data will be loaded in the main process. A common upper bound is the number of CPUs available.

classmethod get_init_arguments_and_types()[source]

Scans the DataModule signature and returns argument names, types and default values.

Returns

(argument name, set with argument types, argument default value).

Return type

List with tuples of 3 values
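
To make the shape of the return value concrete, here is a standard-library sketch of the same idea. It is an illustration under assumptions, not the library's code: ToyDataModule and init_arguments_and_types are hypothetical, the sketch skips self, and it collects the types in a tuple.

```python
import inspect
from typing import Optional, get_args


class ToyDataModule:
    """Hypothetical class whose constructor signature is inspected."""

    def __init__(self, batch_size: int = 32, data_dir: Optional[str] = None):
        self.batch_size = batch_size
        self.data_dir = data_dir


def init_arguments_and_types(cls):
    """Hypothetical helper returning (name, types, default) per argument."""
    result = []
    for name, param in inspect.signature(cls.__init__).parameters.items():
        if name == "self":
            continue
        # Union/Optional annotations expose their members; plain types do not.
        types = get_args(param.annotation) or (param.annotation,)
        result.append((name, types, param.default))
    return result


info = init_arguments_and_types(ToyDataModule)
for entry in info:
    print(entry)
```

Each entry pairs an argument name with the types found in its annotation and its default value, which is enough to build a matching command-line option.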

abstract prepare_data(*args, **kwargs)[source]

Use this to download and prepare data.

Warning

DO NOT assign state to the model here (use setup instead), since prepare_data is NOT called on every GPU in DDP/TPU.

Example:

def prepare_data(self):
    # good
    download_data()
    tokenize()
    etc()

    # bad
    self.split = data_split
    self.some_state = some_other_state()

In DDP, prepare_data can be called in two ways, selected with Trainer(prepare_data_per_node=...):

  1. Once per node. This is the default and is only called on LOCAL_RANK=0.

  2. Once in total. Only called on GLOBAL_RANK=0.

Example:

# DEFAULT
# called once per node on LOCAL_RANK=0 of that node
Trainer(prepare_data_per_node=True)

# call on GLOBAL_RANK=0 (great for shared file systems)
Trainer(prepare_data_per_node=False)
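
The two modes above reduce to a simple rank check. The following sketch is only an illustration of that rule: should_call_prepare_data is a hypothetical helper, not a Lightning function, and the process layout is invented.

```python
def should_call_prepare_data(local_rank, global_rank, prepare_data_per_node=True):
    """Hypothetical helper mirroring the two modes described above."""
    if prepare_data_per_node:
        # Once per node: the first process on each node prepares the data.
        return local_rank == 0
    # Once in total: only the very first process prepares the data.
    return global_rank == 0


# Two nodes with two processes each, as (local_rank, global_rank) pairs.
processes = [(0, 0), (1, 1), (0, 2), (1, 3)]

per_node = [g for (loc, g) in processes if should_call_prepare_data(loc, g, True)]
once = [g for (loc, g) in processes if should_call_prepare_data(loc, g, False)]
print(per_node, once)
```

With per-node preparation, global ranks 0 and 2 run prepare_data (one per node); with prepare_data_per_node=False, only global rank 0 does, which suits a file system shared by all nodes.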

This is called before requesting the dataloaders:

model.prepare_data()
if ddp/tpu: init()
model.setup(stage)
model.train_dataloader()
model.val_dataloader()
model.test_dataloader()

size(dim=None)[source]

Return the dimension of each input, either as a tuple or as a list of tuples. You can index this just as you would a torch tensor.

Return type

Union[Tuple, int]

property dims

A tuple describing the shape of your data. Extra functionality exposed in size.
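
The relationship between dims and size can be sketched as follows. This is a toy class written under the assumption that size(dim) simply indexes into dims; ToyDataModule is hypothetical and the shape is an invented example.

```python
class ToyDataModule:
    """Hypothetical stand-in; the real class is LightningDataModule."""

    def __init__(self, dims):
        self.dims = dims  # e.g. (channels, height, width)

    def size(self, dim=None):
        # Assumed behavior: without an index return all of dims,
        # otherwise return the entry at that index.
        if dim is None:
            return self.dims
        return self.dims[dim]


dm = ToyDataModule(dims=(1, 28, 28))
print(dm.size())
print(dm.size(0))
print(dm.size(-1))
```

Negative indices work as they do for a tuple or a tensor shape, so size(-1) gives the last dimension.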

property has_prepared_data

Return a bool indicating whether datamodule.prepare_data() has been called.

Returns

True if datamodule.prepare_data() has been called. False by default.

Return type

bool

property has_setup_fit

Return a bool indicating whether datamodule.setup('fit') has been called.

Returns

True if datamodule.setup('fit') has been called. False by default.

Return type

bool

property has_setup_test

Return a bool indicating whether datamodule.setup('test') has been called.

Returns

True if datamodule.setup('test') has been called. False by default.

Return type

bool

property test_transforms

Optional transforms (or collection of transforms) you can apply to the test dataset.

property train_transforms

Optional transforms (or collection of transforms) you can apply to the train dataset.

property val_transforms

Optional transforms (or collection of transforms) you can apply to the validation dataset.

pytorch_lightning.core.datamodule.track_data_hook_calls(fn)[source]

A decorator that checks if prepare_data/setup have been called.

  • When dm.prepare_data() is called, dm.has_prepared_data gets set to True.

  • When dm.setup('fit') is called, dm.has_setup_fit gets set to True.

  • When dm.setup('test') is called, dm.has_setup_test gets set to True.

  • When dm.setup() is called without the stage arg, both dm.has_setup_fit and dm.has_setup_test get set to True.

Parameters

fn (function) – Function that will be tracked to see if it has been called.

Returns

Decorated function that tracks its call status and saves it to private attrs in its obj instance.

Return type

function
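
The tracking rules listed above can be sketched with a small standard-library decorator. This is an illustration, not the source of track_data_hook_calls: track_calls, ToyDataModule, and the private attribute names are all hypothetical.

```python
import functools


def track_calls(fn):
    """Hypothetical decorator that records prepare_data/setup calls."""

    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if fn.__name__ == "prepare_data":
            self._has_prepared_data = True
        elif fn.__name__ == "setup":
            stage = args[0] if args else kwargs.get("stage")
            # A missing stage counts as both 'fit' and 'test'.
            if stage in ("fit", None):
                self._has_setup_fit = True
            if stage in ("test", None):
                self._has_setup_test = True
        return fn(self, *args, **kwargs)

    return wrapper


class ToyDataModule:
    """Hypothetical class; the flags are False until a hook runs."""

    _has_prepared_data = False
    _has_setup_fit = False
    _has_setup_test = False

    @track_calls
    def prepare_data(self):
        pass

    @track_calls
    def setup(self, stage=None):
        pass


dm = ToyDataModule()
dm.setup("fit")
print(dm._has_prepared_data, dm._has_setup_fit, dm._has_setup_test)
```

Only the flag for the requested stage changes here; calling dm.setup() with no stage would set both setup flags.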