Debug your model (advanced)

Audience: Users who want to debug distributed models.


Debug distributed models

To debug a distributed model, we recommend running the distributed version locally on CPUs:

from pytorch_lightning import Trainer

trainer = Trainer(accelerator="cpu", strategy="ddp", devices=2)

On the CPU, you can use pdb, breakpoint(), or regular print statements.

import pdb

from pytorch_lightning import LightningModule


class LitModel(LightningModule):
    def training_step(self, batch, batch_idx):
        debugging_message = ...
        # Prefix each message with the process rank so interleaved output stays readable
        print(f"RANK - {self.trainer.global_rank}: {debugging_message}")

        # Drop into the debugger on rank 0 only
        if self.trainer.global_rank == 0:
            pdb.set_trace()

        # Hold the other processes here until all processes are in sync
        self.trainer.strategy.barrier()
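
To see why the barrier matters, here is a minimal standard-library sketch that imitates the pattern above with threads instead of real DDP processes. It is not Lightning code: `simulate_ranks`, the `time.sleep` stand-in for the debugger pause, and the use of `threading.Barrier` in place of `self.trainer.strategy.barrier()` are all assumptions made for illustration.

```python
import threading
import time


def simulate_ranks(world_size=2, pause=0.2):
    """Imitate rank-gated debugging: rank 0 pauses, every rank waits at a barrier.

    Hypothetical helper for illustration only; threads stand in for DDP processes.
    """
    barrier = threading.Barrier(world_size)  # stand-in for strategy.barrier()
    events = []  # ordered log of (rank, message)
    lock = threading.Lock()

    def log(rank, message):
        with lock:
            events.append((rank, message))
        print(f"RANK - {rank}: {message}")

    def training_step(rank):
        if rank == 0:
            log(rank, "entered training_step")
            time.sleep(pause)  # stand-in for time spent inside pdb.set_trace()
            log(rank, "finished inspecting")
        else:
            log(rank, "entered training_step")
        # No rank gets past this point until all ranks have reached it
        barrier.wait()
        log(rank, "passed barrier")

    threads = [
        threading.Thread(target=training_step, args=(rank,))
        for rank in range(world_size)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


if __name__ == "__main__":
    simulate_ranks()
```

In this sketch, rank 1 cannot log "passed barrier" until rank 0 has finished its pause. Without the barrier, the other ranks would run ahead while rank 0 sits in the debugger.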

When everything works, switch back to GPU by changing only the accelerator.

trainer = Trainer(accelerator="gpu", strategy="ddp", devices=2)
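
One way to make this switch a single flag is to build the Trainer arguments in a small helper. The sketch below is an assumption rather than part of the Lightning API: `trainer_kwargs` and its `debug` flag are hypothetical names, and only the `accelerator`, `strategy`, and `devices` values come from the examples above.

```python
def trainer_kwargs(debug, devices=2):
    """Return Trainer keyword arguments for CPU debugging or GPU training.

    Hypothetical helper: only the accelerator changes between the two modes.
    """
    return {
        "accelerator": "cpu" if debug else "gpu",
        "strategy": "ddp",
        "devices": devices,
    }


if __name__ == "__main__":
    print(trainer_kwargs(debug=True))
    print(trainer_kwargs(debug=False))
```

The result can be unpacked into the constructor, for example `Trainer(**trainer_kwargs(debug=True))` while debugging and `Trainer(**trainer_kwargs(debug=False))` once everything works.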

© Copyright (c) 2018-2022, Lightning AI et al... Revision f4fcad36.
