Shortcuts

Fault-tolerant Training (expert)

Audience: Experts looking to enable and handle their own fault-tolerance.

Pre-requisites: Users must have first read Fault-tolrance Training (basic)


Enable fault-tolerant behavior anywhere

To enable fault tolerance on your own cloud or cluster environment enable the PL_FAULT_TOLERANT_TRAINING environment variable:

PL_FAULT_TOLERANT_TRAINING=1 python script.py

Although Lighting will now be fault-tolerant, you’ll have to handle all the nuances of making sure the models are automatically restarted.

Note

This complexity is already handled for you if you use lightning-grid.


Enable fault-tolerant behavior on your own cluster

The simplest way to enable fault-tolerant behavior is to enable lightning-grid to work on your on-prem cluster or cloud environment which will handle all the nuances of fault-tolerant training at scale.

Email us to connect with your own cloud account:

mailto:onprem@pytorchlightning.ai


© Copyright Copyright (c) 2018-2022, Lightning AI et al... Revision f4fcad36.

Built with Sphinx using a theme provided by Read the Docs.
Read the Docs v: latest
Versions
latest
stable
1.8.3post1
1.8.3.post0
1.8.3
1.8.2
1.8.1
1.8.0.post1
1.8.0
1.7.7
1.7.6
1.7.5
1.7.4
1.7.3
1.7.2
1.7.1
1.7.0
1.6.5
1.6.4
1.6.3
1.6.2
1.6.1
1.6.0
1.5.10
1.5.9
1.5.8
1.5.7
1.5.6
1.5.5
1.5.4
1.5.3
1.5.2
1.5.1
1.5.0
1.4.9
1.4.8
1.4.7
1.4.6
1.4.5
1.4.4
1.4.3
1.4.2
1.4.1
1.4.0
1.3.8
1.3.7
1.3.6
1.3.5
1.3.4
1.3.3
1.3.2
1.3.1
1.3.0
1.2.10
1.2.8
1.2.7
1.2.6
1.2.5
1.2.4
1.2.3
1.2.2
1.2.1
1.2.0
1.1.8
1.1.7
1.1.6
1.1.5
1.1.4
1.1.3
1.1.2
1.1.1
1.1.0
1.0.8
1.0.7
1.0.6
1.0.5
1.0.4
1.0.3
1.0.2
1.0.1
1.0.0
0.10.0
0.9.0
0.8.5
0.8.4
0.8.3
0.8.2
0.8.1
0.8.0
0.7.6
0.7.5
0.7.4
0.7.3
0.7.2
0.7.1
0.7.0
0.6.0
0.5.3
0.4.9
Downloads
html
On Read the Docs
Project Home
Builds

Free document hosting provided by Read the Docs.