Accelerator: HPU training¶
Audience: Users looking to save money and run large models faster using single or multiple Gaudi devices.
What is an HPU?¶
Habana® Gaudi® AI Processor (HPU) training processors are built on a heterogeneous architecture: a cluster of fully programmable Tensor Processing Cores (TPC), together with their associated development tools and libraries, and a configurable Matrix Math engine.
The TPC core is a VLIW SIMD processor with an instruction set and hardware tailored to serve training workloads efficiently. The Gaudi memory architecture includes on-die SRAM and local memories in each TPC. Gaudi is also the first DL training processor with integrated RDMA over Converged Ethernet (RoCE v2) engines on-chip.
On the software side, the PyTorch Habana bridge interfaces between the framework and the SynapseAI software stack to enable the execution of deep learning models on the Habana Gaudi device.
Gaudi offers a substantial price/performance advantage – so you get to do more deep learning training while spending less.
Run on 1 Gaudi¶
To enable PyTorch Lightning to use the HPU accelerator, simply pass accelerator="hpu" to the Trainer class.
trainer = Trainer(accelerator="hpu", devices=1)
Run on multiple Gaudis¶
Passing devices=8 and accelerator="hpu" to the Trainer class enables the Habana accelerator for distributed training with 8 Gaudis. It uses HPUParallelStrategy internally, which is based on the DDP strategy and adds Habana's collective communication library (HCCL) to support scale-up within a node and scale-out across multiple nodes.
trainer = Trainer(devices=8, accelerator="hpu")
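The strategy can also be selected explicitly. A minimal sketch, assuming the HPUParallelStrategy import path shown below (it may differ across Lightning versions); the construction is guarded so the snippet degrades gracefully where the HPU integration or devices are unavailable:

```python
# Hedged sketch: explicitly select HPUParallelStrategy (DDP + HCCL).
# The import path is an assumption and may vary by Lightning version.
try:
    from pytorch_lightning import Trainer
    from pytorch_lightning.strategies import HPUParallelStrategy

    trainer = Trainer(
        accelerator="hpu",
        devices=8,
        strategy=HPUParallelStrategy(),  # DDP-based, with HCCL collectives
    )
except Exception:
    # pytorch_lightning / the Habana integration is not installed,
    # or no HPU devices are visible in this environment.
    trainer = None
```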
Select Gaudis automatically¶
Lightning can automatically detect the number of Gaudi devices to run on. This setting is enabled by default if the devices argument is missing.
# equivalent
trainer = Trainer(accelerator="hpu")
trainer = Trainer(accelerator="hpu", devices="auto")
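Conceptually, devices="auto" resolves to however many Gaudi devices were detected. A pure-Python sketch of that decision logic (the function name is hypothetical, not part of the Lightning API):

```python
# Hypothetical sketch of "auto" device resolution; not the actual
# Lightning implementation, just the fallback logic it implies.
def resolve_devices(devices, detected_hpus):
    """Map a `devices` argument to a concrete device count."""
    if devices is None or devices == "auto":
        return detected_hpus  # use everything that was detected
    return int(devices)       # an explicit request wins

print(resolve_devices("auto", 8))  # -> 8
print(resolve_devices(4, 8))       # -> 4
```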
How to access HPUs¶
To use HPUs, you must have access to a system with HPU devices.
Check out the Get Started Guide with AWS and Habana.
Known limitations¶
Habana dataloader is not supported.
torch.inference_mode() is not supported.
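Because torch.inference_mode() is unsupported, a torch.no_grad() context is a reasonable substitute for gradient-free evaluation (a generic PyTorch sketch, not HPU-specific):

```python
import torch

# torch.inference_mode() is unsupported on HPU, so fall back to
# torch.no_grad() for gradient-free evaluation.
model = torch.nn.Linear(4, 2)
x = torch.randn(1, 4)

with torch.no_grad():
    y = model(x)

assert not y.requires_grad  # no autograd graph was recorded
```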