- class pytorch_lightning.callbacks.GPUStatsMonitor(memory_utilization=True, gpu_utilization=True, intra_step_time=False, inter_step_time=False, fan_speed=False, temperature=False)
Deprecated since version v1.5: The GPUStatsMonitor callback was deprecated in v1.5 and will be removed in v1.7. Please use the DeviceStatsMonitor callback instead.
Automatically monitors and logs GPU stats during the training stage.
GPUStatsMonitor is a callback. To use it, you need to assign a logger in the Trainer.
Raises MisconfigurationException – If the NVIDIA driver is not installed, training is not running on GPUs, or the Trainer has no logger.
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import GPUStatsMonitor
>>> gpu_stats = GPUStatsMonitor()
>>> trainer = Trainer(callbacks=[gpu_stats])
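Because this callback is deprecated in favor of DeviceStatsMonitor, a migration might look like the sketch below. It assumes a pytorch_lightning version (v1.5 or later) that ships DeviceStatsMonitor and relies on the Trainer's default logger. The helper name build_trainer is hypothetical and exists only to keep the import local to the example.

```python
def build_trainer():
    """Create a Trainer that logs device stats via DeviceStatsMonitor.

    Hypothetical helper for illustration; the import is kept inside the
    function so the sketch can be defined without Lightning installed.
    """
    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import DeviceStatsMonitor

    # DeviceStatsMonitor is the replacement named in the deprecation note.
    device_stats = DeviceStatsMonitor()
    # The Trainer's default logger is assumed to be enabled here.
    return Trainer(callbacks=[device_stats])
```

The metric names logged by DeviceStatsMonitor are not guaranteed to match the ones produced by GPUStatsMonitor, so dashboards built on the old names may need updating.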
GPU stats are mainly based on the nvidia-smi --query-gpu command. The queries are described as follows:
fan.speed – The fan speed value is the percent of maximum speed that the device’s fan is currently intended to run at. It ranges from 0 to 100 %. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.
memory.used – Total memory allocated by active contexts.
memory.free – Total free memory.
utilization.gpu – Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.
utilization.memory – Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product.
temperature.gpu – Core GPU temperature, in degrees C.
temperature.memory – HBM memory temperature, in degrees C.
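The queries above can also be issued by hand. The following is a minimal sketch, not the callback's actual implementation: it builds an nvidia-smi command line from a list of query keys and parses the CSV output into one dictionary per GPU. The helper names build_query_command and parse_query_output are hypothetical, and the sample output is made up so the sketch runs on a machine without a GPU.

```python
import csv
import io

# Query keys taken from the list above; units are whatever nvidia-smi reports.
QUERY_KEYS = ["utilization.gpu", "memory.used", "memory.free", "temperature.gpu"]


def build_query_command(keys, gpu_ids=None):
    """Build the nvidia-smi argument list for the given query keys."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(keys),
        "--format=csv,nounits,noheader",
    ]
    if gpu_ids is not None:
        # Restrict the query to specific GPU indices.
        cmd.append("--id=" + ",".join(str(i) for i in gpu_ids))
    return cmd


def parse_query_output(text, keys):
    """Parse CSV output into one {key: float} dict per GPU (one row per GPU)."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        rows.append({key: float(value) for key, value in zip(keys, fields)})
    return rows


# Made-up sample output for two GPUs, used so the sketch runs without a GPU.
SAMPLE_OUTPUT = "37, 1024, 15136, 54\n0, 3, 16157, 31\n"
stats = parse_query_output(SAMPLE_OUTPUT, QUERY_KEYS)
print(stats[0]["memory.used"])
```

On a machine with an NVIDIA driver, the real output could be obtained with subprocess.run(build_query_command(QUERY_KEYS), capture_output=True, text=True, check=True).stdout. Devices that do not support a query may report a non-numeric placeholder, which this sketch does not handle.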
- on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx)
Called when the train batch ends.
- Return type: None
- on_train_batch_start(trainer, pl_module, batch, batch_idx)
Called when the train batch begins.
- Return type: None
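The two hooks above are natural places to take the timestamps behind the intra_step_time and inter_step_time options. The sketch below is an illustration under assumptions, not Lightning's code: it takes intra-step time to mean the time between a batch's start and end, and inter-step time to mean the gap between one batch ending and the next starting. StepTimer and its clock parameter are hypothetical names.

```python
import time


class StepTimer:
    """Hypothetical helper, not part of Lightning: times training batches."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._batch_start = None  # set when the current batch begins
        self._batch_end = None    # set when the previous batch ended

    def on_train_batch_start(self):
        """Return ms since the previous batch ended, or None for the first batch."""
        now = self._clock()
        inter_ms = None if self._batch_end is None else (now - self._batch_end) * 1000.0
        self._batch_start = now
        return inter_ms

    def on_train_batch_end(self):
        """Return ms spent inside the batch that just finished."""
        now = self._clock()
        intra_ms = None if self._batch_start is None else (now - self._batch_start) * 1000.0
        self._batch_end = now
        return intra_ms


# Usage: call the two methods around each simulated training batch.
timer = StepTimer()
for _ in range(2):
    inter_ms = timer.on_train_batch_start()
    time.sleep(0.01)  # stand-in for the forward and backward pass
    intra_ms = timer.on_train_batch_end()
```

Passing the clock in as a parameter keeps the timer easy to test with a fake time source. In a real callback the returned values would be handed to the Trainer's logger rather than kept in local variables.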