GPUStatsMonitor(memory_utilization=True, gpu_utilization=True, intra_step_time=False, inter_step_time=False, fan_speed=False, temperature=False)
Automatically monitors and logs GPU stats during the training stage.
GPUStatsMonitor is a callback, and in order to use it you need to assign a logger in the Trainer.

>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import GPUStatsMonitor
>>> gpu_stats = GPUStatsMonitor()
>>> trainer = Trainer(callbacks=[gpu_stats])
GPU stats are mainly based on the nvidia-smi --query-gpu command. The description of the queries is as follows:
fan.speed – The fan speed value is the percent of maximum speed that the device’s fan is currently intended to run at. It ranges from 0 to 100 %. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.
memory.used – Total memory allocated by active contexts.
memory.free – Total free memory.
utilization.gpu – Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.
utilization.memory – Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product.
temperature.gpu – Core GPU temperature, in degrees C.
temperature.memory – HBM memory temperature, in degrees C.
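To make the query list above concrete, here is a small illustrative sketch (not part of GPUStatsMonitor's public API; the function and variable names are assumptions) that parses one CSV row of the kind produced by nvidia-smi --query-gpu=... --format=csv,nounits,noheader into a stats dictionary:

```python
# Queries in the order described above.
QUERIES = [
    "fan.speed",
    "memory.used",
    "memory.free",
    "utilization.gpu",
    "utilization.memory",
    "temperature.gpu",
    "temperature.memory",
]

def parse_gpu_stats(csv_line: str) -> dict:
    """Map one CSV row from nvidia-smi (with `nounits,noheader`) to {query: float}."""
    values = [field.strip() for field in csv_line.split(",")]
    return {query: float(value) for query, value in zip(QUERIES, values)}

# Example row: fan at 30 %, 1024 MiB used, 15258 MiB free, 45 % GPU
# utilization, 20 % memory utilization, 60 C core, 58 C HBM.
stats = parse_gpu_stats("30, 1024, 15258, 45, 20, 60, 58")
```

In practice the callback issues the query itself and logs the resulting values to the assigned logger; this sketch only shows how the query names map to numeric stats.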
on_train_batch_end(trainer, *args, **kwargs)
Called when the train batch ends.
on_train_batch_start(trainer, *args, **kwargs)
Called when the train batch begins.
on_train_epoch_start(trainer, *args, **kwargs)
Called when the train epoch begins.
on_train_start(trainer, *args, **kwargs)
Called when the train begins.
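The batch hooks listed above are what make the intra_step_time and inter_step_time options possible. Below is a minimal sketch of that timing logic as a plain Python class (it does not subclass the real Callback base class, and the attribute names are assumptions for illustration, not GPUStatsMonitor's actual internals):

```python
import time

class StepTimer:
    """Sketch of step timing via on_train_batch_start/on_train_batch_end hooks."""

    def __init__(self):
        self.snap_intra_step_time = None  # timestamp taken at batch start
        self.snap_inter_step_time = None  # timestamp taken at batch end
        self.intra_step_times = []        # time spent inside each batch
        self.inter_step_times = []        # gap between consecutive batches

    def on_train_batch_start(self, trainer=None, *args, **kwargs):
        # Time elapsed since the previous batch ended = inter-step time.
        if self.snap_inter_step_time is not None:
            self.inter_step_times.append(time.monotonic() - self.snap_inter_step_time)
        self.snap_intra_step_time = time.monotonic()

    def on_train_batch_end(self, trainer=None, *args, **kwargs):
        # Time elapsed since this batch started = intra-step time.
        if self.snap_intra_step_time is not None:
            self.intra_step_times.append(time.monotonic() - self.snap_intra_step_time)
        self.snap_inter_step_time = time.monotonic()

timer = StepTimer()
for _ in range(3):  # simulate three training batches
    timer.on_train_batch_start()
    time.sleep(0.01)  # stand-in for the forward/backward pass
    timer.on_train_batch_end()
```

The real callback records these durations and sends them to the Trainer's logger at each step; this sketch only demonstrates how the two hooks bracket the work done per batch.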