pytorch_lightning.callbacks.gpu_usage_logger module

GPU Usage Logger

Log GPU memory and GPU usage during training

class pytorch_lightning.callbacks.gpu_usage_logger.GpuUsageLogger(memory_utilisation=True, gpu_utilisation=True, intra_step_time=False, inter_step_time=False, fan_speed=False, temperature=False)[source]

Bases: pytorch_lightning.callbacks.base.Callback

Automatically logs GPU memory and GPU usage during the training stage.

Parameters
  • memory_utilisation (bool) – Set to True to log used and free memory, and the percentage of memory utilisation, at the start and end of each step. Default: True. From nvidia-smi --help-query-gpu: memory.used = `Total memory allocated by active contexts.`; memory.free = `Total free memory.`

  • gpu_utilisation (bool) – Set to True to log the percentage of GPU utilisation at the start and end of each step. Default: True.

  • intra_step_time (bool) – Set to True to log the duration of each step. Default: False.

  • inter_step_time (bool) – Set to True to log the time between the end of one step and the start of the next. Default: False.

  • fan_speed (bool) – Set to True to log the fan speed as a percentage of its maximum. Default: False.

  • temperature (bool) – Set to True to log the GPU and memory temperatures in degrees C. Default: False. (The second example below enables these optional flags.)

Example:

>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.callbacks import GpuUsageLogger
>>> gpu_usage = GpuUsageLogger()
>>> trainer = Trainer(callbacks=[gpu_usage])
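
The optional statistics listed above are enabled by passing the corresponding flags from the signature:

>>> gpu_usage = GpuUsageLogger(intra_step_time=True, inter_step_time=True,
...                            fan_speed=True, temperature=True)
>>> trainer = Trainer(callbacks=[gpu_usage])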

GPU usage is mainly based on the nvidia-smi --query-gpu command. The descriptions of the queries used here, as they appear in nvidia-smi --help-query-gpu:

  • fan.speed – `The fan speed value is the percent of maximum speed that the device's fan is currently intended to run at. It ranges from 0 to 100 %. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.`

  • memory.used – `Total memory allocated by active contexts.`

  • memory.free – `Total free memory.`

  • utilization.gpu – `Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.`

  • utilization.memory – `Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product.`

  • temperature.gpu – `Core GPU temperature, in degrees C.`

  • temperature.memory – `HBM memory temperature, in degrees C.`
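
The same fields can be queried directly. The following is a minimal sketch using Python's subprocess module, assuming nvidia-smi is on the PATH; it illustrates the underlying query, not this callback's internal code:

    import subprocess

    def query_gpu(*fields):
        """Ask nvidia-smi for the given fields; returns one list of values per GPU."""
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu={}".format(",".join(fields)),
             "--format=csv,nounits,noheader"],
            capture_output=True, text=True, check=True,
        )
        # Each output line is one GPU, with values separated by ", "
        return [line.split(", ") for line in result.stdout.strip().splitlines()]

    # e.g. [['1024', '15360', '37']] on a single-GPU machine
    print(query_gpu("memory.used", "memory.free", "utilization.gpu"))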

static _get_gpu_stat(pitem, unit)[source]
_log_gpu(trainer)[source]
_log_memory(trainer)[source]
on_batch_end(trainer, pl_module)[source]

Called when the training batch ends.

on_batch_start(trainer, pl_module)[source]

Called when the training batch begins.

on_train_epoch_start(trainer, pl_module)[source]

Called when the train epoch begins.
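
To make the intra/inter step timers concrete, here is a minimal sketch of the same idea as a standalone callback. This is an illustration only, not this class's actual implementation: the StepTimer name is hypothetical, and it prints instead of logging through the trainer's logger.

    import time

    from pytorch_lightning.callbacks import Callback

    class StepTimer(Callback):
        """Illustrative only: times each batch and the gap between batches."""

        def __init__(self):
            self._batch_start = None
            self._batch_end = None

        def on_batch_start(self, trainer, pl_module):
            if self._batch_end is not None:
                # Time elapsed since the previous batch ended (inter-step time)
                inter = time.monotonic() - self._batch_end
                print(f"inter_step_time: {inter * 1000:.1f} ms")
            self._batch_start = time.monotonic()

        def on_batch_end(self, trainer, pl_module):
            self._batch_end = time.monotonic()
            # Time spent inside the batch itself (intra-step time)
            intra = self._batch_end - self._batch_start
            print(f"intra_step_time: {intra * 1000:.1f} ms")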