Changelog¶
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
[1.3.8] - 2021-06-30¶
Fixed a sync deadlock when checkpointing a LightningModule that uses a torchmetrics 0.4 Metric (#8218)
Fixed compatibility with TorchMetrics v0.4 (#8206)
Added torchelastic check when sanitizing GPUs (#8095)
Fixed a DDP info message that was never shown (#8111)
Fixed metrics deprecation message at module import level (#8163)
Fixed a bug where an infinite recursion would be triggered when using the BaseFinetuning callback on a model that contains a ModuleDict (#8170)
Added a mechanism to detect deadlock for DDP when only 1 process triggers an Exception. The mechanism will kill the processes when it happens (#8167)
Fixed NCCL error when selecting non-consecutive device ids (#8165)
Fixed SWA to also work with IterableDataset (#8172)
[1.3.7] - 2021-06-22¶
Fixed a bug where skipping an optimizer while using amp causes amp to trigger an assertion error (#7975)
Fixed deprecation messages not showing due to incorrect stacklevel (#8002, #8005)
Fixed setting a DistributedSampler when using a distributed plugin in a custom accelerator (#7814)
Improved PyTorchProfiler chrome traces names (#8009)
Fixed moving the best score to device in EarlyStopping callback for TPU devices (#7959)
Fixed backward compatibility of moved functions rank_zero_warn and rank_zero_deprecation (#8085)
[1.3.6] - 2021-06-15¶
[1.3.6] - Fixed¶
Fixed logs overwriting issue for remote filesystems (#7889)
Fixed DataModule.prepare_data could only be called on the global rank 0 process (#7945)
Fixed setting worker_init_fn to seed dataloaders correctly when using DDP (#7942)
Fixed BaseFinetuning callback to properly handle parent modules w/ parameters (#7931)
[1.3.5] - 2021-06-08¶
[1.3.5] - Added¶
Added warning to Training Step output (#7779)
[1.3.5] - Fixed¶
[1.3.5] - Changed¶
Move training_output validation to after train_step_end (#7868)
[1.3.4] - 2021-06-01¶
[1.3.4] - Fixed¶
[1.3.3] - 2021-05-27¶
[1.3.3] - Changed¶
Changed calling of untoggle_optimizer(opt_idx) out of the closure function (#7563)
[1.3.3] - Fixed¶
Fixed ProgressBar pickling after calling trainer.predict (#7608)
Fixed broadcasting in multi-node, multi-gpu DDP using torch 1.7 (#7592)
Fixed dataloaders are not reset when tuning the model (#7566)
Fixed print errors in ProgressBar when trainer.fit is not called (#7674)
Fixed global step update when the epoch is skipped (#7677)
Fixed training loop total batch counter when accumulate grad batches was enabled (#7692)
[1.3.2] - 2021-05-18¶
[1.3.2] - Changed¶
DataModules now avoid duplicate {setup,teardown,prepare_data} calls for the same stage (#7238)
[1.3.2] - Fixed¶
Fixed parsing of multiple training dataloaders (#7433)
Fixed recursive passing of wrong_type keyword argument in pytorch_lightning.utilities.apply_to_collection (#7433)
Fixed setting correct DistribType for ddp_cpu (spawn) backend (#7492)
Fixed incorrect number of calls to LR scheduler when check_val_every_n_epoch > 1 (#7032)
[1.3.1] - 2021-05-11¶
[1.3.1] - Fixed¶
[1.3.0] - 2021-05-06¶
[1.3.0] - Added¶
Added support for the EarlyStopping callback to run at the end of the training epoch (#6944)
Added synchronization points before and after setup hooks are run (#7202)
Added a teardown hook to ClusterEnvironment (#6942)
Added utils for metrics to scalar conversions (#7180)
Added utils for NaN/Inf detection for gradients and parameters (#6834)
Added more explicit exception message when trying to execute trainer.test() or trainer.validate() with fast_dev_run=True (#6667)
Added LightningCLI class to provide simple reproducibility with minimum boilerplate training CLI (#4492, #6862, #7156, #7299)
Added gradient_clip_algorithm argument to Trainer for gradient clipping by value (#6123)
Added a way to print to terminal without breaking up the progress bar (#5470)
Added support to checkpoint after training steps in ModelCheckpoint callback (#6146)
Added TrainerStatus.{INITIALIZING,RUNNING,FINISHED,INTERRUPTED} (#7173)
Added Trainer.validate() method to perform one evaluation epoch over the validation set (#4948)
Added LightningEnvironment for Lightning-specific DDP (#5915)
Added teardown() hook to LightningDataModule (#4673)
Added auto_insert_metric_name parameter to ModelCheckpoint (#6277)
Added arg to self.log that enables users to give custom names when dealing with multiple dataloaders (#6274)
Added teardown method to BaseProfiler to enable subclasses defining post-profiling steps outside of __del__ (#6370)
Added setup method to BaseProfiler to enable subclasses defining pre-profiling steps for every process (#6633)
Added no return warning to predict (#6139)
Added Trainer.predict config validation (#6543)
Added AbstractProfiler interface (#6621)
Added support for including module names for forward in the autograd trace of PyTorchProfiler (#6349)
Added support for the PyTorch 1.8.1 autograd profiler (#6618)
Added outputs parameter to callback’s on_validation_epoch_end & on_test_epoch_end hooks (#6120)
Added configure_sharded_model hook (#6679)
Added support for precision=64, enabling training with double precision (#6595)
Added support for DDP communication hooks (#6736)
Added artifact_location argument to MLFlowLogger which will be passed to the MlflowClient.create_experiment call (#6677)
Added model parameter to precision plugins’ clip_gradients signature (#6764, #7231)
Added is_last_batch attribute to Trainer (#6825)
Added LightningModule.lr_schedulers() for manual optimization (#6567)
Added MpModelWrapper in TPU Spawn (#7045)
Added max_time Trainer argument to limit training time (#6823)
Added on_predict_{batch,epoch}_{start,end} hooks (#7141)
Added new EarlyStopping parameters stopping_threshold and divergence_threshold (#6868)
Added debug flag to TPU Training Plugins (PT_XLA_DEBUG) (#7219)
Added new UnrepeatedDistributedSampler and IndexBatchSamplerWrapper for tracking distributed predictions (#7215)
Added trainer.predict(return_predictions=None|False|True) (#7215)
Added BasePredictionWriter callback to implement prediction saving (#7127)
Added trainer.tune(scale_batch_size_kwargs, lr_find_kwargs) arguments to configure the tuning algorithms (#7258)
Added tpu_distributed check for TPU Spawn barrier (#7241)
Added device updates to TPU Spawn for Pod training (#7243)
Added warning when missing Callback and using resume_from_checkpoint (#7254)
DeepSpeed single file saving (#6900)
Added Training type Plugins Registry (#6982, #7063, #7214, #7224)
Add ignore param to save_hyperparameters (#6056)
[1.3.0] - Changed¶
Changed LightningModule.truncated_bptt_steps to be property (#7323)
Changed EarlyStopping callback from by default running EarlyStopping.on_validation_end if only training is run. Set check_on_train_epoch_end to run the callback at the end of the train epoch instead of at the end of the validation epoch (#7069)
Renamed pytorch_lightning.callbacks.swa to pytorch_lightning.callbacks.stochastic_weight_avg (#6259)
Refactor RunningStage and TrainerState usage (#4945, #7173)
Added RunningStage.SANITY_CHECKING
Added TrainerFn.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}
Changed trainer.evaluating to return True if validating or testing
Changed setup() and teardown() stage argument to take any of {fit,validate,test,predict} (#6386)
Changed profilers to save separate report files per state and rank (#6621)
The trainer no longer tries to save a checkpoint on exception or run callback’s on_train_end functions (#6864)
Changed PyTorchProfiler to use torch.autograd.profiler.record_function to record functions (#6349)
Disabled lr_scheduler.step() in manual optimization (#6825)
Changed warnings and recommendations for dataloaders in ddp_spawn (#6762)
pl.seed_everything will now also set the seed on the DistributedSampler (#7024)
Changed default setting for communication of multi-node training using DDPShardedPlugin (#6937)
trainer.tune() now returns the tuning result (#7258)
LightningModule.from_datasets() now accepts IterableDataset instances as training datasets. (#7503)
Changed resume_from_checkpoint warning to an error when the checkpoint file does not exist (#7075)
Automatically set sync_batchnorm for training_type_plugin (#6536)
Allowed training type plugin to delay optimizer creation (#6331)
Removed ModelSummary validation from train loop on_trainer_init (#6610)
Moved save_function to accelerator (#6689)
Improved verbose logging for EarlyStopping callback (#6811)
Run ddp_spawn dataloader checks on Windows (#6930)
Updated mlflow with using resolve_tags (#6746)
Moved save_hyperparameters to its own function (#7119)
Replaced _DataModuleWrapper with __new__ (#7289)
Reset current_fx properties on lightning module in teardown (#7247)
Auto-set DataLoader.worker_init_fn with seed_everything (#6960)
Remove model.trainer call inside of dataloading mixin (#7317)
Split profilers module (#6261)
Ensure accelerator is valid if running interactively (#5970)
Disabled batch transfer in DP mode (#6098)
[1.3.0] - Deprecated¶
Deprecated outputs in both LightningModule.on_train_epoch_end and Callback.on_train_epoch_end hooks (#7339)
Deprecated Trainer.truncated_bptt_steps in favor of LightningModule.truncated_bptt_steps (#7323)
Deprecated LightningModule.grad_norm in favor of pytorch_lightning.utilities.grads.grad_norm (#7292)
Deprecated the save_function property from the ModelCheckpoint callback (#7201)
Deprecated LightningModule.write_predictions and LightningModule.write_predictions_dict (#7066)
Deprecated TrainerLoggingMixin in favor of a separate utilities module for metric handling (#7180)
Deprecated TrainerTrainingTricksMixin in favor of a separate utilities module for NaN/Inf detection for gradients and parameters (#6834)
period has been deprecated in favor of every_n_val_epochs in the ModelCheckpoint callback (#6146)
Deprecated trainer.running_sanity_check in favor of trainer.sanity_checking (#4945)
Deprecated Profiler(output_filename) in favor of dirpath and filename (#6621)
Deprecated PytorchProfiler(profiled_functions) in favor of record_functions (#6349)
Deprecated @auto_move_data in favor of trainer.predict (#6993)
Deprecated Callback.on_load_checkpoint(checkpoint) in favor of Callback.on_load_checkpoint(trainer, pl_module, checkpoint) (#7253)
Deprecated metrics in favor of torchmetrics (#6505, #6530, #6540, #6547, #6515, #6572, #6573, #6584, #6636, #6637, #6649, #6659, #7131)
Deprecated the LightningModule.datamodule getter and setter methods; access them through Trainer.datamodule instead (#7168)
Deprecated the use of Trainer(gpus="i") (string) for selecting the i-th GPU; from v1.5 this will set the number of GPUs instead of the index (#6388)
[1.3.0] - Removed¶
Removed the exp_save_path property from the LightningModule (#7266)
Removed training loop explicitly calling EarlyStopping.on_validation_end if no validation is run (#7069)
Removed automatic_optimization as a property from the training loop in favor of LightningModule.automatic_optimization (#7130)
Removed evaluation loop legacy returns for *_epoch_end hooks (#6973)
Removed support for passing a bool value to profiler argument of Trainer (#6164)
Removed no return warning from val/test step (#6139)
Removed passing a ModelCheckpoint instance to Trainer(checkpoint_callback) (#6166)
Removed deprecated Trainer argument enable_pl_optimizer and automatic_optimization (#6163)
Removed deprecated metrics (#6161)
from pytorch_lightning.metrics.functional.classification removed to_onehot, to_categorical, get_num_classes, roc, multiclass_roc, average_precision, precision_recall_curve, multiclass_precision_recall_curve
from pytorch_lightning.metrics.functional.reduction removed reduce, class_reduce
Removed deprecated ModelCheckpoint arguments prefix, mode="auto" (#6162)
Removed mode='auto' from EarlyStopping (#6167)
Removed epoch and step arguments from ModelCheckpoint.format_checkpoint_name(), these are now included in the metrics argument (#7344)
Removed legacy references for magic keys in the Result object (#6016)
Removed deprecated LightningModule hparams setter (#6207)
Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the "log"/"progress_bar" magic keys. Use self.log instead (#6734)
Removed trainer.fit() return value of 1. It has no return now (#7237)
Removed logger_connector legacy code (#6733)
Removed unused mixin attributes (#6487)
[1.3.0] - Fixed¶
Fixed NaN errors in progress bars when training with iterable datasets with no length defined (#7306)
Fixed attaching train and validation dataloaders when reload_dataloaders_every_epoch=True and num_sanity_val_steps=0 (#7207)
Added a barrier in the accelerator teardown to synchronize processes before execution finishes (#6814)
Fixed multi-node DDP sub-process launch by using local_rank instead of global_rank for main process assertion (#7061)
Fixed incorrect removal of WORLD_SIZE environment variable in DDP training when launching with torch distributed/torchelastic (#6942)
Made the Plugin.reduce method more consistent across all Plugins to reflect a mean-reduction by default (#6011)
Move lightning module to correct device type when using LightningDistributedWrapper (#6070)
Do not print top-k verbose log with ModelCheckpoint(monitor=None) (#6109)
Fixed ModelCheckpoint(save_top_k=0, save_last=True) not saving the last checkpoint (#6136)
Fixed .teardown(stage='fit') and .on_fit_{start,end}() getting called during trainer.test (#6386)
Fixed LightningModule all_gather on cpu tensors (#6416)
Fixed torch distributed not available in setup hook for DDP (#6506)
Fixed trainer.tuner.{lr_find,scale_batch_size} not setting the Trainer state properly (#7258)
Fixed bug where the learning rate schedulers did not follow the optimizer frequencies (#4868)
Fixed pickle error checker to now check for pickle.PickleError to catch all pickle errors (#6917)
Fixed a bug where the outputs object passed to LightningModule.training_epoch_end was different from the object passed to the on_train_end_epoch hook (#6969)
Fixed a bug where the outputs passed to train_batch_end would be lists even when using a single optimizer and no truncated backprop through time steps (#6969)
Fixed bug for trainer error handling which would cause hang for distributed training (#6864)
Fixed self.device not returning the correct device in replicas of data-parallel (#6414)
Fixed lr_find trying beyond num_training steps and suggesting a too high learning rate (#7076)
Fixed logger creating incorrect version folder in DDP with repeated Trainer.fit calls (#7077)
Fixed metric objects passed directly to self.log not being reset correctly (#7055)
Fixed CombinedLoader in distributed settings for validation / testing (#7102)
Fixed the save_dir in WandbLogger when the run was initiated externally (#7106)
Fixed num_sanity_val_steps affecting reproducibility of training data shuffling (#7014)
Fixed resetting device after fitting/evaluating/predicting (#7188)
Fixed bug where trainer.tuner.scale_batch_size(max_trials=0) would not return the correct batch size result (#7262)
Fixed metrics not being properly logged with precision=16 and manual_optimization (#7228)
Fixed BaseFinetuning properly reloading optimizer_states when using resume_from_checkpoint (#6891)
Fixed parameters_to_ignore not properly set to DDPWrapper (#7239)
Fixed parsing of fast_dev_run=True with the built-in ArgumentParser (#7240)
Fixed handling an IterableDataset that fails to produce a batch at the beginning of an epoch (#7294)
Fixed LightningModule.save_hyperparameters() when attempting to save an empty container (#7268)
Fixed apex not properly instantiated when running with ddp (#7274)
Fixed optimizer state not moved to GPU (#7277)
Fixed custom init args for WandbLogger (#6989)
Fixed a bug where an error would be raised if the train dataloader sometimes produced None for a batch (#7342)
Fixed examples (#6600, #6638, #7096, #7246, #6357, #6476, #6294, #6373, #6088, #7398)
Resolved schedule step bug for PyTorch Profiler (#6674, #6681)
Updated logic for checking TPUs availability (#6767)
Resolve TPU miss rendezvous (#6781)
Fixed auto-scaling mode when calling tune method on trainer (#7321)
Fixed finetuning complex models correctly unfreezes (#6880)
Ensure we set the eval/train flag correctly on accelerator model (#6877)
Set better defaults for rank_zero_only.rank when training is launched with SLURM and torchelastic (#6802)
Fixed matching the number of outputs of backward with forward for AllGatherGrad (#6625)
Fixed the gradient_clip_algorithm has no effect (#6928)
Fixed CUDA OOM detection and handling (#6934)
Fixed unfreeze_and_add_param_group expects modules rather than module (#6822)
Fixed DDP + SyncBN when move on device (#6838)
Fixed missing arguments in lr_find call (#6784)
Fixed set_default_tensor_type to torch.DoubleTensor with precision=64 (#7108)
Fixed NeptuneLogger.log_text(step=None) (#7194)
[1.2.9] - 2021-04-20¶
[1.2.9] - Fixed¶
[1.2.8] - 2021-04-14¶
[1.2.8] - Added¶
Added TPUSpawn + IterableDataset error message (#6875)
[1.2.8] - Fixed¶
Fixed process rank not being available right away after Trainer instantiation (#6941)
Fixed sync_dist for tpus (#6950)
Fixed AttributeError for require_backward_grad_sync when running manual optimization with sharded plugin (#6915)
Fixed --gpus default for parser returned by Trainer.add_argparse_args (#6898)
Fixed TPU Spawn all gather (#6896)
Fixed EarlyStopping logic when min_epochs or min_steps requirement is not met (#6705)
Fixed csv extension check (#6436)
Fixed checkpoint issue when using Horovod distributed backend (#6958)
Fixed tensorboard exception raising (#6901)
Fixed setting the eval/train flag correctly on accelerator model (#6983)
Fixed DDP_SPAWN compatibility with bug_report_model.py (#6892)
Fixed bug where BaseFinetuning.flatten_modules() was duplicating leaf node parameters (#6879)
Set better defaults for rank_zero_only.rank when training is launched with SLURM and torchelastic:
[1.2.7] - 2021-04-06¶
[1.2.7] - Fixed¶
Fixed a bug with omegaconf and xm.save (#6741)
Fixed an issue with IterableDataset when len is not defined (#6828)
Sanitize None params during pruning (#6836)
Enforce an epoch scheduler interval when using SWA (#6588)
Fixed TPU Colab hang issue, post training (#6816)
Fixed a bug where TensorBoardLogger would give a warning and not log correctly to a symbolic link save_dir (#6730)
Fixed bug where predict could not be used when progress_bar_refresh_rate=0 (#6884)
[1.2.6] - 2021-03-30¶
[1.2.6] - Changed¶
Changed the behavior of on_epoch_start to run at the beginning of validation & test epoch (#6498)
[1.2.6] - Removed¶
Removed legacy code to include step dictionary returns in callback_metrics. Use self.log_dict instead. (#6682)
[1.2.6] - Fixed¶
Fixed DummyLogger.log_hyperparams raising a TypeError when running with fast_dev_run=True (#6398)
Fixed error on TPUs when there was no ModelCheckpoint (#6654)
Fixed trainer.test freeze on TPUs (#6654)
Fixed a bug where gradients were disabled after calling Trainer.predict (#6657)
Fixed bug where no TPUs were detected in a TPU pod env (#6719)
[1.2.5] - 2021-03-23¶
[1.2.5] - Changed¶
[1.2.5] - Fixed¶
[1.2.4] - 2021-03-16¶
[1.2.4] - Changed¶
Changed the default of find_unused_parameters back to True in DDP and DDP Spawn (#6438)
[1.2.4] - Fixed¶
Expose DeepSpeed loss parameters to allow users to fix loss instability (#6115)
Fixed DP reduction with collection (#6324)
Fixed an issue where the tuner would not tune the learning rate if also tuning the batch size (#4688)
Fixed broadcast to use PyTorch broadcast_object_list and add reduce_decision (#6410)
Fixed logger creating directory structure too early in DDP (#6380)
Fixed DeepSpeed additional memory use on rank 0 when default device not set early enough (#6460)
Fixed an issue with Tuner.scale_batch_size not finding the batch size attribute in the datamodule (#5968)
Fixed an exception in the layer summary when the model contains torch.jit scripted submodules (#6511)
Fixed Train loop config being run during Trainer.predict (#6541)
[1.2.3] - 2021-03-09¶
[1.2.3] - Fixed¶
Fixed ModelPruning(make_pruning_permanent=True) pruning buffers getting removed when saved during training (#6073)
Fixed _stable_1d_sort to work when n >= N (#6177)
Fixed AttributeError when logger=None on TPU (#6221)
Fixed PyTorch Profiler with emit_nvtx (#6260)
Fixed trainer.test from best_path hangs after calling trainer.fit (#6272)
Fixed SingleTPU calling all_gather (#6296)
Ensure we check DeepSpeed/Sharded in multi-node DDP (#6297)
Check LightningOptimizer doesn’t delete optimizer hooks (#6305)
Resolve memory leak for evaluation (#6326)
Ensure that clip gradients is only called if the value is greater than 0 (#6330)
Fixed Trainer not resetting lightning_optimizers when calling Trainer.fit() multiple times (#6372)
[1.2.2] - 2021-03-02¶
[1.2.2] - Added¶
Added checkpoint parameter to callback’s on_save_checkpoint hook (#6072)
[1.2.2] - Changed¶
[1.2.2] - Fixed¶
Fixed epoch level schedulers not being called when val_check_interval < 1.0 (#6075)
Fixed multiple early stopping callbacks (#6197)
Fixed incorrect usage of detach(), cpu(), to() (#6216)
Fixed LBFGS optimizer support which didn’t converge in automatic optimization (#6147)
Prevent WandbLogger from dropping values (#5931)
Fixed error thrown when using valid distributed mode in multi node (#6297)
[1.2.1] - 2021-02-23¶
[1.2.1] - Fixed¶
[1.2.0] - 2021-02-18¶
[1.2.0] - Added¶
Added DataType, AverageMethod and MDMCAverageMethod enum in metrics (#5657)
Added support for summarized model total params size in megabytes (#5590)
Added support for multiple train loaders (#1959)
Accuracy metric now generalizes to Top-k accuracy for (multi-dimensional) multi-class inputs using the top_k parameter (#4838)
Accuracy metric now enables the computation of subset accuracy for multi-label or multi-dimensional multi-class inputs with the subset_accuracy parameter (#4838)
Added HammingDistance metric to compute the hamming distance (loss) (#4838)
Added max_fpr parameter to auroc metric for computing partial auroc metric (#3790)
Added StatScores metric to compute the number of true positives, false positives, true negatives and false negatives (#4839)
Added R2Score metric (#5241)
Added LambdaCallback (#5347)
Added BackboneLambdaFinetuningCallback (#5377)
Accelerator all_gather supports collection (#5221)
Added image_gradients functional metric to compute the image gradients of a given input image. (#5056)
Added MetricCollection (#4318)
Added .clone() method to metrics (#4318)
Added IoU class interface (#4704)
Support to tie weights after moving model to TPU via on_post_move_to_device hook
Added missing val/test hooks in LightningModule (#5467)
The Recall and Precision metrics (and their functional counterparts recall and precision) can now be generalized to Recall@K and Precision@K with the use of top_k parameter (#4842)
Added PyTorchProfiler (#5560)
Added compositional metrics (#5464)
Added Trainer method predict(...) for high performance predictions (#5579)
Added on_before_batch_transfer and on_after_batch_transfer data hooks (#3671)
Added AUC/AUROC class interface (#5479)
Added PredictLoop object (#5752)
Added LightningModule.configure_callbacks to enable the definition of model-specific callbacks (#5621)
Added dim to PSNR metric for mean-squared-error reduction (#5957)
Added proximal policy optimization template to pl_examples (#5394)
Added log_graph to CometLogger (#5295)
Added possibility for nested loaders (#5404)
Added sync_step to Wandb logger (#5351)
Added StochasticWeightAveraging callback (#5640)
Added LightningDataModule.from_datasets(...) (#5133)
Added PL_TORCH_DISTRIBUTED_BACKEND env variable to select backend (#5981)
Added Trainer flag to activate Stochastic Weight Averaging (SWA) Trainer(stochastic_weight_avg=True) (#6038)
[1.2.0] - Changed¶
The stat_scores metric now calculates stat scores over all classes and gains new parameters, in line with the new StatScores metric (#4839)
Changed computer_vision_fine_tunning example to use BackboneLambdaFinetuningCallback (#5377)
Changed automatic casting for LoggerConnector metrics (#5218)
Changed iou [func] to allow float input (#4704)
Metric compute() method will no longer automatically call reset() (#5409)
Set PyTorch 1.4 as min requirements, also for testing and examples torchvision>=0.5 and torchtext>=0.5 (#5418)
Changed callbacks argument in Trainer to allow Callback input (#5446)
Changed the default of find_unused_parameters to False in DDP (#5185)
Changed ModelCheckpoint version suffixes to start at 1 (#5008)
Progress bar metrics tensors are now converted to float (#5692)
Changed the default value for the progress_bar_refresh_rate Trainer argument in Google COLAB notebooks to 20 (#5516)
Extended support for purely iteration-based training (#5726)
Made LightningModule.global_rank, LightningModule.local_rank and LightningModule.logger read-only properties (#5730)
Forced ModelCheckpoint callbacks to run after all others to guarantee all states are saved to the checkpoint (#5731)
Refactored Accelerators and Plugins:
Added base classes for plugins (#5715)
Added parallel plugins for DP, DDP, DDPSpawn, DDP2 and Horovod (#5714)
Precision Plugins (#5718)
Added new Accelerators for CPU, GPU and TPU (#5719)
Added RPC and Sharded plugins (#5732)
Added missing LightningModule-wrapper logic to new plugins and accelerator (#5734)
Moved device-specific teardown logic from training loop to accelerator (#5973)
Moved accelerator_connector.py to the connectors subfolder (#6033)
Trainer only references accelerator (#6039)
Made parallel devices optional across all plugins (#6051)
Enabled self.log in callbacks (#5094)
Renamed xxx_AVAILABLE as protected (#5082)
Unified module names in Utils (#5199)
Refactor: clean trainer device & distributed getters (#5300)
Simplified training phase as LightningEnum (#5419)
Updated metrics to use LightningEnum (#5689)
Changed the seq of on_train_batch_end, on_batch_end & on_train_epoch_end, on_epoch_end hooks (#5688)
Refactored setup_training and remove test_mode (#5388)
Disabled training with zero num_training_batches when insufficient limit_train_batches (#5703)
Refactored EpochResultStore (#5522)
Update lr_finder to check for attribute if not running fast_dev_run (#5990)
LightningOptimizer manual optimizer is more flexible and expose toggle_model (#5771)
MlflowLogger limit parameter value length to 250 char (#5893)
Re-introduced fix for Hydra directory sync with multiple process (#5993)
[1.2.0] - Deprecated¶
Function stat_scores_multiple_classes is deprecated in favor of stat_scores (#4839)
Moved accelerators and plugins to its legacy pkg (#5645)
Deprecated LightningDistributedDataParallel in favor of new wrapper module LightningDistributedModule (#5185)
Deprecated LightningDataParallel in favor of new wrapper module LightningParallelModule (#5670)
Renamed utils modules (#5199)
argparse_utils >> argparse
model_utils >> model_helpers
warning_utils >> warnings
xla_device_utils >> xla_device
Deprecated using 'val_loss' to set the ModelCheckpoint monitor (#6012)
Deprecated .get_model() with explicit .lightning_module property (#6035)
Deprecated Trainer attribute accelerator_backend in favor of accelerator (#6034)
[1.2.0] - Removed¶
[1.2.0] - Fixed¶
Fixed distributed setting and ddp_cpu only with num_processes>1 (#5297)
Fixed num_workers for Windows example (#5375)
Fixed loading yaml (#5619)
Fixed support custom DataLoader with DDP if they can be re-instantiated (#5745)
Fixed repeated .fit() calls ignore max_steps iteration bound (#5936)
Fixed throwing MisconfigurationError on unknown mode (#5255)
Resolve bug with Finetuning (#5744)
Fixed ModelCheckpoint race condition in file existence check (#5155)
Fixed some compatibility with PyTorch 1.8 (#5864)
Fixed forward cache (#5895)
Fixed recursive detach of tensors to CPU (#6007)
Fixed passing wrong strings for scheduler interval doesn’t throw an error (#5923)
Fixed wrong requires_grad state after return None with multiple optimizers (#5738)
Fixed add on_epoch_end hook at the end of validation, test epoch (#5986)
Fixed missing process_dataloader call for TPUSpawn when in distributed mode (#6015)
Fixed progress bar flickering by appending 0 to floats/strings (#6009)
Fixed synchronization issues with TPU training (#6027)
Fixed hparams.yaml saved twice when using TensorBoardLogger (#5953)
Fixed fairscale compatible with PT 1.8 (#5996)
Ensured process_dataloader is called when tpu_cores > 1 to use Parallel DataLoader (#6015)
Attempted SLURM auto resume call when non-shell call fails (#6002)
Fixed wrapping optimizers upon assignment (#6006)
Fixed allowing hashing of metrics with lists in their state (#5939)
[1.1.8] - 2021-02-08¶
[1.1.8] - Fixed¶
[1.1.7] - 2021-02-03¶
[1.1.7] - Fixed¶
Fixed TensorBoardLogger not closing SummaryWriter on finalize (#5696)
Fixed filtering of pytorch “unsqueeze” warning when using DP (#5622)
Fixed num_classes argument in F1 metric (#5663)
Fixed log_dir property (#5537)
Fixed a race condition in ModelCheckpoint when checking if a checkpoint file exists (#5144)
Remove unnecessary intermediate layers in Dockerfiles (#5697)
Fixed auto learning rate ordering (#5638)
[1.1.6] - 2021-01-26¶
[1.1.6] - Changed¶
[1.1.6] - Fixed¶
Fixed toggle_optimizer to reset requires_grad state (#5574)
Fixed FileNotFoundError for best checkpoint when using DDP with Hydra (#5629)
Fixed an error when logging a progress bar metric with a reserved name (#5620)
Fixed Metric’s state_dict not included when child modules (#5614)
Fixed Neptune logger creating multiple experiments when GPUs > 1 (#3256)
Fixed duplicate logs appearing in console when using the python logging module (#5509)
Fixed tensor printing in trainer.test() (#5138)
Fixed not using dataloader when hparams present (#4559)
[1.1.5] - 2021-01-19¶
[1.1.5] - Fixed¶
[1.1.4] - 2021-01-12¶
[1.1.4] - Added¶
Add automatic optimization property setter to lightning module (#5169)
[1.1.4] - Changed¶
Changed deprecated enable_pl_optimizer=True (#5244)
[1.1.4] - Fixed¶
Fixed transfer_batch_to_device for DDP with len(devices_ids) == 1 (#5195)
Logging only on not should_accumulate() during training (#5417)
Resolve interpolation bug with Hydra (#5406)
Check environ before selecting a seed to prevent warning message (#4743)
Fixed signature mismatch in model_to_device of DDPCPUHPCAccelerator (#5505)
[1.1.3] - 2021-01-05¶
[1.1.3] - Added¶
[1.1.3] - Changed¶
[1.1.3] - Fixed¶
Fixed trainer.test returning non-test metrics (#5214)
Fixed metric state reset (#5273)
Fixed --num-nodes on DDPSequentialPlugin (#5327)
Fixed invalid value for weights_summary (#5296)
Fixed Trainer.test not using the latest best_model_path (#5161)
Fixed existence check for hparams not using underlying filesystem (#5250)
Fixed LightningOptimizer AMP bug (#5191)
Fixed casted key to string in _flatten_dict (#5354)
[1.1.2] - 2020-12-23¶
[1.1.2] - Added¶
[1.1.2] - Removed¶
enable_pl_optimizer=False by default to temporarily fix AMP issues (#5163)
[1.1.2] - Fixed¶
Metric reduction with Logging (#5150)
Remove nan loss in manual optimization (#5121)
Un-balanced logging properly supported (#5119)
Fix hanging in DDP HPC accelerators (#5157)
Fix reset TensorRunningAccum (#5106)
Updated DALIClassificationLoader to not use deprecated arguments (#4925)
Corrected call to torch.no_grad (#5124)
[1.1.1] - 2020-12-15¶
[1.1.1] - Added¶
Add a notebook example to reach a quick baseline of ~94% accuracy on CIFAR10 using Resnet in Lightning (#4818)
[1.1.1] - Changed¶
[1.1.1] - Removed¶
[1.1.1] - Fixed¶
Fixed trainer by default None in DDPAccelerator (#4915)
Fixed LightningOptimizer to expose optimizer attributes (#5095)
Do not warn when the name key is used in the lr_scheduler dict (#5057)
Check if optimizer supports closure (#4981)
Add deprecated metric utility functions back to functional (#5067, #5068)
Allow any input in to_onnx and to_torchscript (#4378)
Fixed DDPHPCAccelerator hangs in DDP construction by calling init_device (#5157)
[1.1.0] - 2020-12-09¶
[1.1.0] - Added¶
Added “monitor” key to saved ModelCheckpoints (#4383)
Added ConfusionMatrix class interface (#4348)
Added multiclass AUROC metric (#4236)
Added global step indexing to the checkpoint name for a better sub-epoch checkpointing experience (#3807)
Added optimizer hooks in callbacks (#4379)
Added option to log momentum (#4384)
Added
current_score
toModelCheckpoint.on_save_checkpoint
(#4721)Added logging using
self.log
in train and evaluation for epoch end hooks ( #4552, #4495, #4439, #4684, #4913)Added ability for DDP plugin to modify optimizer state saving (#4675)
Added
prefix
argument in loggers (#4557)Added printing of total num of params, trainable and non-trainable params in ModelSummary (#4521)
Added
PrecisionRecallCurve, ROC, AveragePrecision
class metric (#4549)Added custom
Apex
andNativeAMP
asPrecision plugins
(#4355)Added
DALI MNIST
example (#3721)Added
sharded plugin
for DDP for multi-gpu training memory optimizations ( #4639, #4686, #4737, #4773)Added
experiment_id
to the NeptuneLogger (#3462)Added
Pytorch Geometric
integration example with Lightning (#4568)Added
all_gather
method toLightningModule
which allows gradient based tensor synchronizations for use-cases such as negative sampling. (#5012)Enabled
self.log
in most functions (#4969)Added changeable extension variable for
ModelCheckpoint
(#4977)
[1.1.0] - Changed¶
Tuner algorithms will be skipped if `fast_dev_run=True` (#3903)
`WandbLogger` does not force wandb `reinit` arg to True anymore and creates a run only when needed (#4648)
Changed `automatic_optimization` to be a model attribute (#4602)
Changed `Simple Profiler` report to order by percentage time spent + num calls (#4880)
Simplify optimization Logic (#4984)
Classification metrics overhaul (#4837)
Updated `fast_dev_run` to accept integer representing num_batches (#4629)
Refactored optimizer (#4658)
[1.1.0] - Deprecated¶
[1.1.0] - Removed¶
[1.1.0] - Fixed¶
Added feature to move tensors to CPU before saving (#4309)
Fixed `LoggerConnector` to have logged metrics on root device in DP (#4138)
Auto convert tensors to contiguous format when `gather_all` (#4907)
Fixed `PYTHONPATH` for ddp test model (#4528)
Fixed allowing logger to support indexing (#4595)
Fixed DDP and manual_optimization (#4976)
[1.0.8] - 2020-11-24¶
[1.0.8] - Added¶
[1.0.8] - Changed¶
Consistently use `step=trainer.global_step` in `LearningRateMonitor` independently of `logging_interval` (#4376)
Metric states are no longer added to `state_dict` by default (#4685)
Renamed class metric `Fbeta` >> `FBeta` (#4656)
Model summary: add 1 decimal place (#4745)
Do not override `PYTHONWARNINGS` (#4700)
Changed `init_ddp_connection` moved from `DDP` to `DDPPlugin` (#4407)
[1.0.8] - Fixed¶
Fixed checkpoint `hparams` dict casting when `omegaconf` is available (#4770)
Fixed incomplete progress bars when total batches not divisible by refresh rate (#4577)
Updated SSIM metric (#4566)
Fixed batch_arg_name - add `batch_arg_name` to all calls to `_adjust_batch_size` bug (#4812)
Fixed `torchtext` data to GPU (#4785)
Fixed a crash bug in MLFlow logger (#4716)
[1.0.7] - 2020-11-17¶
[1.0.7] - Added¶
Added lambda closure to `manual_optimizer_step` (#4618)
[1.0.7] - Changed¶
[1.0.7] - Fixed¶
Prevent crash if `sync_dist=True` on CPU (#4626)
Fixed average pbar Metrics (#4534)
Fixed `setup` callback hook to correctly pass the LightningModule through (#4608)
Allowing decorate model init with saving `hparams` inside (#4662)
Fixed `split_idx` set by `LoggerConnector` in `on_trainer_init` to `Trainer` (#4697)
[1.0.6] - 2020-11-11¶
[1.0.6] - Added¶
Added metrics aggregation in Horovod and fixed early stopping (#3775)
Added `manual_optimizer_step` which works with `AMP Native` and `accumulated_grad_batches` (#4485)
Added `persistent(mode)` method to metrics, to enable and disable metric states being added to `state_dict` (#4482)
Added congratulations at the end of our notebooks (#4555)
Added parameter `move_metrics_to_cpu` in Trainer to disable gpu leak (#4592)
[1.0.6] - Changed¶
[1.0.6] - Fixed¶
Fixed feature-lack in `hpc_load` (#4526)
Fixed metrics states being overridden in DDP mode (#4482)
Fixed `lightning_getattr`, `lightning_hasattr` not finding the correct attributes in datamodule (#4347)
Fixed automatic optimization AMP by `manual_optimization_step` (#4485)
Replace `MisconfigurationException` with warning in `ModelCheckpoint` Callback (#4560)
Fixed logged keys in mlflow logger (#4412)
Fixed `is_picklable` by catching `AttributeError` (#4508)
Fixed multi test dataloaders dict `AttributeError` error (#4480)
Fixed show progress bar only for `progress_rank 0` on `DDP_SLURM` (#4437)
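The `is_picklable` entry above (#4508) concerns which exceptions a picklability check has to catch. The following is a minimal standalone sketch of that idea, not the library's implementation: the function body, the extra `TypeError` handling, and the helper `make_local_function` are assumptions made for illustration.

```python
import pickle


def is_picklable(obj: object) -> bool:
    """Return True if pickle.dumps(obj) succeeds (illustrative sketch only)."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PickleError, AttributeError, TypeError):
        # Depending on the object and the Python version, an unpicklable
        # value can surface as any of these, so a handler that only catches
        # PickleError can let some failures escape.
        return False


def make_local_function():
    """Hypothetical helper: returns a nested function, which pickle rejects."""
    def inner():
        return 0
    return inner


print(is_picklable({"lr": 0.01}))
print(is_picklable(make_local_function()))
```

Nested functions are pickled by qualified name, which cannot be resolved for a local object, so the second call takes the exception path.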
[1.0.5] - 2020-11-03¶
[1.0.5] - Added¶
[1.0.5] - Changed¶
W&B log in sync with `Trainer` step (#4405)
Hook `on_after_backward` is called only when `optimizer_step` is being called (#4439)
Moved `track_and_norm_grad` into `training loop` and called only when `optimizer_step` is being called (#4439)
Changed type checker with explicit cast of `ref_model` object (#4457)
Changed `distributed_backend` -> `accelerator` (#4429)
[1.0.5] - Deprecated¶
Deprecated passing `ModelCheckpoint` instance to `checkpoint_callback` Trainer argument (#4336)
[1.0.5] - Fixed¶
Disable saving checkpoints if not trained (#4372)
Fixed error using `auto_select_gpus=True` with `gpus=-1` (#4209)
Disabled training when `limit_train_batches=0` (#4371)
Fixed that metrics do not store computational graph for all seen data (#4313)
Fixed AMP unscale for `on_after_backward` (#4439)
Fixed TorchScript export when module includes Metrics (#4428)
Fixed TorchScript trace method’s data to device and docstring (#4360)
Fixed CSV logger warning (#4419)
Fixed skip DDP parameter sync (#4301)
Fixed `WandbLogger` _sanitize_callable function (#4422)
Fixed `AMP Native` `_unscale` gradient (#4441)
[1.0.4] - 2020-10-27¶
[1.0.4] - Added¶
Added `dirpath` and `filename` parameter in `ModelCheckpoint` (#4213)
Added plugins docs and DDPPlugin to customize ddp across all accelerators (#4258)
Added `strict` option to the scheduler dictionary (#3586)
Added `fsspec` support for profilers (#4162)
Added autogenerated helptext to `Trainer.add_argparse_args` (#4344)
Added support for string values in `Trainer`’s `profiler` parameter (#3656)
Added `optimizer_closure` to `optimizer.step` when supported (#4190)
Added unification of regression metrics (#4166)
Added checkpoint load from Bytes (#4314)
[1.0.4] - Changed¶
[1.0.4] - Deprecated¶
[1.0.4] - Fixed¶
Fixed setting device ids in DDP (#4297)
Fixed synchronization of best model path in `ddp_accelerator` (#4323)
Fixed `WandbLogger` not uploading checkpoint artifacts at the end of training (#4341)
Fixed `FBeta` computation (#4183)
Fixed `accumulation across batches` has completed `before breaking training loop` (#4278)
Fixed `ModelCheckpoint` don’t increase current_epoch and global_step when not training (#4291)
Fixed `COMET_EXPERIMENT_KEY` environment variable usage in comet logger (#4230)
[1.0.3] - 2020-10-20¶
[1.0.3] - Added¶
Added persistent flag to `Metric.add_state` (#4195)
[1.0.3] - Changed¶
[1.0.3] - Fixed¶
[1.0.2] - 2020-10-15¶
[1.0.2] - Added¶
Added trace functionality to the function `to_torchscript` (#4142)
[1.0.2] - Changed¶
Called `on_load_checkpoint` before loading `state_dict` (#4057)
[1.0.2] - Removed¶
Removed duplicate metric vs step log for train loop (#4173)
[1.0.2] - Fixed¶
[1.0.1] - 2020-10-14¶
[1.0.1] - Added¶
Added getstate/setstate method for torch.save serialization (#4127)
[1.0.0] - 2020-10-13¶
[1.0.0] - Added¶
Added Explained Variance Metric + metric fix (#4013)
Added Metric <-> Lightning Module integration tests (#4008)
Added parsing OS env vars in `Trainer` (#4022)
Added classification metrics (#4043)
Updated explained variance metric (#4024)
Enabled plugins (#4041)
Enabled custom clusters (#4048)
Enabled passing in custom accelerators (#4050)
Added `LightningModule.toggle_optimizer` (#4058)
Added `LightningModule.manual_backward` (#4063)
Added `output` argument to `*_epoch_end` hooks (#3967)
[1.0.0] - Changed¶
[1.0.0] - Removed¶
Removed support for EvalResult and TrainResult (#3968)
Removed deprecated trainer flags: `overfit_pct`, `log_save_interval`, `row_log_interval` (#3969)
Removed deprecated early_stop_callback (#3982)
Removed deprecated model hooks (#3980)
Removed deprecated callbacks (#3979)
Removed `trainer` argument in `LightningModule.backward` (#4056)
[1.0.0] - Fixed¶
[0.10.0] - 2020-10-07¶
[0.10.0] - Added¶
Enable PyTorch 1.7 compatibility (#3541)
Added `LightningModule.to_torchscript` to support exporting as `ScriptModule` (#3258)
Added warning when dropping unpicklable `hparams` (#2874)
Added EMB similarity (#3349)
Added `ModelCheckpoint.to_yaml` method (#3048)
Allow `ModelCheckpoint` monitor to be `None`, meaning it will always save (#3630)
Disabled optimizers setup during testing (#3059)
Added support for datamodules to save and load checkpoints when training (#3563)
Added support for datamodule in learning rate finder (#3425)
Added gradient clip test for native AMP (#3754)
Added dist lib to enable syncing anything across devices (#3762)
Added `broadcast` to `TPUBackend` (#3814)
Added `XLADeviceUtils` class to check XLA device type (#3274)
[0.10.0] - Changed¶
Refactored accelerator backends:
moved TPU `xxx_step` to backend (#3118)
refactored DDP backend `forward` (#3119)
refactored GPU backend `__step` (#3120)
remove obscure forward call in eval + CPU backend `___step` (#3123)
reduced all simplified forward (#3126)
added hook base method (#3127)
refactor eval loop to use hooks - use `test_mode` for if so we can split later (#3129)
moved `___step_end` hooks (#3130)
training forward refactor (#3134)
training AMP scaling refactor (#3135)
eval step scaling factor (#3136)
add eval loop object to streamline eval loop (#3138)
refactored dataloader process hook (#3139)
refactored inner eval loop (#3141)
final inner eval loop hooks (#3154)
clean up hooks in `run_evaluation` (#3156)
clean up data reset (#3161)
expand eval loop out (#3165)
moved hooks around in eval loop (#3195)
remove `_evaluate` fx (#3197)
`Trainer.fit` hook clean up (#3198)
DDPs train hooks (#3203)
reduced accelerator selection (#3211)
group prepare data hook (#3212)
added data connector (#3285)
modular is_overridden (#3290)
adding `Trainer.tune()` (#3293)
move `run_pretrain_routine` -> `setup_training` (#3294)
move train outside of setup training (#3297)
move `prepare_data` to data connector (#3307)
moved accelerator router (#3309)
train loop refactor - moving train loop to own object (#3310, #3312, #3313, #3314)
duplicate data interface definition up into DataHooks class (#3344)
inner train loop (#3359, #3361, #3362, #3363, #3365, #3366, #3367, #3368, #3369, #3370, #3371, #3372, #3373, #3374, #3375, #3376, #3385, #3388, #3397)
all logging related calls in a connector (#3395)
added model connector (#3407)
moved eval loop logging to loggers (#3408)
moved eval loop (#3412, #3408)
move `lr_finder` (#3434)
move specific accelerator code (#3457)
group connectors (#3472)
apex plugin (#3502)
precision plugins (#3504)
Result - make monitor default to `checkpoint_on` to simplify (#3571)
reference to the Trainer on the `LightningDataModule` (#3684)
add `.log` to lightning module (#3686, #3699, #3701, #3704, #3715)
enable tracking original metric when step and epoch are both true (#3685)
deprecated results obj, added support for simpler comms (#3681)
move backends back to individual files (#3712)
fixes logging for eval steps (#3763)
decoupled DDP, DDP spawn (#3733, #3766, #3767, #3774, #3802, #3806, #3817, #3819, #3927)
remove weight loading hack for ddp_cpu (#3808)
separate `torchelastic` from DDP (#3810)
separate SLURM from DDP (#3809)
decoupled DDP2 (#3816)
bug fix with logging val epoch end + monitor (#3812)
callback system and init DDP (#3836)
epoch can now log independently (#3843)
test selecting the correct backend. temp backends while slurm and TorchElastic are decoupled (#3848)
fixed `init_slurm_connection` causing hostname errors (#3856)
moves init apex from LM to apex connector (#3923)
moves sync bn to each backend (#3925)
moves configure ddp to each backend (#3924)
Deprecation warning (#3844)
Changed `LearningRateLogger` to `LearningRateMonitor` (#3251)
Used `fsspec` instead of `gfile` for all IO (#3320)
Swapped `torch.load` for `fsspec` load in DDP spawn backend (#3787)
Swapped `torch.load` for `fsspec` load in cloud_io loading (#3692)
Added support for `to_disk()` to use remote filepaths with `fsspec` (#3930)
Updated model_checkpoint’s to_yaml to use `fsspec` open (#3801)
Fixed `fsspec` is inconsistent when doing `fs.ls` (#3805)
Refactor `GPUStatsMonitor` to improve training speed (#3257)
Changed IoU score behavior for classes absent in target and pred (#3098)
Changed IoU `remove_bg` bool to `ignore_index` optional int (#3098)
Changed defaults of `save_top_k` and `save_last` to `None` in ModelCheckpoint (#3680)
`row_log_interval` and `log_save_interval` are now based on training loop’s `global_step` instead of epoch-internal batch index (#3667)
Silenced some warnings. verified ddp refactors (#3483)
Cleaning up stale logger tests (#3490)
Allow `ModelCheckpoint` monitor to be `None` (#3633)
Enable `None` model checkpoint default (#3669)
Skipped `best_model_path` if `checkpoint_callback` is `None` (#2962)
Used `raise .. from ..` to explicitly chain exceptions (#3750)
Mocking loggers (#3596, #3617, #3851, #3859, #3884, #3853, #3910, #3889, #3926)
Write predictions in LightningModule instead of EvalResult (#3882)
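For readers unfamiliar with the `raise .. from ..` form mentioned in the list above (#3750), here is a small sketch of explicit exception chaining in plain Python. `ConfigError` and `parse_batch_size` are hypothetical names invented for this illustration and are not part of the library.

```python
class ConfigError(Exception):
    """Hypothetical error type used only for this illustration."""


def parse_batch_size(raw: str) -> int:
    """Hypothetical helper: convert a string setting to an int."""
    try:
        return int(raw)
    except ValueError as err:
        # `from err` stores the original exception on __cause__, so the
        # traceback shows both errors and states that one caused the other.
        raise ConfigError(f"invalid batch size: {raw!r}") from err


try:
    parse_batch_size("sixteen")
except ConfigError as exc:
    cause = exc.__cause__
    print(type(cause).__name__)
```

Without the `from` clause Python still records the earlier error as implicit context, but the explicit form documents that the wrapping was intentional.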
[0.10.0] - Deprecated¶
Deprecated `TrainResult` and `EvalResult`, use `self.log` and `self.write` from the `LightningModule` to log metrics and write predictions. `training_step` can now only return a scalar (for the loss) or a dictionary with anything you want. (#3681)
Deprecate `early_stop_callback` Trainer argument (#3845)
Rename Trainer arguments `row_log_interval` >> `log_every_n_steps` and `log_save_interval` >> `flush_logs_every_n_steps` (#3748)
[0.10.0] - Removed¶
Removed experimental Metric API (#3943, #3949, #3946), listed changes before final removal:
Added hooks to metric module interface (#2528)
Added error when AUROC metric is used for multiclass problems (#3350)
Fixed `ModelCheckpoint` with `save_top_k=-1` option not tracking the best models when a monitor metric is available (#3735)
Fixed counter-intuitive error being thrown in `Accuracy` metric for zero target tensor (#3764)
Fixed aggregation of metrics (#3517)
Fixed Metric aggregation (#3321)
Fixed RMSLE metric (#3188)
Renamed `reduction` to `class_reduction` in classification metrics (#3322)
Changed `class_reduction` similar to sklearn for classification metrics (#3322)
Renaming of precision recall metric (#3308)
[0.10.0] - Fixed¶
Fixed `on_train_batch_start` hook to end epoch early (#3700)
Fixed `num_sanity_val_steps` is clipped to `limit_val_batches` (#2917)
Fixed ONNX model save on GPU (#3145)
Fixed `GpuUsageLogger` to work on different platforms (#3008)
Fixed auto-scale batch size not dumping `auto_lr_find` parameter (#3151)
Fixed `batch_outputs` with optimizer frequencies (#3229)
Fixed setting batch size in `LightningModule.datamodule` when using `auto_scale_batch_size` (#3266)
Fixed Horovod distributed backend compatibility with native AMP (#3404)
Fixed batch size auto scaling exceeding the size of the dataset (#3271)
Fixed getting `experiment_id` from MLFlow only once instead of each training loop (#3394)
Fixed `overfit_batches` which now correctly disables shuffling for the training loader. (#3501)
Fixed gradient norm tracking for `row_log_interval > 1` (#3489)
Fixed `ModelCheckpoint` name formatting (#3164)
Fixed example implementation of AutoEncoder (#3190)
Fixed invalid paths when remote logging with TensorBoard (#3236)
Fixed change `t()` to `transpose()` as XLA devices do not support `.t()` on 1-dim tensor (#3252)
Fixed (weights only) checkpoints loading without PL (#3287)
Fixed `gather_all_tensors` cross GPUs in DDP (#3319)
Fixed CometML save dir (#3419)
Fixed forward key metrics (#3467)
Fixed normalize mode at confusion matrix (replace NaNs with zeros) (#3465)
Fixed global step increment in training loop when `training_epoch_end` hook is used (#3673)
Fixed dataloader shuffling not getting turned off with `overfit_batches > 0` and `distributed_backend = "ddp"` (#3534)
Fixed determinism in `DDPSpawnBackend` when using `seed_everything` in main process (#3335)
Fixed `ModelCheckpoint` `period` to actually save every `period` epochs (#3630)
Fixed `val_progress_bar` total with `num_sanity_val_steps` (#3751)
Fixed Tuner dump: add `current_epoch` to dumped_params (#3261)
Fixed `current_epoch` and `global_step` properties mismatch between `Trainer` and `LightningModule` (#3785)
Fixed learning rate scheduler for optimizers with internal state (#3897)
Fixed `tbptt_reduce_fx` when non-floating tensors are logged (#3796)
Fixed model checkpoint frequency (#3852)
Fixed logging non-tensor scalar with result breaks subsequent epoch aggregation (#3855)
Fixed `TrainerEvaluationLoopMixin` activates `model.train()` at the end (#3858)
Fixed `overfit_batches` when using with multiple val/test_dataloaders (#3857)
Fixed enables `training_step` to return `None` (#3862)
Fixed init nan for checkpointing (#3863)
Fixed for `load_from_checkpoint` (#2776)
Fixes incorrect `batch_sizes` when Dataloader returns a dict with multiple tensors (#3668)
Fixed unexpected signature for `validation_step` (#3947)
[0.9.0] - 2020-08-20¶
[0.9.0] - Added¶
Added basic `CSVLogger` (#2721)
Added SSIM metrics (#2671)
Added BLEU metrics (#2535)
Added support to export a model to ONNX format (#2596)
Added support for `Trainer(num_sanity_val_steps=-1)` to check all validation data before training (#2246)
Added struct. output:
Added class `LightningDataModule` (#2668)
Added support for PyTorch 1.6 (#2745)
Added call DataModule hooks implicitly in trainer (#2755)
Added support for Mean in DDP Sync (#2568)
Added remaining `sklearn` metrics: `AveragePrecision`, `BalancedAccuracy`, `CohenKappaScore`, `DCG`, `Hamming`, `Hinge`, `Jaccard`, `MeanAbsoluteError`, `MeanSquaredError`, `MeanSquaredLogError`, `MedianAbsoluteError`, `R2Score`, `MeanPoissonDeviance`, `MeanGammaDeviance`, `MeanTweedieDeviance`, `ExplainedVariance` (#2562)
Added support for `limit_{mode}_batches (int)` to work with infinite dataloader (IterableDataset) (#2840)
Added support returning python scalars in DP (#1935)
Added support to Tensorboard logger for OmegaConf `hparams` (#2846)
Added tracking of basic states in `Trainer` (#2541)
Tracks all outputs including TBPTT and multiple optimizers (#2890)
Added GPU Usage Logger (#2932)
Added `strict=False` for `load_from_checkpoint` (#2819)
Added saving test predictions on multiple GPUs (#2926)
Auto log the computational graph for loggers that support this (#3003)
Added warning when changing monitor and using results obj (#3014)
Added a hook `transfer_batch_to_device` to the `LightningDataModule` (#3038)
[0.9.0] - Changed¶
Truncated long version numbers in progress bar (#2594)
Enabling val/test loop disabling (#2692)
Refactored into `accelerator` module:
Using `.comet.config` file for `CometLogger` (#1913)
Updated hooks arguments - breaking for `setup` and `teardown` (#2850)
Using `gfile` to support remote directories (#2164)
Moved optimizer creation after device placement for DDP backends (#2904)
Support `**DictConfig` for `hparam` serialization (#2519)
Removed callback metrics from test results obj (#2994)
Re-enabled naming metrics in ckpt name (#3060)
Changed progress bar epoch counting to start from 0 (#3061)
[0.9.0] - Deprecated¶
Deprecated Trainer attribute `ckpt_path`, which will now be set by `weights_save_path` (#2681)
[0.9.0] - Removed¶
Removed deprecated: (#2760)
core decorator `data_loader`
Module hook `on_sanity_check_start` and loading `load_from_metrics`
package `pytorch_lightning.logging`
Trainer arguments: `show_progress_bar`, `num_tpu_cores`, `use_amp`, `print_nan_grads`
LR Finder argument `num_accumulation_steps`
[0.9.0] - Fixed¶
Fixed `accumulate_grad_batches` for last batch (#2853)
Fixed setup call while testing (#2624)
Fixed local rank zero casting (#2640)
Fixed single scalar return from training (#2587)
Fixed Horovod backend to scale LR schedulers with the optimizer (#2626)
Fixed `dtype` and `device` properties not getting updated in submodules (#2657)
Fixed `fast_dev_run` to run for all dataloaders (#2581)
Fixed `save_dir` in loggers getting ignored by default value of `weights_save_path` when user did not specify `weights_save_path` (#2681)
Fixed `weights_save_path` getting ignored when `logger=False` is passed to Trainer (#2681)
Fixed TPU multi-core and Float16 (#2632)
Fixed test metrics not being logged with `LoggerCollection` (#2723)
Fixed data transfer to device when using `torchtext.data.Field` and `include_lengths is True` (#2689)
Fixed shuffle argument for distributed sampler (#2789)
Fixed logging interval (#2694)
Fixed loss value in the progress bar is wrong when `accumulate_grad_batches > 1` (#2738)
Fixed correct CWD for ddp sub-processes when using Hydra (#2719)
Fixed selecting GPUs using `CUDA_VISIBLE_DEVICES` (#2739)
Fixed false `num_classes` warning in metrics (#2781)
Fixed shell injection vulnerability in subprocess call (#2786)
Fixed LR finder and `hparams` compatibility (#2821)
Fixed `ModelCheckpoint` not saving the latest information when `save_last=True` (#2881)
Fixed ImageNet example: learning rate scheduler, number of workers and batch size when using DDP (#2889)
Fixed apex gradient clipping (#2829)
Fixed save apex scaler states (#2828)
Fixed a model loading issue with inheritance and variable positional arguments (#2911)
Fixed passing `non_blocking=True` when transferring a batch object that does not support it (#2910)
Fixed checkpointing to remote file paths (#2925)
Fixed adding val step argument to metrics (#2986)
Fixed an issue that caused `Trainer.test()` to stall in ddp mode (#2997)
Fixed gathering of results with tensors of varying shape (#3020)
Fixed batch size auto-scaling feature to set the new value on the correct model attribute (#3043)
Fixed automatic batch scaling not working with half precision (#3045)
Fixed setting device to root gpu (#3042)
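The shell injection entry in the list above (#2786) does not say how the subprocess call was changed. As a general illustration only, and not a description of that PR, the sketch below shows the usual safe pattern: pass the command as a list so that no shell ever parses the arguments. The value `user_value` is a made-up example of untrusted input.

```python
import subprocess
import sys

# Hypothetical untrusted value, e.g. a path that arrived from user input.
user_value = "data; echo INJECTED"

# Passing the command as a list (and leaving shell=False, the default) hands
# each element to the child process as one literal argument, so the ";" is
# never interpreted by a shell.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", user_value],
    capture_output=True,
    text=True,
    check=True,
)
echoed = result.stdout.strip()
print(echoed)
```

Building one command string and running it with `shell=True` would instead let the shell treat `;` as a command separator.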
[0.8.5] - 2020-07-09¶
[0.8.5] - Added¶
[0.8.5] - Removed¶
Removed auto val reduce (#2462)
[0.8.5] - Fixed¶
Flattening Wandb Hyperparameters (#2459)
Fixed using the same DDP python interpreter and actually running (#2482)
Fixed model summary input type conversion for models that have input dtype different from model parameters (#2510)
Made `TensorBoardLogger` and `CometLogger` pickleable (#2518)
Fixed a problem with `MLflowLogger` creating multiple run folders (#2502)
Fixed global_step increment (#2455)
Fixed TPU hanging example (#2488)
Fixed `argparse` default value bug (#2526)
Fixed Dice and IoU to avoid NaN by adding small eps (#2545)
Fixed accumulate gradients schedule at epoch 0 (continued) (#2513)
Fixed Trainer `.fit()` returning last not best weights in “ddp_spawn” (#2565)
Fixed passing (do not pass) TPU weights back on test (#2566)
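The Dice and IoU entry above (#2545) refers to guarding a ratio against an empty denominator. The sketch below shows the idea on plain Python sets. The function `iou`, its signature, and the placement of `eps` in the denominator are assumptions for illustration; the library's metric operates on tensors and may place the term differently.

```python
def iou(pred: set, target: set, eps: float = 1e-6) -> float:
    """Intersection over union of two label sets (illustrative sketch only)."""
    intersection = len(pred & target)
    union = len(pred | target)
    # With tensors, 0 / 0 evaluates to NaN; in plain Python it raises
    # ZeroDivisionError. A small eps in the denominator avoids both cases.
    return intersection / (union + eps)


print(iou(set(), set()))
print(iou({1, 2}, {2, 3}))
```

The cost of this guard is a tiny bias in every score, which is why `eps` is kept several orders of magnitude below the values being compared.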
[0.8.4] - 2020-07-01¶
[0.8.4] - Added¶
[0.8.4] - Changed¶
Enabled no returns from eval (#2446)
[0.8.4] - Fixed¶
[0.8.3] - 2020-06-29¶
[0.8.3] - Fixed¶
[0.8.2] - 2020-06-28¶
[0.8.2] - Added¶
Added TorchText support for moving data to GPU (#2379)
[0.8.2] - Changed¶
[0.8.2] - Removed¶
Moved `TrainsLogger` to Bolts (#2384)
[0.8.2] - Fixed¶
Fixed parsing TPU arguments and TPU tests (#2094)
Fixed number batches in case of multiple dataloaders and `limit_{*}_batches` (#1920, #2226)
Fixed an issue with forward hooks not being removed after model summary (#2298)
Fix for `load_from_checkpoint()` not working with absolute path on Windows (#2294)
Fixed an issue how _has_len handles `NotImplementedError` e.g. raised by `torchtext.data.Iterator` (#2293), (#2307)
Fixed `average_precision` metric (#2319)
Fixed ROC metric for CUDA tensors (#2304)
Fixed lost compatibility with custom datatypes implementing `.to` (#2335)
Fixed loading model with kwargs (#2387)
Fixed sum(0) for `trainer.num_val_batches` (#2268)
Fixed checking if the parameters are a `DictConfig` Object (#2216)
Fixed SLURM weights saving (#2341)
Fixed swaps LR scheduler order (#2356)
Fixed adding tensorboard `hparams` logging test (#2342)
Fixed use model ref for tear down (#2360)
Fixed logger crash on DDP (#2388)
Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
Fixed loading past checkpoints from v0.7.x (#2405)
Fixed loading model without arguments (#2403)
Fixed Windows compatibility issue (#2358)
[0.8.1] - 2020-06-19¶
[0.8.1] - Fixed¶
[0.8.0] - 2020-06-18¶
[0.8.0] - Added¶
Added `overfit_batches`, `limit_{val|test}_batches` flags (overfit now uses training set for all three) (#2213)
Added metrics
Allow dataloaders without sampler field present (#1907)
Added option `save_last` to save the model at the end of every epoch in `ModelCheckpoint` (#1908)
Early stopping checks `on_validation_end` (#1458)
Speed up single-core TPU training by loading data using `ParallelLoader` (#2033)
Added a model hook `transfer_batch_to_device` that enables moving custom data structures to the target device (#1756)
Added black formatter for the code with code-checker on pull (#1610)
Added back the slow spawn ddp implementation as `ddp_spawn` (#2115)
Added loading checkpoints from URLs (#1667)
Added a callback method `on_keyboard_interrupt` for handling KeyboardInterrupt events during training (#2134)
Added a decorator `auto_move_data` that moves data to the correct device when using the LightningModule for inference (#1905)
Added `ckpt_path` option to `LightningModule.test(...)` to load particular checkpoint (#2190)
Added `setup` and `teardown` hooks for model (#2229)
[0.8.0] - Changed¶
Allow user to select individual TPU core to train on (#1729)
Removed non-finite values from loss in `LRFinder` (#1862)
Allow passing model hyperparameters as complete kwarg list (#1896)
Renamed `ModelCheckpoint`’s attributes `best` to `best_model_score` and `kth_best_model` to `kth_best_model_path` (#1799)
Re-Enable Logger’s `ImportError`s (#1938)
Changed the default value of the Trainer argument `weights_summary` from `full` to `top` (#2029)
Raise an error when lightning replaces an existing sampler (#2020)
Enabled `prepare_data` from correct processes - clarify local vs global rank (#2166)
Remove explicit flush from tensorboard logger (#2126)
Changed epoch indexing from 1 instead of 0 (#2206)
[0.8.0] - Deprecated¶
Deprecated flags: (#2213)
`overfit_pct` in favour of `overfit_batches`
`val_percent_check` in favour of `limit_val_batches`
`test_percent_check` in favour of `limit_test_batches`
Deprecated `ModelCheckpoint`’s attributes `best` and `kth_best_model` (#1799)
Dropped official support/testing for older PyTorch versions <1.3 (#1917)
Deprecated Trainer `proc_rank` in favour of `global_rank` (#2166, #2269)
[0.8.0] - Removed¶
Removed unintended Trainer argument `progress_bar_callback`, the callback should be passed in by `Trainer(callbacks=[...])` instead (#1855)
Removed obsolete `self._device` in Trainer (#1849)
Removed deprecated API (#2073)
Packages: `pytorch_lightning.pt_overrides`, `pytorch_lightning.root_module`
Modules: `pytorch_lightning.logging.comet_logger`, `pytorch_lightning.logging.mlflow_logger`, `pytorch_lightning.logging.test_tube_logger`, `pytorch_lightning.overrides.override_data_parallel`, `pytorch_lightning.core.model_saving`, `pytorch_lightning.core.root_module`
Trainer arguments: `add_row_log_interval`, `default_save_path`, `gradient_clip`, `nb_gpu_nodes`, `max_nb_epochs`, `min_nb_epochs`, `nb_sanity_val_steps`
Trainer attributes: `nb_gpu_nodes`, `num_gpu_nodes`, `gradient_clip`, `max_nb_epochs`, `min_nb_epochs`, `nb_sanity_val_steps`, `default_save_path`, `tng_tqdm_dic`
[0.8.0] - Fixed¶
Run graceful training teardown on interpreter exit (#1631)
Fixed user warning when apex was used together with learning rate schedulers (#1873)
Fixed multiple calls of `EarlyStopping` callback (#1863)
Fixed an issue with `Trainer.from_argparse_args` when passing in unknown Trainer args (#1932)
Fixed bug related to logger not being reset correctly for model after tuner algorithms (#1933)
Fixed root node resolution for SLURM cluster with dash in host name (#1954)
Fixed `LearningRateLogger` in multi-scheduler setting (#1944)
Fixed test configuration check and testing (#1804)
Fixed an issue with Trainer constructor silently ignoring unknown/misspelled arguments (#1820)
Fixed `save_weights_only` in ModelCheckpoint (#1780)
Allow use of same `WandbLogger` instance for multiple training loops (#2055)
Fixed an issue with `_auto_collect_arguments` collecting local variables that are not constructor arguments and not working for signatures that have the instance not named `self` (#2048)
Fixed mistake in parameters’ grad norm tracking (#2012)
Fixed CPU and hanging GPU crash (#2118)
Fixed an issue with the model summary and `example_input_array` depending on a specific ordering of the submodules in a LightningModule (#1773)
Fixed Tpu logging (#2230)
[0.7.6] - 2020-05-16¶
[0.7.6] - Added¶
Added callback for logging learning rates (#1498)
Added transfer learning example (for a binary classification task in computer vision) (#1564)
Added type hints in `Trainer.fit()` and `Trainer.test()` to reflect that also a list of dataloaders can be passed in (#1723).
Added auto scaling of batch size (#1638)
The progress bar metrics now also get updated in `training_epoch_end` (#1724)
Enable `NeptuneLogger` to work with `distributed_backend=ddp` (#1753)
Added option to provide seed to random generators to ensure reproducibility (#1572)
Added override for hparams in `load_from_ckpt` (#1797)
Added support multi-node distributed execution under `torchelastic` (#1811, #1818)
Added dummy logger for internally disabling logging for some features (#1836)
[0.7.6] - Changed¶
Enable `non-blocking` for device transfers to GPU (#1843)
Replace meta_tags.csv with hparams.yaml (#1271)
Reduction when `batch_size < num_gpus` (#1609)
Updated LightningTemplateModel to look more like Colab example (#1577)
Don’t convert `namedtuple` to `tuple` when transferring the batch to target device (#1589)
Allow passing hparams as keyword argument to LightningModule when loading from checkpoint (#1639)
Args should come after the last positional argument (#1807)
Made ddp the default if no backend specified with multiple GPUs (#1789)
[0.7.6] - Deprecated¶
Deprecated `tags_csv` in favor of `hparams_file` (#1271)
[0.7.6] - Fixed¶
Fixed broken link in PR template (#1675)
Fixed ModelCheckpoint not None checking filepath (#1654)
Trainer now calls `on_load_checkpoint()` when resuming from a checkpoint (#1666)
Fixed sampler logic for ddp with iterable dataset (#1734)
Fixed `_reset_eval_dataloader()` for IterableDataset (#1560)
Fixed Horovod distributed backend to set the `root_gpu` property (#1669)
Fixed wandb logger `global_step` affects other loggers (#1492)
Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
Fixed bugs that prevent lr finder to be used together with early stopping and validation dataloaders (#1676)
Fixed a bug in Trainer that prepended the checkpoint path with `version_` when it shouldn’t (#1748)
Fixed lr key name in case of param groups in LearningRateLogger (#1719)
Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
Fixed num processes wasn’t being set properly and auto sampler was ddp failing (#1819)
Fixed bugs in semantic segmentation example (#1824)
Fixed saving native AMP scaler state (#1777)
Fixed native amp + ddp (#1788)
Fixed `hparam` logging with metrics (#1647)
[0.7.5] - 2020-04-27¶
[0.7.5] - Changed¶
Allow logging of metrics together with `hparams` (#1630)
[0.7.5] - Removed¶
Removed Warning from trainer loop (#1634)
[0.7.5] - Fixed¶
[0.7.4] - 2020-04-26¶
[0.7.4] - Added¶
Added flag `replace_sampler_ddp` to manually disable sampler replacement in DDP (#1513)
Added `auto_select_gpus` flag to trainer that enables automatic selection of available GPUs on exclusive mode systems.
Added learning rate finder (#1347)
Added support for DDP mode in clusters without SLURM (#1387)
Added `test_dataloaders` parameter to `Trainer.test()` (#1434)
Added `terminate_on_nan` flag to trainer that performs a NaN check with each training iteration when set to `True` (#1475)
Added speed parity tests (max 1 sec difference per epoch) (#1482)
Added `ddp_cpu` backend for testing ddp without GPUs (#1158)
Added Horovod support as a distributed backend `Trainer(distributed_backend='horovod')` (#1529)
Added support for 8 core distributed training on Kaggle TPU’s (#1568)
[0.7.4] - Changed¶
Changed the default behaviour to no longer include a NaN check with each training iteration (#1475)
Decoupled the progress bar from the trainer; it is now a callback and can be customized or even replaced entirely (#1450).
Changed lr schedule step interval behavior to update every backwards pass instead of every forwards pass (#1477)
Defines shared proc. rank, remove rank from instances (e.g. loggers) (#1408)
Updated semantic segmentation example with custom U-Net and logging (#1371)
Disabled val and test shuffling (#1600)
[0.7.4] - Deprecated¶
Deprecated `training_tqdm_dict` in favor of `progress_bar_dict` (#1450).
[0.7.4] - Removed¶
Removed `test_dataloaders` parameter from `Trainer.fit()` (#1434)
[0.7.4] - Fixed¶
Added the possibility to pass nested metrics dictionaries to loggers (#1582)
Fixed memory leak from opt return (#1528)
Fixed saving checkpoint before deleting old ones (#1453)
Fixed loggers - flushing last logged metrics even before continue, e.g. trainer.test() results (#1459)
Fixed optimizer configuration when configure_optimizers returns dict without lr_scheduler (#1443)
Fixed LightningModule - mixing hparams and arguments in LightningModule.__init__() crashes load_from_checkpoint() (#1505)
Added a missing call to the on_before_zero_grad model hook (#1493).
Allow use of sweeps with WandbLogger (#1512)
Fixed a bug that caused the callbacks Trainer argument to reference a global variable (#1534).
Fixed a bug that set all boolean CLI arguments from Trainer.add_argparse_args always to True (#1571)
Fixed unnecessary copying of the batch when training on a single GPU (#1576, #1579)
Fixed soft checkpoint removing on DDP (#1408)
Fixed automatic parser bug (#1585)
Fixed bool conversion from string (#1606)
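The boolean-argument fixes above (#1571, #1606) belong to a class of bug that is common in argparse-based CLIs: passing type=bool calls bool() on the raw string, and every non-empty string, including "False", is truthy. The changelog does not say this was the exact cause here, so the following is only a minimal sketch of the pitfall and one way around it. The helper str_to_bool and the flags --naive and --parsed are hypothetical names chosen for illustration, not Lightning's implementation.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Hypothetical helper: parse common textual spellings of a boolean."""
    lowered = value.strip().lower()
    if lowered in {"1", "true", "t", "yes", "y"}:
        return True
    if lowered in {"0", "false", "f", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


parser = argparse.ArgumentParser()
# bool("False") is True, so type=bool turns every non-empty string into True.
parser.add_argument("--naive", type=bool, default=False)
# An explicit converter inspects the text instead of its truthiness.
parser.add_argument("--parsed", type=str_to_bool, default=False)

args = parser.parse_args(["--naive", "False", "--parsed", "False"])
print(args.naive, args.parsed)  # True False
```

Raising argparse.ArgumentTypeError lets argparse report an unrecognised spelling as a normal usage error instead of silently picking a value.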
[0.7.3] - 2020-04-09¶
[0.7.3] - Added¶
Added rank_zero_warn for warning only in rank 0 (#1428)
[0.7.3] - Fixed¶
[0.7.2] - 2020-04-07¶
[0.7.2] - Added¶
Added same step loggers’ metrics aggregation (#1278)
Added parity test between a vanilla MNIST model and lightning model (#1284)
Added parity test between a vanilla RNN model and lightning model (#1351)
Added Reinforcement Learning - Deep Q-network (DQN) lightning example (#1232)
Added support for hierarchical dict (#1152)
Added TrainsLogger class (#1122)
Added type hints to pytorch_lightning.core (#946)
Added support for IterableDataset in validation and testing (#1104)
Added support for non-primitive types in hparams for TensorboardLogger (#1130)
Added a check that stops the training when loss or weights contain NaN or inf values. (#1097)
Added support for IterableDataset when val_check_interval=1.0 (default), this will trigger validation at the end of each epoch. (#1283)
Added summary method to Profilers. (#1259)
Added informative errors if user defined dataloader has zero length (#1280)
Added testing for python 3.8 (#915)
Added model configuration checking (#1199)
Added support for optimizer frequencies through LightningModule.configure_optimizers() (#1269)
Added option to run without an optimizer by returning None from configure_optimizers. (#1279)
Added a warning when the number of data loader workers is small. (#1378)
[0.7.2] - Changed¶
Changed (renamed and refactored) TensorRunningMean -> TensorRunningAccum: running accumulations were generalized. (#1278)
Changed progress_bar_refresh_rate trainer flag to disable progress bar when set to 0. (#1108)
Enhanced load_from_checkpoint to also forward params to the model (#1307)
Updated references to self.forward() to instead use the __call__ interface. (#1211)
Changed default behaviour of configure_optimizers to use no optimizer rather than Adam. (#1279)
Allow to upload models on W&B (#1339)
On DP and DDP2 unsqueeze is automated now (#1319)
Did not always create a DataLoader during reinstantiation, but the same type as before (if subclass of DataLoader) (#1346)
Did not interfere with a default sampler (#1318)
Remove default Adam optimizer (#1317)
Give warnings for unimplemented required lightning methods (#1317)
Made evaluate method private >> Trainer._evaluate(...). (#1260)
Simplify the PL examples structure (shallower and more readable) (#1247)
Changed min max gpu memory to be on their own plots (#1358)
Remove .item which causes sync issues (#1254)
Changed smoothing in TQDM to decrease variability of time remaining between training / eval (#1194)
Change default logger to dedicated one (#1064)
[0.7.2] - Deprecated¶
[0.7.2] - Removed¶
[0.7.2] - Fixed¶
Fixed model_checkpoint when saving all models (#1359)
Trainer.add_argparse_args classmethod fixed. Now it adds a type for the arguments (#1147)
Fixed bug related to type checking of ReduceLROnPlateau lr schedulers (#1126)
Fixed a bug to ensure lightning checkpoints to be backward compatible (#1132)
Fixed a bug that created an extra dataloader with active reload_dataloaders_every_epoch (#1196)
Fixed all warnings and errors in the docs build process (#1191)
Fixed an issue where val_percent_check=0 would not disable validation (#1251)
Fixed average of incomplete TensorRunningMean (#1309)
Fixed WandbLogger.watch with wandb.init() (#1311)
Fixed an issue with early stopping that would prevent it from monitoring training metrics when validation is disabled / not implemented (#1235).
Fixed a bug that would cause trainer.test() to run on the validation set when overloading validation_epoch_end and test_end (#1353)
Fixed WandbLogger.watch - use of the watch method without importing wandb (#1311)
Fixed WandbLogger to be used with ‘ddp’ - allow reinits in sub-processes (#1149, #1360)
Made training_epoch_end behave like validation_epoch_end (#1357)
Fixed fast_dev_run running validation twice (#1365)
Fixed pickle error from quick patch __code__ (#1352)
Fixed checkpointing interval (#1272)
Fixed validation and training loops run the partial dataset (#1192)
Fixed running on_validation_end only on main process in DDP (#1125)
Fixed load_spawn_weights only in proc rank 0 (#1385)
Fixes using deprecated use_amp attribute (#1145)
Fixed Tensorboard logger error: lightning_logs directory does not exist in multi-node DDP on nodes with rank != 0 (#1377)
Fixed Unimplemented backend XLA error on TPU (#1387)
[0.7.1] - 2020-03-07¶
[0.7.1] - Fixed¶
Fixes print issues and data_loader (#1080)
[0.7.0] - 2020-03-06¶
[0.7.0] - Added¶
Added automatic sampler setup. Depending on DDP or TPU, lightning configures the sampler correctly (user needs to do nothing) (#926)
Added reload_dataloaders_every_epoch=False flag for trainer. Some users require reloading data every epoch (#926)
Added progress_bar_refresh_rate=50 flag for trainer. Throttle refresh rate on notebooks (#926)
Updated governance docs
Added a check to ensure that the metric used for early stopping exists before training commences (#542)
Added optimizer_idx argument to backward hook (#733)
Added entity argument to WandbLogger to be passed to wandb.init (#783)
Added a tool for profiling training runs (#782)
Improved flexibility for naming of TensorBoard logs, can now set version to a str to just save to that directory, and use name='' to prevent experiment-name directory (#804)
Added option to specify step key when logging metrics (#808)
Added train_dataloader, val_dataloader and test_dataloader arguments to Trainer.fit(), for alternative data parsing (#759)
Added Tensor Processing Unit (TPU) support (#868)
Split callbacks in multiple files (#849)
Added support for multiple loggers to be passed to Trainer as an iterable (e.g. list, tuple, etc.) (#903)
Added support for step-based learning rate scheduling (#941)
Added support for logging hparams as dict (#1029)
Checkpoint and early stopping now work without val. step (#1041)
Support graceful training cleanup after Keyboard Interrupt (#856, #1019)
Added type hints for function arguments (#912, )
Added TPU gradient clipping (#963)
Added max/min number of steps in Trainer (#728)
[0.7.0] - Changed¶
Improved NeptuneLogger by adding close_after_fit argument to allow logging after training (#908)
Changed default TQDM to use tqdm.auto for prettier outputs in IPython notebooks (#752)
Changed pytorch_lightning.logging to pytorch_lightning.loggers (#767)
Moved the default tqdm_dict definition from Trainer to LightningModule, so it can be overridden by the user (#749)
Moved functionality of LightningModule.load_from_metrics into LightningModule.load_from_checkpoint (#995)
Changed Checkpoint path parameter from filepath to dirpath (#1016)
Froze the model's hparams as a Namespace property (#1029)
Dropped logging config in package init (#1015)
Renames model steps (#1051):
training_end >> training_epoch_end
validation_end >> validation_epoch_end
test_end >> test_epoch_end
Refactor dataloading, supports infinite dataloader (#955)
Create single file in TensorBoardLogger (#777)
[0.7.0] - Deprecated¶
[0.7.0] - Removed¶
[0.7.0] - Fixed¶
Fixed a bug where early stopping on_end_epoch would be called inconsistently when check_val_every_n_epoch == 0 (#743)
Fixed a bug where the model checkpointer didn’t write to the same directory as the logger (#771)
Fixed a bug where the TensorBoardLogger class would create an additional empty log file during fitting (#777)
Fixed a bug where global_step was advanced incorrectly when using accumulate_grad_batches > 1 (#832)
Fixed a bug when calling self.logger.experiment with multiple loggers (#1009)
Fixed a bug when calling logger.append_tags on a NeptuneLogger with a single tag (#1009)
Fixed sending back data from .spawn by saving and loading the trained model in/out of the process (#1017)
Fixed port collision on DDP (#1010)
Fixed/tested pass overrides (#918)
Fixed comet logger to log after train (#892)
Remove deprecated args to learning rate step function (#890)
[0.6.0] - 2020-01-21¶
[0.6.0] - Added¶
Added support for resuming from a specific checkpoint via resume_from_checkpoint argument (#516)
Added support for ReduceLROnPlateau scheduler (#320)
Added support for Apex mode O2 in conjunction with Data Parallel (#493)
Added option (save_top_k) to save the top k models in the ModelCheckpoint class (#128)
Added on_train_start and on_train_end hooks to ModelHooks (#598)
Added TensorBoardLogger (#607)
Added support for weight summary of model with multiple inputs (#543)
Added map_location argument to load_from_metrics and load_from_checkpoint (#625)
Added option to disable validation by setting val_percent_check=0 (#649)
Added NeptuneLogger class (#648)
Added WandbLogger class (#627)
[0.6.0] - Changed¶
Changed the default progress bar to print to stdout instead of stderr (#531)
Renamed step_idx to step, epoch_idx to epoch, max_num_epochs to max_epochs and min_num_epochs to min_epochs (#589)
Renamed total_batch_nb to total_batches, nb_val_batches to num_val_batches, nb_training_batches to num_training_batches, max_nb_epochs to max_epochs, min_nb_epochs to min_epochs, nb_test_batches to num_test_batches, and nb_val_batches to num_val_batches (#567)
Changed gradient logging to use parameter names instead of indexes (#660)
Changed the default logger to TensorBoardLogger (#609)
Changed the directory for tensorboard logging to be the same as model checkpointing (#706)
[0.6.0] - Deprecated¶
[0.6.0] - Removed¶
Removed the save_best_only argument from ModelCheckpoint, use save_top_k=1 instead (#128)
[0.6.0] - Fixed¶
Fixed a bug which occurred when using Adagrad with cuda (#554)
Fixed a bug where training would be on the GPU despite setting gpus=0 or gpus=[] (#561)
Fixed an error with print_nan_gradients when some parameters do not require gradient (#579)
Fixed a bug where the progress bar would show an incorrect number of total steps during the validation sanity check when using multiple validation data loaders (#597)
Fixed support for PyTorch 1.1.0 (#552)
Fixed an issue with early stopping when using a val_check_interval < 1.0 in Trainer (#492)
Fixed bugs relating to the CometLogger object that would cause it to not work properly (#481)
Fixed a bug that would occur when returning -1 from on_batch_start following an early exit or when the batch was None (#509)
Fixed a potential race condition with several processes trying to create checkpoint directories (#530)
Fixed a bug where batch ‘segments’ would remain on the GPU when using truncated_bptt > 1 (#532)
Fixed a bug when using IterableDataset (#547)
Fixed a bug where .item was called on non-tensor objects (#602)
Fixed a bug where Trainer.train would crash on an uninitialized variable if the trainer was run after resuming from a checkpoint that was already at max_epochs (#608)
Fixed a bug where early stopping would begin two epochs early (#617)
Fixed a bug where num_training_batches and num_test_batches would sometimes be rounded down to zero (#649)
Fixed a bug where an additional batch would be processed when manually setting num_training_batches (#653)
Fixed a bug when batches did not have a .copy method (#701)
Fixed a bug when using log_gpu_memory=True in Python 3.6 (#715)
Fixed a bug where checkpoint writing could exit before completion, giving incomplete checkpoints (#689)
Fixed a bug where on_train_end was not called when early stopping (#723)
[0.5.3] - 2019-11-06¶
[0.5.3] - Added¶
Added option to disable default logger, checkpointer, and early stopping by passing logger=False, checkpoint_callback=False and early_stop_callback=False respectively
Added CometLogger for use with Comet.ml
Added val_check_interval argument to Trainer allowing validation to be performed at every given number of batches
Added functionality to save and load hyperparameters using the standard checkpoint mechanism
Added call to torch.cuda.empty_cache before training starts
Added option for user to override the call to backward
Added support for truncated backprop through time via the truncated_bptt_steps argument in Trainer
Added option to operate on all outputs from training_step in DDP2
Added a hook for modifying DDP init
Added a hook for modifying Apex
[0.5.3] - Changed¶
Changed experiment version to be padded with zeros (e.g. /dir/version_9 becomes /dir/version_0009)
Changed callback metrics to include any metrics given in logs or progress bar
Changed the default for save_best_only in ModelCheckpoint to True
Added tng_data_loader for backwards compatibility
Renamed MLFlowLogger.client to MLFlowLogger.experiment for consistency
Moved global_step increment to happen after the batch has been processed
Changed weights restore to first attempt HPC weights before restoring normally, preventing both weights being restored and running out of memory
Changed progress bar functionality to add multiple progress bars for train/val/test
Changed calls to print to use logging instead
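The zero-padding change above (/dir/version_9 becoming /dir/version_0009) can be illustrated with a small sketch. The helper version_dir and its width parameter are hypothetical; the width of 4 is inferred only from the example in the entry. The changelog does not state the motivation, but one plausible benefit is that padded names sort in numeric order when sorted as plain strings.

```python
def version_dir(root: str, version: int, width: int = 4) -> str:
    """Hypothetical helper: build a zero-padded version directory path."""
    return f"{root}/version_{version:0{width}d}"


# Plain string sorting puts "version_10" before "version_9" ...
unpadded = sorted(f"/dir/version_{v}" for v in (9, 10))
# ... while zero-padded names keep their numeric order.
padded = sorted(version_dir("/dir", v) for v in (9, 10))

print(version_dir("/dir", 9))  # /dir/version_0009
print(unpadded)  # ['/dir/version_10', '/dir/version_9']
print(padded)    # ['/dir/version_0009', '/dir/version_0010']
```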
[0.5.3] - Deprecated¶
Deprecated tng_dataloader
[0.5.3] - Fixed¶
Fixed an issue where the number of batches was off by one during training
Fixed a bug that occurred when setting a checkpoint callback and early_stop_callback=False
Fixed an error when importing CometLogger
Fixed a bug where the gpus argument had some unexpected behaviour
Fixed a bug where the computed total number of batches was sometimes incorrect
Fixed a bug where the progress bar would sometimes not show the total number of batches in test mode
Fixed a bug when using the log_gpu_memory='min_max' option in Trainer
Fixed a bug where checkpointing would sometimes erase the current directory
[0.5.2] - 2019-10-10¶
[0.5.2] - Added¶
Added weights_summary argument to Trainer to be set to full (full summary), top (just top level modules) or other
Added tags argument to MLFlowLogger
[0.5.2] - Changed¶
Changed default for amp_level to O1
[0.5.2] - Removed¶
Removed the print_weights_summary argument from Trainer
[0.5.2] - Fixed¶
Fixed a bug where logs were not written properly
Fixed a bug where logger.finalize wasn’t called after training is complete
Fixed callback metric errors in DDP
Fixed a bug where TestTubeLogger didn’t log to the correct directory
[0.5.1] - 2019-10-05¶
[0.5.1] - Added¶
Added the LightningLoggerBase class for experiment loggers
Added MLFlowLogger for logging with mlflow
Added TestTubeLogger for logging with test_tube
Added a different implementation of DDP (distributed_backend='ddp2') where every node has one model using all GPUs
Added support for optimisers which require a closure (e.g. LBFGS)
Added automatic MASTER_PORT default for DDP when not set manually
Added new GPU memory logging options 'min_max' (log only the min/max utilization) and 'all' (log all the GPU memory)
[0.5.1] - Changed¶
Changed schedulers to always be called with the current epoch
Changed test_tube to an optional dependency
Changed data loaders to internally use a getter instead of a python property
Disabled auto GPU loading when restoring weights to prevent out of memory errors
Changed logging, early stopping and checkpointing to occur by default
[0.5.1] - Fixed¶
Fixed a bug with samplers that do not specify set_epoch
Fixed a bug when using the MLFlowLogger with unsupported data types, this will now raise a warning
Fixed a bug where gradient norms were always zero using track_grad_norm
Fixed a bug which causes a crash when logging memory
[0.5.0] - 2019-09-26¶
[0.5.0] - Changed¶
Changed data_batch argument to batch throughout
Changed batch_i argument to batch_idx throughout
Changed tng_dataloader method to train_dataloader
Changed on_tng_metrics method to on_training_metrics
Changed gradient_clip argument to gradient_clip_val
Changed add_log_row_interval to row_log_interval
[0.5.0] - Fixed¶
Fixed a bug with tensorboard logging in multi-gpu setup
[0.4.9] - 2019-09-16¶
[0.4.9] - Added¶
Added the flag log_gpu_memory to Trainer to deactivate logging of GPU memory utilization
Added SLURM resubmit functionality (port from test-tube)
Added optional weight_save_path to trainer to remove the need for a checkpoint_callback when using cluster training
Added option to use single gpu per node with DistributedDataParallel
[0.4.9] - Changed¶
Changed functionality of validation_end and test_end with multiple dataloaders to be given all of the dataloaders at once rather than in separate calls
Changed print_nan_grads to only print the parameter value and gradients when they contain NaN
Changed gpu API to take integers as well (e.g. gpus=2 instead of gpus=[0, 1])
All models now loaded on to CPU to avoid device and out of memory issues in PyTorch
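The gpus change above accepts either a count or an explicit list of device indices. As a rough sketch of that idea, and not the library's actual parsing code, a hypothetical normalize_gpus helper could map both forms to a list of indices. Rejecting negative counts is a simplification made for this sketch only.

```python
from typing import List, Sequence, Union


def normalize_gpus(gpus: Union[int, Sequence[int]]) -> List[int]:
    """Hypothetical helper: turn a GPU count or index list into an index list."""
    if isinstance(gpus, bool):
        # bool is a subclass of int; reject it to avoid surprising results.
        raise TypeError("gpus must be an int or a sequence of ints, not bool")
    if isinstance(gpus, int):
        if gpus < 0:
            raise ValueError("this sketch only accepts non-negative counts")
        # A count of n selects the first n devices: 2 -> [0, 1].
        return list(range(gpus))
    return list(gpus)


print(normalize_gpus(2))       # [0, 1]
print(normalize_gpus([0, 1]))  # [0, 1]
print(normalize_gpus(0))       # []
```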
[0.4.9] - Fixed¶
Fixed a bug where data types that implement .to but not .cuda would not be properly moved onto the GPU
Fixed a bug where data would not be re-shuffled every epoch when using a DistributedSampler
[0.4.8] - 2019-08-31¶
[0.4.8] - Added¶
Added test_step and test_end methods, used when Trainer.test is called
Added GradientAccumulationScheduler callback which can be used to schedule changes to the number of accumulation batches
Added option to skip the validation sanity check by setting nb_sanity_val_steps = 0
[0.4.8] - Fixed¶
Fixed a bug when setting nb_sanity_val_steps = 0
[0.4.7] - 2019-08-24¶
[0.4.7] - Changed¶
Changed the default val_check_interval to 1.0
Changed defaults for nb_val_batches, nb_tng_batches and nb_test_batches to 0
[0.4.7] - Fixed¶
Fixed a bug where the full validation set was used despite setting val_percent_check
Fixed a bug where an Exception was thrown when using a data set containing a single batch
Fixed a bug where an Exception was thrown if no val_dataloader was given
Fixed a bug where tuples were not properly transferred to the GPU
Fixed a bug where data of a non standard type was not properly handled by the trainer
Fixed a bug when loading data as a tuple
Fixed a bug where AttributeError could be suppressed by the Trainer
[0.4.6] - 2019-08-15¶
[0.4.6] - Added¶
Added support for data to be given as a dict or list with a single gpu
Added support for configure_optimizers to return a single optimizer, two lists (optimizers and schedulers), or a single list
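The three return shapes listed above for configure_optimizers (a single optimizer, two lists, or a single list) can be normalised into one form. The sketch below shows one way a caller might do that; normalize_optimizers is a hypothetical helper, plain strings stand in for real optimizer and scheduler objects, and none of this is taken from the library's source.

```python
from typing import Any, List, Tuple


def normalize_optimizers(result: Any) -> Tuple[List[Any], List[Any]]:
    """Hypothetical helper: map any supported return shape to (optimizers, schedulers)."""
    is_sequence = isinstance(result, (list, tuple))
    # Two lists: the first holds optimizers, the second holds schedulers.
    if is_sequence and len(result) == 2 and all(isinstance(r, list) for r in result):
        optimizers, schedulers = result
        return list(optimizers), list(schedulers)
    # A single list: optimizers only, no schedulers.
    if is_sequence:
        return list(result), []
    # A single optimizer object.
    return [result], []


print(normalize_optimizers("adam"))                  # (['adam'], [])
print(normalize_optimizers(["adam", "sgd"]))         # (['adam', 'sgd'], [])
print(normalize_optimizers((["adam"], ["step_lr"]))) # (['adam'], ['step_lr'])
```

Checking for the two-list shape first matters, because a pair of lists would otherwise be mistaken for a plain list of two optimizers.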
[0.4.6] - Fixed¶
Fixed a bug where returning just an optimizer list (i.e. without schedulers) from configure_optimizers would throw an Exception
[0.4.5] - 2019-08-13¶
[0.4.5] - Added¶
Added optimizer_step method that can be overridden to change the standard optimizer behaviour
[0.4.4] - 2019-08-12¶
[0.4.4] - Added¶
Added support for multiple validation dataloaders
Added support for latest test-tube logger (optimised for torch==1.2.0)
[0.4.4] - Changed¶
validation_step and val_dataloader are now optional
lr_scheduler is now activated after epoch
[0.4.4] - Fixed¶
Fixed a bug where a warning would show when using lr_scheduler in torch>1.1.0
Fixed a bug where an Exception would be thrown if using torch.DistributedDataParallel without using a DistributedSampler, this now throws a Warning instead
[0.4.3] - 2019-08-10¶
[0.4.3] - Fixed¶
Fixed a bug where accumulate gradients would scale the loss incorrectly
[0.4.2] - 2019-08-08¶
[0.4.2] - Changed¶
Changed install requirement to torch==1.2.0
[0.4.1] - 2019-08-08¶
[0.4.1] - Changed¶
Changed install requirement to torch==1.1.0
[0.4.0] - 2019-08-08¶
[0.4.0] - Added¶
Added 16-bit support for a single GPU
Added support for training continuation (preserves epoch, global step etc.)
[0.4.0] - Changed¶
Changed training_step and validation_step, outputs will no longer be automatically reduced
[0.4.0] - Removed¶
Removed need for Experiment object in Trainer
[0.4.0] - Fixed¶
Fixed issues with reducing outputs from generative models (such as images and text)
[0.3.6] - 2019-07-25¶
[0.3.6] - Added¶
Added a decorator to do lazy data loading internally
[0.3.6] - Fixed¶
Fixed a bug where Experiment object was not process safe, potentially causing logs to be overwritten