run_training_epoch duration increases with more epochs #17694
Unanswered
nilsleh asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
I have a LightningModule, DataModule, and Trainer that I am using on a regression problem. I have observed that as the epochs increase, the iterations/s shown on the tqdm bar decrease significantly, by a factor of about 2-5. To look into this I used the SimpleProfiler and recorded run_training_epoch at each epoch inside on_train_epoch_end(). When I plot these durations after 1000 epochs, I get the following:

[plot: run_training_epoch duration per epoch, growing roughly linearly over 1000 epochs]

I cannot share the full example that produced the plot above, but I tried to create a small toy example in a Google Colab notebook. The trend is not as severe as in the picture above, but it is still there, so I am wondering where else the source of this could be, since an individual training batch or the optimization step does not show this stark linear trend.

I have tried with lightning=2.0.2 and pytorch_lightning=1.9.5.
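For reference, below is a minimal sketch of the kind of setup described above, not the author's actual Colab notebook: the ToyRegression module, the synthetic dataset, and all hyperparameters are made up for illustration. It records the "run_training_epoch" duration at each epoch inside on_train_epoch_end(), reading SimpleProfiler's recorded_durations dict, which is an internal attribute rather than a public API and may differ between Lightning versions, so a plain wall-clock timer is kept alongside it as a cross-check.

```python
# Hedged sketch of a toy regression setup with per-epoch duration recording.
# Assumptions: lightning>=2.0, Trainer created with profiler=SimpleProfiler(),
# and the profiler action name "run_training_epoch" (as reported in the post).
import time

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

import lightning.pytorch as pl
from lightning.pytorch.profilers import SimpleProfiler


class ToyRegression(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
        self.epoch_durations = []   # values read from SimpleProfiler
        self.manual_durations = []  # wall-clock cross-check

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.model(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def on_train_epoch_start(self):
        self._epoch_start = time.monotonic()

    def on_train_epoch_end(self):
        # Wall-clock duration of the epoch that just ran.
        self.manual_durations.append(time.monotonic() - self._epoch_start)

        # Durations recorded by SimpleProfiler under "run_training_epoch".
        # This hook still runs inside the current epoch's profiling context,
        # so the last entry may lag the current epoch by one.
        recorded = self.trainer.profiler.recorded_durations.get("run_training_epoch", [])
        if recorded:
            self.epoch_durations.append(recorded[-1])


def main():
    x = torch.randn(512, 10)
    y = x.sum(dim=1, keepdim=True) + 0.1 * torch.randn(512, 1)
    loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

    module = ToyRegression()
    trainer = pl.Trainer(
        max_epochs=1000,
        profiler=SimpleProfiler(),
        logger=False,
        enable_checkpointing=False,
    )
    trainer.fit(module, loader)
    # module.manual_durations and module.epoch_durations can then be plotted
    # against the epoch index to see whether the per-epoch time grows.


if __name__ == "__main__":
    main()
```

Plotting manual_durations next to the profiler's values gives a profiler-independent check of whether the per-epoch time really increases, or whether the effect is an artifact of how the durations are collected.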
Replies: 1 comment

- I am also having issues with this. I tried training a CycleGAN model on …