Imprecision in pytorch-lightning's Gradient Accumulation? #18743
Unanswered
leehawk787 asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
In my understanding of Gradient Accumulation, the result of computing with
batch_size = x
vs. accumulate_gradient = h, batch_size = x/h
should be the same. So these three samples should compute the same thing in pytorch-lightning:
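For concreteness, a minimal sketch of the three settings in terms of the Trainer's `accumulate_grad_batches` option (the effective batch size of 64 is just an illustrative choice, not the original values):

```python
# Illustrative sketch: three Trainer configurations with the same effective
# batch size (64 here), differing only in how the gradient is accumulated.
import pytorch_lightning as pl

trainer_a = pl.Trainer(accumulate_grad_batches=1)  # DataLoader batch_size=64
trainer_b = pl.Trainer(accumulate_grad_batches=2)  # DataLoader batch_size=32
trainer_c = pl.Trainer(accumulate_grad_batches=4)  # DataLoader batch_size=16
```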
However, they do not. The weights of the model differ slightly (~1e-5) across the three cases after just a few hundred batches.
Here is a full reproducible example:
[python 3.11.5, pytorch-lightning 2.0.9]
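A minimal sketch of the kind of setup this boils down to, assuming a toy linear model on fixed random data (`ToyModel`, `run`, and all hyperparameters below are illustrative placeholders, not the original code):

```python
# Minimal sketch of a gradient-accumulation comparison (illustrative only):
# a tiny linear regression trained three times with the same effective batch
# size of 64 but different accumulation settings.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def run(batch_size, accumulate):
    # Re-seed before building data and model so every run starts identically.
    pl.seed_everything(0, workers=True)
    x = torch.randn(1024, 10)
    y = torch.randn(1024, 1)
    loader = DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=False)
    model = ToyModel()
    trainer = pl.Trainer(
        max_epochs=1,
        accumulate_grad_batches=accumulate,
        deterministic=True,
        accelerator="cpu",
        logger=False,
        enable_checkpointing=False,
        enable_progress_bar=False,
    )
    trainer.fit(model, loader)
    return model.layer.weight.detach().clone()


# Effective batch size of 64 in all three runs.
w1 = run(batch_size=64, accumulate=1)
w2 = run(batch_size=32, accumulate=2)
w3 = run(batch_size=16, accumulate=4)

# Compare the final weights of the three runs.
print((w1 - w2).abs().max(), (w1 - w3).abs().max())
```

`shuffle=False` and the per-run `seed_everything` call keep initialization and data order identical across the three runs, so any remaining difference should come from the accumulation itself rather than from randomness.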
I tried switching the learning rate, optimizer, dataset, and model; the difference between the three versions persists. I also tried it on CPU and on two different GPUs; the difference persists there as well.
I'd like to understand where this difference comes from and whether it can be eliminated. Any ideas?