CPU-Memory keeps accumulating during trainer.predict #19398
Comments
@surajpaib In addition to your workaround of resetting the stored predictions with `trainer.predict_loop._predictions = [[] for _ in range(trainer.predict_loop.num_dataloaders)]`, but based on your description (…)
To add to this, there is a minor difference in memory usage over time with and without clearing the `_predictions` list. Given that our batch inferences take long (3D images), … What I still don't get is how the memory accumulates when `return_predictions=False` is set.
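For readers looking for the concrete shape of the clearing workaround discussed in these comments, here is a minimal sketch. It assumes Lightning 2.x and touches `trainer.predict_loop._predictions`, a private attribute, so it is a stopgap rather than a supported API; the helper name is made up for illustration.

```python
import lightning.pytorch as pl


def reset_stored_predictions(trainer: pl.Trainer) -> None:
    """Drop references to the predictions accumulated by the predict loop.

    Hypothetical helper based on the snippet quoted above; `_predictions` is a
    private attribute and may change between Lightning versions.
    """
    loop = trainer.predict_loop
    # One empty list per predict dataloader, matching the loop's internal layout.
    loop._predictions = [[] for _ in range(loop.num_dataloaders)]
```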
I have the same issue with PyTorch Lightning 2.5.0 and Torch 2.5.1 in a DDP setup.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
I had a similar issue; clearing the stored predictions helped.
Bug description
This is very similar to the closed issue #15656.

I am running prediction with the PL Trainer on 3D images, which are huge, and my process keeps getting killed when a large number of samples need to be predicted. I found #15656 and expected it to be the solution, but setting `return_predictions=False` does not fix the memory accumulation. What works instead is adding a `gc.collect()` call in the `predict_loop`; this keeps CPU memory usage constant, as expected. It seems like setting `return_predictions=False` alone should stop the accumulation, so I am confused about why the `gc.collect()` is needed.

This is where the `gc.collect()` is applied: https://github.com/project-lighter/lighter/blob/07018bb2c66c0c8848bab748299e2c2d21c7d185/lighter/callbacks/writer/base.py#L120

I have also attached memory logs recorded with `scalene` comparing the `return_predictions=False` run and the `gc.collect()` run. As the logs show, there is no memory growth with `gc.collect()`. Would you be able to provide any intuition on this? It would be much appreciated!
What version are you seeing the problem on?
v2.1
How to reproduce the bug
No response
Error messages and logs
gc_collect.pdf
return_predictions_false.pdf
Environment
More info
No response
cc @lantiga @Borda