Why are only the blocks wrapped with grad checkpointing in ViT? #1303

DaniNem · 2022-06-13T15:09:08Z


Hey!

I'm trying to understand how to minimize GPU memory use during training, and I tried set_grad_checkpointing with the ViT model. It seems that checkpointing is applied only to self.blocks and not to self.patch_embed. Why is that?

Thanks!
Daniel

Answered by rwightman


rwightman (Maintainer) · 2022-06-13T17:46:06Z

@DaniNem only recent versions of PyTorch (I think 1.11+) allow safely checkpointing the first block, once the use_reentrant flag was added and could be set to False. However, the additional gains are minimal, so I opted to keep it simple and just checkpoint the blocks for all models where it made sense to do so.
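
For anyone who wants to try it anyway, here is a rough sketch of what checkpointing the patch embedding could look like, assuming a PyTorch version where torch.utils.checkpoint.checkpoint accepts use_reentrant. TinyViT and patch_embed_grad are made-up names for illustration, not timm code. The reentrant implementation only propagates gradients into a checkpointed segment if at least one of its inputs requires grad, and the raw image batch does not, which is why the non-reentrant variant is used for the embedding here.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class TinyViT(nn.Module):
    """Hypothetical toy stand-in for a ViT: patch embedding plus a stack of blocks."""

    def __init__(self, dim=32, depth=2, patch=4, checkpoint_embed=False):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Simple MLP blocks stand in for real transformer blocks.
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
            for _ in range(depth)
        )
        self.checkpoint_embed = checkpoint_embed

    def _embed(self, x):
        # (B, 3, H, W) -> (B, num_patches, dim)
        return self.patch_embed(x).flatten(2).transpose(1, 2)

    def forward(self, x):
        if self.checkpoint_embed:
            # The image tensor does not require grad, so use the
            # non-reentrant implementation for this segment.
            x = checkpoint(self._embed, x, use_reentrant=False)
        else:
            x = self._embed(x)
        for blk in self.blocks:
            x = checkpoint(blk, x, use_reentrant=False)
        return x.mean(dim=1)


def patch_embed_grad(checkpoint_embed):
    """Run one forward/backward pass and return the patch embedding weight grad."""
    torch.manual_seed(0)  # same weights and inputs for both variants
    model = TinyViT(checkpoint_embed=checkpoint_embed)
    images = torch.randn(2, 3, 16, 16)  # requires_grad is False, like real inputs
    model(images).sum().backward()
    return model.patch_embed.weight.grad.clone()


grad_plain = patch_embed_grad(checkpoint_embed=False)
grad_ckpt = patch_embed_grad(checkpoint_embed=True)
print(torch.allclose(grad_plain, grad_ckpt, atol=1e-5))
```

The comparison at the end only checks that the patch embedding still receives the same gradient when it is checkpointed. It says nothing about memory savings, which would need to be measured on a real model and batch size.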
