Why pytorch-lightning cost more gpu-memory than pytorch? #6653
Unanswered
dalek-who
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 2 comments
-
Also, the most confusing part is that the loss curves and final performance between pl and pure pytorch are very different, yet when I check the first several steps, the losses are identical. Does GPUAccelerator do something silently so that, after many steps, the gradients and weights drift further and further apart?
0 replies
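One way to test whether the two runs really drift apart step by step is to dump the weights from both training loops at each step and compare them. Below is a minimal, framework-agnostic sketch; the function name `max_param_diff` and the flattened-dict format are my own for illustration, not anything Lightning provides. The dicts could come from e.g. `{n: p.detach().cpu().flatten().tolist() for n, p in model.named_parameters()}` in either loop.

```python
def max_param_diff(state_a, state_b):
    """Largest absolute elementwise difference between two parameter dicts.

    Both dicts map parameter names to flat lists of floats, one dict per
    training run. A result of 0.0 means the runs are still bitwise-identical;
    any drift shows up as a growing positive number over the steps.
    """
    worst = 0.0
    for name, values_a in state_a.items():
        values_b = state_b[name]
        for a, b in zip(values_a, values_b):
            worst = max(worst, abs(a - b))
    return worst


# Hypothetical snapshots after the same step of each run:
run_pl = {"layer.weight": [0.5, -0.25], "layer.bias": [0.125]}
run_pt = {"layer.weight": [0.5, -0.25], "layer.bias": [0.0]}
print(max_param_diff(run_pl, run_pt))  # prints 0.125
```

Checking this once per step pinpoints the first step where the two runs stop matching, which is usually more informative than only comparing the loss curves.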
-
Hi @dalek-who,
0 replies
-

This is my GPU usage: the top plot is pytorch-lightning and the bottom is pure pytorch, with the same model, same batch_size, same data, and same data order, but pytorch-lightning uses much more GPU memory. I use only one GPU, and here's my trainer:
and during training, I check that `trainer.accelerator` is `pytorch_lightning.accelerators.gpu.GPUAccelerator`. Is this accelerator the same as pure pytorch with a single GPU?
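To make the memory comparison concrete, it helps to record the peak allocation PyTorch's CUDA allocator actually saw in each run, rather than reading `nvidia-smi`. A hedged sketch using torch's built-in counters (`peak_mib` is my own helper name; it returns 0.0 when no GPU is present):

```python
import torch


def peak_mib(device: int = 0) -> float:
    """Peak GPU memory allocated on `device` since the last reset, in MiB.

    Call torch.cuda.reset_peak_memory_stats(device) before the training
    loop, then call this after it, in both the Lightning script and the
    pure-pytorch script, to compare like for like. Returns 0.0 if CUDA
    is unavailable so the helper is safe to import anywhere.
    """
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.max_memory_allocated(device) / (1024 ** 2)


# Usage in either script:
# torch.cuda.reset_peak_memory_stats(0)
# ... run the training loop ...
# print(f"peak: {peak_mib(0):.1f} MiB")
```

Note that `nvidia-smi` also counts memory PyTorch's caching allocator holds but is not currently using, so `torch.cuda.max_memory_allocated` is usually the fairer number to compare between the two runs.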