Resuming WandB runs in PL #11702
Replies: 7 comments · 3 replies
-
Hey @williamFalcon, any idea about this?
-
@rahulvigneswaran could this be because you set …
-
@rahulvigneswaran I believe the W&B docs state that you should set …
-
We can resume it this way: …
I got the id from the folder name, which looks like …
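For context (not from the original reply): wandb names each local run directory `run-<timestamp>-<run_id>`, e.g. `wandb/run-20220131_120000-1a2b3c4d`, so the id is the last `-`-separated field of the folder name. A hypothetical helper to pull it out:

```python
from pathlib import Path


def run_id_from_wandb_dir(run_dir: str) -> str:
    """Extract the W&B run id from a local run directory name.

    e.g. "wandb/run-20220131_120000-1a2b3c4d" -> "1a2b3c4d"
    """
    return Path(run_dir).name.split("-")[-1]
```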
-
Any other solutions?
-
You can also set …
-
I've solved this in LightningCLI the following way:

```python
# main.py
from pathlib import Path

import wandb
from pytorch_lightning.cli import LightningCLI  # older PL versions: pytorch_lightning.utilities.cli


class MyLightningCLI(LightningCLI):
    def add_arguments_to_parser(self, parser):
        parser.add_argument(
            "--resume_run_id", default="", type=str, help="W&B run ID to resume from"
        )
        # Only ask the logger to resume when a run id was actually given
        parser.link_arguments(
            "resume_run_id",
            "trainer.logger.init_args.resume",
            compute_fn=lambda x: "must" if x else "never",
        )

    def before_instantiate_classes(self):
        subcommand = self.config.subcommand
        c = self.config[subcommand]

        run_id = None
        if c.resume_run_id:
            # Resuming: download the latest model checkpoint logged to W&B for that run
            run_id = c.resume_run_id
            api = wandb.Api()
            artifact = api.artifact(
                f"{c.trainer.logger.init_args.project}/model-{run_id}:latest",
                type="model",
            )
            artifact_dir = artifact.download()
            c.ckpt_path = str(Path(artifact_dir) / "model.ckpt")
        else:
            # Fresh run: generate the W&B id up front
            run_id = wandb.util.generate_id()

        # also make sure that ModelCheckpoints go to the right place
        c.trainer.logger.init_args.id = run_id
        for callback in c.trainer.callbacks:
            if callback.class_path == "pytorch_lightning.callbacks.ModelCheckpoint":
                callback.init_args.dirpath = f"checkpoints/{run_id}"


if __name__ == "__main__":
    _cli = MyLightningCLI(
        ...
    )
```

and the relevant part of my trainer config:

```yaml
logger:
  class_path: pytorch_lightning.loggers.WandbLogger
  init_args:
    project: MyProject
    log_model: all
callbacks:
  - class_path: pytorch_lightning.callbacks.ModelCheckpoint
    init_args:
      save_top_k: 1
      save_last: True
      monitor: val/total_loss
      filename: '{epoch}-{step}-{validation_loss:.3f}'
```

Now I can just do …, and it will download the latest checkpoint and restart appropriately using …
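A usage sketch (the config file name is an assumption, not from the original reply): `python main.py fit --config config.yaml --resume_run_id <run_id>` resumes from the latest `model-<run_id>` artifact on W&B, while omitting `--resume_run_id` starts a fresh run under a newly generated id.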
-
When I start a run, I always generate a wandb id using `wandb.util.generate_id()` and save it alongside the ckpt. When a run crashes, I try to resume the trainer by providing the appropriate `ckpt_path` in `trainer.fit`, and try to resume the wandb logger by doing the following (`cfg.LOGGER.WANDB_ID` is the same wandb id that I saved in the earlier step): …
While the trainer resumes without any hiccup, the wandb logger, instead of resuming the previous run, creates a new run and logs from the resumed epoch.
Where am I going wrong, and how do I fix it?
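A minimal sketch of the setup being described, with the pattern the W&B resuming docs generally recommend: reuse the saved id and also pass `resume="allow"` so wandb attaches to the existing run instead of creating a new one. File names, the project name, and `model`/`dm` below are illustrative placeholders, not from the original post.

```python
from pathlib import Path

import wandb
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

run_id_file = Path("checkpoints/wandb_run_id.txt")  # placeholder path

if run_id_file.exists():
    # Crashed run: reuse the saved id and the last checkpoint
    run_id = run_id_file.read_text().strip()
    ckpt_path = "checkpoints/last.ckpt"
else:
    # Fresh run: generate and persist a new id
    run_id = wandb.util.generate_id()
    run_id_file.parent.mkdir(parents=True, exist_ok=True)
    run_id_file.write_text(run_id)
    ckpt_path = None

# Reusing the same id *and* asking wandb to resume is what keeps the logger on
# the old run; extra kwargs such as `resume` are forwarded to `wandb.init`.
logger = WandbLogger(project="my-project", id=run_id, resume="allow")

trainer = Trainer(logger=logger, max_epochs=10)
trainer.fit(model, datamodule=dm, ckpt_path=ckpt_path)  # model/dm defined elsewhere
```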