Resuming WandB runs in PL #11702
Replies: 7 comments · 3 replies
-
Hey @williamFalcon, any idea about this?
-
@rahulvigneswaran could this be because you set …
-
@rahulvigneswaran I believe the W&B docs state that you should set …
-
We can resume it this way: …
I got the id from the folder name, which looks like …
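For context (not from the original reply): wandb names each local run directory `run-<timestamp>-<run_id>`, e.g. `wandb/run-20220131_120000-1a2b3c4d`, so the id is the last `-`-separated field of the folder name. A hypothetical helper to pull it out:

```python
from pathlib import Path


def run_id_from_wandb_dir(run_dir: str) -> str:
    """Extract the W&B run id from a local run directory name.

    e.g. "wandb/run-20220131_120000-1a2b3c4d" -> "1a2b3c4d"
    """
    return Path(run_dir).name.split("-")[-1]
```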
-
Any other solutions?
-
You can also set …
-
I've solved this in LightningCLI the following way:

```python
# main.py
from pathlib import Path

import wandb
from pytorch_lightning.cli import LightningCLI  # older PL versions: pytorch_lightning.utilities.cli


class MyLightningCLI(LightningCLI):
    def add_arguments_to_parser(self, parser):
        parser.add_argument(
            "--resume_run_id", default="", type=str, help="W&B run ID to resume from"
        )
        # Only ask the logger to resume when a run id was actually given
        parser.link_arguments(
            "resume_run_id",
            "trainer.logger.init_args.resume",
            compute_fn=lambda x: "must" if x else "never",
        )

    def before_instantiate_classes(self):
        subcommand = self.config.subcommand
        c = self.config[subcommand]

        run_id = None
        if c.resume_run_id:
            # Resuming: download the latest model checkpoint logged to W&B for that run
            run_id = c.resume_run_id
            api = wandb.Api()
            artifact = api.artifact(
                f"{c.trainer.logger.init_args.project}/model-{run_id}:latest",
                type="model",
            )
            artifact_dir = artifact.download()
            c.ckpt_path = str(Path(artifact_dir) / "model.ckpt")
        else:
            # Fresh run: generate the W&B id up front
            run_id = wandb.util.generate_id()

        # also make sure that ModelCheckpoints go to the right place
        c.trainer.logger.init_args.id = run_id
        for callback in c.trainer.callbacks:
            if callback.class_path == "pytorch_lightning.callbacks.ModelCheckpoint":
                callback.init_args.dirpath = f"checkpoints/{run_id}"


if __name__ == "__main__":
    _cli = MyLightningCLI(
        ...
    )
```

and the relevant part of my trainer config:

```yaml
logger:
  class_path: pytorch_lightning.loggers.WandbLogger
  init_args:
    project: MyProject
    log_model: all
callbacks:
  - class_path: pytorch_lightning.callbacks.ModelCheckpoint
    init_args:
      save_top_k: 1
      save_last: True
      monitor: val/total_loss
      filename: '{epoch}-{step}-{validation_loss:.3f}'
```

Now I can just do …, and it will download the latest checkpoint and restart appropriately using …
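A usage sketch (the config file name is an assumption, not from the original reply): `python main.py fit --config config.yaml --resume_run_id <run_id>` resumes from the latest `model-<run_id>` artifact on W&B, while omitting `--resume_run_id` starts a fresh run under a newly generated id.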
-
When I start a run, I always generate a wandb id using `wandb.util.generate_id()` and save it alongside the ckpt. When a run crashes, I try to resume the trainer by providing the appropriate `ckpt_path` in `trainer.fit`, and try to resume the wandb logger by doing the following (`cfg.LOGGER.WANDB_ID` is the same wandb id that I saved in the earlier step): …
While the trainer resumes without any hiccup, the wandb logger, instead of resuming the previous run, creates a new run and logs from the resumed epoch.
Where am I going wrong, and how do I fix it?
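A minimal sketch of the setup being described, with the pattern the W&B resuming docs generally recommend: reuse the saved id and also pass `resume="allow"` so wandb attaches to the existing run instead of creating a new one. File names, the project name, and `model`/`dm` below are illustrative placeholders, not from the original post.

```python
from pathlib import Path

import wandb
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

run_id_file = Path("checkpoints/wandb_run_id.txt")  # placeholder path

if run_id_file.exists():
    # Crashed run: reuse the saved id and the last checkpoint
    run_id = run_id_file.read_text().strip()
    ckpt_path = "checkpoints/last.ckpt"
else:
    # Fresh run: generate and persist a new id
    run_id = wandb.util.generate_id()
    run_id_file.parent.mkdir(parents=True, exist_ok=True)
    run_id_file.write_text(run_id)
    ckpt_path = None

# Reusing the same id *and* asking wandb to resume is what keeps the logger on
# the old run; extra kwargs such as `resume` are forwarded to `wandb.init`.
logger = WandbLogger(project="my-project", id=run_id, resume="allow")

trainer = Trainer(logger=logger, max_epochs=10)
trainer.fit(model, datamodule=dm, ckpt_path=ckpt_path)  # model/dm defined elsewhere
```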