DDP training and storing rank specific info in checkpoints #21097
Unanswered
bardsleypt asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment
-
I believe I have at least narrowed down why my approach is not working: it appears that Lightning only writes the checkpoint file from the global-rank-0 process, so whatever the other ranks add to their checkpoint dictionaries in `on_save_checkpoint` never makes it to disk.

This at least explains why I only see the rank-0 information in the saved checkpoint. So my question can now be reduced to: is there any way to synchronize, or otherwise send, the checkpoint dictionaries from each rank to the global-rank-0 process? As a workaround, I can do some pretty hacky temporary-save and load routines in the callback hooks.
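One direction I'm considering (untested sketch; class and key names are just illustrative, and it assumes the process group set up by the DDP strategy is available) is to gather every rank's RNG state inside `on_save_checkpoint` with `torch.distributed.all_gather_object`, so that the dictionary rank 0 actually writes to disk contains all of them:

```python
import random

import numpy as np
import torch
import torch.distributed as dist
from pytorch_lightning.callbacks import Callback


class GatheredRNGStates(Callback):
    """Sketch: collect every rank's RNG states into the rank-0 checkpoint."""

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        local_state = {
            "torch": torch.get_rng_state(),
            "torch_cuda": torch.cuda.get_rng_state() if torch.cuda.is_available() else None,
            "numpy": np.random.get_state(),
            "python": random.getstate(),
        }
        if dist.is_available() and dist.is_initialized():
            # Every rank contributes its state; afterwards each rank holds the
            # full list, and rank 0's copy is the one Lightning writes to disk.
            gathered = [None] * dist.get_world_size()
            dist.all_gather_object(gathered, local_state)
            checkpoint["rng_states_per_rank"] = gathered
        else:
            checkpoint["rng_states_per_rank"] = [local_state]
```

On restore, each rank could then index the list with `trainer.global_rank` inside `on_load_checkpoint` and re-seed its own generators.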
-
I'm working on preserving state between start/stop of training runs in a manner that guarantees reproducible results. That is, I'd like to be able to stop my training at any given checkpoint, restart the training from that checkpoint, run it to completion, and have the results match (exactly) the results obtained from a single continuous run. I've been able to do this on single-node setups by storing the outputs of

- `torch.get_rng_state()`
- `torch.cuda.get_rng_state()`
- `np.random.get_state()`
- `random.getstate()`

within the model checkpoint and calling the corresponding `set` methods upon loading the checkpoint. I've been performing the save/load routines within a custom `pytorch_lightning.callbacks.Callback` by overriding the `on_save_checkpoint` and `on_load_checkpoint` hooks appropriately.
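On a single node, the restore side of that callback looks roughly like this (simplified; the `rng_state` checkpoint key is just the name I use for illustration):

```python
import random

import numpy as np
import torch
from pytorch_lightning.callbacks import Callback


class RNGStateRestore(Callback):
    """Sketch of the load side: re-seed every generator from the stored states."""

    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
        state = checkpoint["rng_state"]
        torch.set_rng_state(state["torch"])
        if state.get("torch_cuda") is not None and torch.cuda.is_available():
            torch.cuda.set_rng_state(state["torch_cuda"])
        np.random.set_state(state["numpy"])
        random.setstate(state["python"])
```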
I'm now trying to perform the same checkpoint save/load procedure in a multi-node setup with a DDP strategy. My attempt was to append the global-rank-specific RNG states to the checkpoint dictionary, which I had thought would then be saved appropriately. However, when I execute the code, the only RNG state preserved within the checkpoint dictionary is the rank-0 state. Can someone please advise on how to preserve the RNG states from the other ranks within the checkpoint in a DDP setup? As a higher-level question: if there is a better way to preserve these states between training runs than checkpoint storage and re-instantiation, that information would also be welcome.
The main Callback save routine I'm using is posted below. I've been checking the contents of the saved checkpoint dictionary with a manual `torch.load()` call.

python version: 3.9.12
pytorch version: 2.2.0+cu121
pytorch_lightning version: 2.2.0
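In simplified form (class and key names are illustrative), the save routine amounts to tagging the states with the global rank:

```python
import random

import numpy as np
import torch
from pytorch_lightning.callbacks import Callback


class RNGStateSave(Callback):
    """Sketch of the save side: stash this rank's RNG states in the checkpoint dict."""

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        checkpoint[f"rng_state_rank_{trainer.global_rank}"] = {
            "torch": torch.get_rng_state(),
            "torch_cuda": torch.cuda.get_rng_state() if torch.cuda.is_available() else None,
            "numpy": np.random.get_state(),
            "python": random.getstate(),
        }
        # In the saved file I only ever see the rank-0 entry.
```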