
RuntimeError: Expected to mark a variable ready only once when training with PEFT #4782

@qgallouedec

Description


Reproduction

from datasets import load_dataset
from trl import DPOTrainer
from peft import LoraConfig

if __name__ == "__main__":
    # Small preference dataset used for TRL's internal tests
    dataset = load_dataset("trl-internal-testing/zen", "standard_preference", split="train")

    # DPO training with a default LoRA adapter attached via peft_config
    trainer = DPOTrainer(
        model="Qwen/Qwen2.5-0.5B",
        train_dataset=dataset,
        peft_config=LoraConfig(),
    )
    trainer.train()
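Note: the traceback below comes from a two-process distributed run (see the [rank0]/[rank1] prefixes). The exact launch command is not shown and is an assumption, but something along the lines of accelerate launch --num_processes 2 bug_dpo_peft.py matches the setup.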

Running the script outputs:

/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/utils/checkpoint.py:85: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/utils/checkpoint.py:85: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
[rank1]: Traceback (most recent call last):
[rank1]:   File "/fsx/qgallouedec/trl/bug_dpo_peft.py", line 13, in <module>
[rank1]:     trainer.train()
[rank1]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/transformers/trainer.py", line 2325, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/transformers/trainer.py", line 2674, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/transformers/trainer.py", line 4071, in training_step
[rank1]:     self.accelerator.backward(loss, **kwargs)
[rank1]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/accelerate/accelerator.py", line 2852, in backward
[rank1]:     loss.backward(**kwargs)
[rank1]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/_tensor.py", line 647, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/function.py", line 311, in apply
[rank1]:     return user_fn(self, *args)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 319, in backward
[rank1]:     torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank1]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank1]: Parameter at index 95 with name base_model.model.model.layers.23.self_attn.v_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/fsx/qgallouedec/trl/bug_dpo_peft.py", line 13, in <module>
[rank0]:     trainer.train()
[rank0]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/transformers/trainer.py", line 2325, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/transformers/trainer.py", line 2674, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/transformers/trainer.py", line 4071, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/accelerate/accelerator.py", line 2852, in backward
[rank0]:     loss.backward(**kwargs)
[rank0]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/_tensor.py", line 647, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/function.py", line 311, in apply
[rank0]:     return user_fn(self, *args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 319, in backward
[rank0]:     torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank0]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank0]: Parameter at index 95 with name base_model.model.model.layers.23.self_attn.v_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.
[rank0]:[W107 14:35:53.391642689 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
  0%|          | 0/6 [00:02<?, ?it/s] 
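As a possible workaround, and only as an assumption based on the error text (which points at reused parameters in reentrant backward passes from activation checkpointing under DDP), switching to the non-reentrant gradient-checkpointing implementation sometimes avoids the duplicate autograd hooks. Below is a minimal sketch, assuming DPOConfig forwards gradient_checkpointing_kwargs the same way transformers.TrainingArguments does; it is not a verified fix for this issue, and the output_dir value is a placeholder.

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig

if __name__ == "__main__":
    dataset = load_dataset("trl-internal-testing/zen", "standard_preference", split="train")

    # Hypothetical workaround: keep gradient checkpointing enabled but use the
    # non-reentrant implementation, which the RuntimeError above suggests may
    # avoid marking the same LoRA parameter ready twice under DDP.
    training_args = DPOConfig(
        output_dir="dpo-qwen2.5-0.5b-lora",  # placeholder output path
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
    )

    trainer = DPOTrainer(
        model="Qwen/Qwen2.5-0.5B",
        args=training_args,
        train_dataset=dataset,
        peft_config=LoraConfig(),
    )
    trainer.train()

The error message also mentions _set_static_graph() as a DDP-level workaround, but that requires reaching into the wrapped DistributedDataParallel module, so it is not shown here.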

System Info

  • Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
  • Python version: 3.12.12
  • TRL version: 0.27.0.dev0+db868c5
  • PyTorch version: 2.8.0
  • accelerator(s): NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3
  • Transformers version: 4.57.3
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • Datasets version: 4.4.2
  • HF Hub version: 0.36.0
  • bitsandbytes version: 0.49.0
  • DeepSpeed version: 0.18.3
  • Liger-Kernel version: 0.6.4
  • LLM-Blender version: 0.0.2
  • OpenAI version: 2.8.1
  • PEFT version: 0.18.0
  • vLLM version: 0.10.2

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
  • Any traceback provided is complete
