Description
Reproduction
from datasets import load_dataset
from trl import DPOTrainer
from peft import LoraConfig
if __name__ == "__main__":
    dataset = load_dataset("trl-internal-testing/zen", "standard_preference", split="train")
    trainer = DPOTrainer(
        model="Qwen/Qwen2.5-0.5B",
        train_dataset=dataset,
        peft_config=LoraConfig(),
    )
    trainer.train()

outputs:
/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/utils/checkpoint.py:85: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/utils/checkpoint.py:85: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
[rank1]: Traceback (most recent call last):
[rank1]: File "/fsx/qgallouedec/trl/bug_dpo_peft.py", line 13, in <module>
[rank1]: trainer.train()
[rank1]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/transformers/trainer.py", line 2325, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/transformers/trainer.py", line 2674, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/transformers/trainer.py", line 4071, in training_step
[rank1]: self.accelerator.backward(loss, **kwargs)
[rank1]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/accelerate/accelerator.py", line 2852, in backward
[rank1]: loss.backward(**kwargs)
[rank1]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/_tensor.py", line 647, in backward
[rank1]: torch.autograd.backward(
[rank1]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/function.py", line 311, in apply
[rank1]: return user_fn(self, *args)
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 319, in backward
[rank1]: torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank1]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank1]: Parameter at index 95 with name base_model.model.model.layers.23.self_attn.v_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
[rank0]: Traceback (most recent call last):
[rank0]: File "/fsx/qgallouedec/trl/bug_dpo_peft.py", line 13, in <module>
[rank0]: trainer.train()
[rank0]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/transformers/trainer.py", line 2325, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/transformers/trainer.py", line 2674, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/transformers/trainer.py", line 4071, in training_step
[rank0]: self.accelerator.backward(loss, **kwargs)
[rank0]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/accelerate/accelerator.py", line 2852, in backward
[rank0]: loss.backward(**kwargs)
[rank0]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/_tensor.py", line 647, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/function.py", line 311, in apply
[rank0]: return user_fn(self, *args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 319, in backward
[rank0]: torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank0]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank0]: Parameter at index 95 with name base_model.model.model.layers.23.self_attn.v_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
[rank0]:[W107 14:35:53.391642689 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
0%| | 0/6 [00:02<?, ?it/s]
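Possible workaround
The checkpoint.py warnings above suggest gradient checkpointing is active, and the "marked as ready twice" error is the usual symptom of reentrant gradient checkpointing under DDP. A possible workaround, untested for this exact crash, is to explicitly request non-reentrant checkpointing through DPOConfig (gradient_checkpointing and gradient_checkpointing_kwargs are standard transformers TrainingArguments fields that DPOConfig inherits; the output_dir below is a hypothetical placeholder):
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig

if __name__ == "__main__":
    dataset = load_dataset("trl-internal-testing/zen", "standard_preference", split="train")
    # Sketch only: opt into non-reentrant gradient checkpointing, which in many
    # setups avoids the DDP "Expected to mark a variable ready only once" failure.
    training_args = DPOConfig(
        output_dir="dpo-qwen-lora",  # hypothetical output directory
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
    )
    trainer = DPOTrainer(
        model="Qwen/Qwen2.5-0.5B",
        args=training_args,
        train_dataset=dataset,
        peft_config=LoraConfig(),
    )
    trainer.train()
To check whether it actually works around the issue, this sketch would need to be launched the same way as the reproduction above (multi-GPU DDP, as indicated by the rank-tagged output).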
System Info
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.12.12
- TRL version: 0.27.0.dev0+db868c5
- PyTorch version: 2.8.0
- accelerator(s): NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3
- Transformers version: 4.57.3
- Accelerate version: 1.12.0
- Accelerate config: not found
- Datasets version: 4.4.2
- HF Hub version: 0.36.0
- bitsandbytes version: 0.49.0
- DeepSpeed version: 0.18.3
- Liger-Kernel version: 0.6.4
- LLM-Blender version: 0.0.2
- OpenAI version: 2.8.1
- PEFT version: 0.18.0
- vLLM version: 0.10.2
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
- Any traceback provided is complete