v0.17.0
Major and breaking
The TRL v0.17 release introduces three major changes that, together, enable significantly faster generation performance in GRPO—up to 10x faster in some configurations.
These three changes are:
- Data parallelism (DP) for the vLLM server
- A new GRPO training strategy that generates once per effective batch
- Support for the V1 engine in vLLM
Below, we provide a summary of these changes and how to use them.
⚡ Up to 4x faster: Data Parallel for vLLM server
The TRL vLLM server now supports data parallelism (DP), enabling significantly faster generation speeds, especially for smaller models. This new feature can be used by adding the `--data_parallel_size N` argument when launching the vLLM server.
trl vllm-serve --model Qwen/Qwen2.5-14B-Instruct --tensor_parallel_size 2 --data_parallel_size 2
by @qgallouedec in #3310
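On the training side, GRPO just needs to be pointed at the running server. Below is a minimal sketch of the client configuration; the `vllm_server_host` and `vllm_server_port` parameter names are assumptions here and should be verified against your installed GRPOConfig, and their values must match how the server above was launched.
from trl import GRPOConfig

training_args = GRPOConfig(
    ...,
    use_vllm=True,               # generate with the external vLLM server
    vllm_server_host="0.0.0.0",  # assumption: host running `trl vllm-serve`
    vllm_server_port=8000,       # assumption: default server port
)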
☝️ [GRPO] Generate once per effective batch
Previously, GRPO made one generation request per global batch. The global batch is the concatenation of all local (per-device) batches in a single forward/backward pass, without accounting for gradient accumulation. In other words, with 8 gradient accumulation steps, GRPO would make 8 generation requests per optimization step.
Now, GRPO groups these global batches into a single "effective batch" and makes only one generation request per effective batch. Since vLLM applies optimizations that are especially effective for large batches, this new approach leads to significantly faster training overall.
No changes are required in the training script, as this is handled internally by the GRPO trainer.
by @qgallouedec in #3283
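To make the batching arithmetic concrete, here is a small worked example with illustrative numbers (plain Python, no TRL APIs; the values are not TRL defaults):
# Illustrative values, not TRL defaults
per_device_train_batch_size = 4
num_gpus = 8
gradient_accumulation_steps = 8

# Global batch: all per-device batches in one forward/backward pass
global_batch = per_device_train_batch_size * num_gpus  # 32

# Effective batch: global batch accumulated over all accumulation steps
effective_batch = global_batch * gradient_accumulation_steps  # 256

# Before: 8 generation requests of 32 completions per optimization step.
# Now: 1 request of 256 completions, which vLLM batches far more efficiently.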
⏱️ Fix vLLM server to support V1 Engine
vLLM provides two versions of its engine (V0 and V1), with V1 being significantly faster. TRL now supports the V1 engine, which requires vLLM version 0.8.3 or higher.
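There is no TRL-side flag for this: the engine version is controlled by vLLM itself. If your vLLM installation does not already default to V1, it can typically be selected with vLLM's `VLLM_USE_V1` environment variable when launching the server:
VLLM_USE_V1=1 trl vllm-serve --model Qwen/Qwen2.5-14B-Instruct
by @I-l-l-I in #3276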
👎 [GRPO] Adds option to disable dropout
Disabling dropout has been shown to stabilize training. You can now disable dropout in GRPO by setting the `disable_dropout` argument to `True` in the GRPO config.
from trl import GRPOConfig
training_args = GRPOConfig(..., disable_dropout=True)
by @edbeeching in #3234
🩺 Dr. GRPO loss
GRPO now supports the various losses proposed in the recent literature, including the Dr. GRPO loss. The loss type can be set in the GRPO config:
from trl import GRPOConfig
training_args = GRPOConfig(..., loss_type="dr_grpo")
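Besides "dr_grpo", the newly introduced `loss_type` argument also accepted "grpo" (the default) and "bnpo" at the time of this release; check the GRPOConfig documentation of your installed version for the exact set of supported values.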
by @qgallouedec in #3256
🎲 [GRPO] Make training dataset shuffle optional
The GRPO trainer now has an option to disable shuffling of the training dataset. This is useful for curriculum learning, where the order of the training data is important.
from trl import GRPOConfig
training_args = GRPOConfig(..., shuffle_dataset=False)
by @LeonEricsson in #3334
☕ Overlong-filtering for GRPO
Overlong filtering has been shown to significantly stabilize learning and improve performance. You can now use it in TRL! It simply consists of masking out the loss of truncated completions.
from trl import GRPOConfig
training_args = GRPOConfig(..., mask_truncated_completions=True)
by @shirinyamani in #3248
🐯 Integrate Liger GRPO Loss to GRPO Trainer
Liger significantly reduces peak memory usage during the loss computation. You can now use it in TRL with the `use_liger_loss` argument in the GRPO config:
from trl import GRPOConfig
training_args = GRPOConfig(..., use_liger_loss=True)
by @shivam15s in #3184
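Note that this relies on the external `liger-kernel` package (installable with `pip install liger-kernel`).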
Bug fixes
- Fix: Multi gpu hang for ORPO and CPO Trainer by @NanoCode012 in #3069
- 📊 Fix `clip_ratio` logging and better document logged values by @qgallouedec in #3145
- ⏯️ Fix: handle None inputs when resuming GRPO Trainer from checkpoint by @PenutChen in #3148
- 📎 Fix is_clipped to compute the effective clip_ratio by @pandong2011 in #3175
- 😷 Fix SFT masking EOS when equal to PAD by @qgallouedec in #3200
- ⏯️ Fix logging when resuming from checkpoint GRPO by @qgallouedec in #3185
- 💠 Fix multi-gpu padding free by @qgallouedec in #3245
- 🕷 Fix online DPO crash when model is a DataParallel object by @wilrop in #3225
- 🏁 Fix adding special tokens in SFT by @qgallouedec in #3328
- 🍡 Fix using reward model and DeepSpeed ZeRO 3 by @qgallouedec in #3326
What's Changed
- Fix: Multi gpu hang for ORPO and CPO Trainer by @NanoCode012 in #3069
- 📊 Fix `clip_ratio` logging and better document logged values by @qgallouedec in #3145
- BCOTrainer version upgrade fixes by @claralp in #2867
- 🐇 [Research] Layer Skip SFT by @ariG23498 in #3111
- 🤝 Align GRPO equation doc with the implementation by @qgallouedec in #3151
- Enable number of printed completions to be set by @lewtun in #3149
- 🩹 Fix CI by @qgallouedec in #3155
- ⚰️ Remove deprecated by @qgallouedec in #3153
- 🔫 Disable triggering CI when PR is draft by @qgallouedec in #3154
- 👨‍🍳 vLLM serve: destroy process group on exit and pass `worker_cls` as string by @qgallouedec in #3159
- 💰 Richer rich table - log all the rewards by @qgallouedec in #3156
- 💎 Gemma 3 VLM SFT example script for single-image and multi-image by @sergiopaniego in #3131
- [Liger] Liger KTO support by @vaibhavjindal in #2812
- 🏃 Migrate CI to self-hosted runners by @qgallouedec in #3174
- ❤️🩹 [CI] fix transformers dev CI failure by @kashif in #3176
- ⏯️ Fix: handle None inputs when resuming GRPO Trainer from checkpoint by @PenutChen in #3148
- 📎 Fix is_clipped to compute the effective clip_ratio by @pandong2011 in #3175
- Fix breaking typo for flash_attention reducing_memory_usage.md by @burtenshaw in #3190
- Show unique prompts in GRPO WandB tables by @lewtun in #3191
- 🐗 [CI] Fix trufflehog false positives by @lewtun in #3192
- [GRPO] Improve completion length logging by @edbeeching in #3188
- 😷 Fix SFT masking EOS when equal to PAD by @qgallouedec in #3200
- 🗝️ Fix type hint in vLLM client by @qgallouedec in #3205
- 📚 Accumulate completions for logging by @lewtun in #3217
- Group completion metrics by common prefix by @lewtun in #3212
- 🐯 Integrate Liger GRPO Loss to GRPO Trainer by @shivam15s in #3184
- Update ruff to 11.3 and base Python version to 3.9 by @cyyever in #3230
- ⏯️ Fix logging when resuming from checkpoint GRPO by @qgallouedec in #3185
- 📢 Improve GRPO trainer error message for invalid num_generations by @AliBakly in #3199
- 🎀 Simplify logging text by @qgallouedec in #3219
- 🌊 Add error for iterable datasets in GRPOTrainer by @qgallouedec in #3216
- ⏳ PPOTrainer: fix progress bar for num_mini_batches > 1 by @dawidm in #2531
- ☑ Update PULL_REQUEST_TEMPLATE.md by @qgallouedec in #3241
- 🔭 Add support for better KL estimator (k3) in PPOTrainer by @AMindToThink in #3240
- 🏃 Fix and make CI faster by @qgallouedec in #3160
- 🗑️ Deprecate `ConstantLengthDataset` by @qgallouedec in #3242
- 📦 [SFT] Deprecate batched `formatting_func` by @YeFD in #3147
- 💠 Fix multi-gpu padding free by @qgallouedec in #3245
- ☕ Overlong-filtering for GRPO by @shirinyamani in #3248
- 📜 Fix license and copyrights by @qgallouedec in #3264
- ⛏️ Add cli dict parsing for grpo_config by @Tavish9 in #3082
- 🐯 `is_liger_kernel_available` with min version by @qgallouedec in #3266
- 🕷 Fix online DPO crash when model is a DataParallel object by @wilrop in #3225
- 👎 [GRPO] Adds option to disable dropout by @edbeeching in #3234
- 🚧 Temporarily restrict diffusers to <0.33.0 due to ftfy optional dep issue breaking doc builds by @qgallouedec in #3273
- ♾️ [CI] Remove `test_raise_error_not_causallm` by @qgallouedec in #3265
- 🩺 Dr. GRPO loss by @qgallouedec in #3256
- 🔗 Fix Dr. GRPO paper link by @qgallouedec in #3275
- Add Fine-tuning a Multimodal Model Using SFT (Single or Multi-Image Dataset) guide to docs by @sergiopaniego in #3235
- 🕊️ Un-restrict diffusers by @qgallouedec in #3274
- 🦾 Test vLLM client-server by @qgallouedec in #3277
- ⏱️ Fix vLLM server to support V1 Engine by @I-l-l-I in #3276
- Expose EOS token in SFTConfig by @lewtun in #3299
- 🏷️ Fixed naming error in output_dir for Gemma 3 VLM script by @sergiopaniego in #3297
- 🧗 Add Ascend NPU support for vLLM server by @ji-huazhong in #3286
- 🅾️ Fixes typo in SFTTrainer by @taras-sereda in #3282
- [GRPO] Add metrics for low and high clipped token probabilities by @lewtun in #3289
- ☝️ [GRPO] Generate once per effective batch by @qgallouedec in #3283
- 🎲 [GRPO] Make training dataset shuffle optional by @LeonEricsson in #3334
- 🙋 Add Optional Eager Execution Mode for vLLM Serving by @ucalyptus in #3335
- Fix typo in text_environments.md by @sunjin-k in #3305
- ✅ [doc] Update sft_trainer.md in table x->✓ by @HERIUN in #3313
- 🧸 Fix unset tokenizer pad_token by @LeonEricsson in #3290
- 💡 Fix type hint in `_generate_and_score_completions` by @syt-nju in #3336
- 🦄 Add optional uvicorn log level for vLLM serve by @I-l-l-I in #3338
- [CPO] Check that `max_prompt_length < max_length` by @LeonEricsson in #3341
- 🏁 Fix adding special tokens in SFT by @qgallouedec in #3328
- Define default chat template for SFT by @lewtun in #3309
- 🍡 Fix using reward model and DeepSpeed ZeRO 3 by @qgallouedec in #3326
- ⚡ Up to 4x faster: Data Parallel for vLLM server by @qgallouedec in #3310
- Release: v0.17 by @qgallouedec in #3356
New Contributors
- @NanoCode012 made their first contribution in #3069
- @ariG23498 made their first contribution in #3111
- @PenutChen made their first contribution in #3148
- @pandong2011 made their first contribution in #3175
- @shivam15s made their first contribution in #3184
- @cyyever made their first contribution in #3230
- @AMindToThink made their first contribution in #3240
- @YeFD made their first contribution in #3147
- @Tavish9 made their first contribution in #3082
- @wilrop made their first contribution in #3225
- @I-l-l-I made their first contribution in #3276
- @taras-sereda made their first contribution in #3282
- @LeonEricsson made their first contribution in #3334
- @ucalyptus made their first contribution in #3335
- @sunjin-k made their first contribution in #3305
- @HERIUN made their first contribution in #3313
- @syt-nju made their first contribution in #3336
Full Changelog: v0.16.0...v0.17.0