v0.17.0
Major and breaking
The TRL v0.17 release introduces three major changes that, together, enable significantly faster generation performance in GRPO—up to 10x faster in some configurations.
These three changes are:
- Data parallelism (DP) for the vLLM server
- A new GRPO training strategy that generates once per effective batch
- Support for the V1 engine in vLLM
Below, we provide a summary of these changes and how to use them.
⚡ Up to 4x faster: Data Parallel for vLLM server
The TRL vLLM server now supports data parallelism (DP), enabling significantly faster generation speeds, especially for smaller models. This new feature can be used by adding the `--data_parallel_size N` argument when launching the vLLM server.
trl vllm-serve --model Qwen/Qwen2.5-14B-Instruct --tensor_parallel_size 2 --data_parallel_size 2
by @qgallouedec in #3310
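On the training side, GRPO just needs to be pointed at the running server. Below is a minimal sketch of the client configuration; the `vllm_server_host` and `vllm_server_port` parameter names are assumptions here and should be verified against your installed GRPOConfig, and their values must match how the server above was launched.
from trl import GRPOConfig

training_args = GRPOConfig(
    ...,
    use_vllm=True,               # generate with the external vLLM server
    vllm_server_host="0.0.0.0",  # assumption: host running `trl vllm-serve`
    vllm_server_port=8000,       # assumption: default server port
)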
☝️ [GRPO] Generate once per effective batch
Previously, GRPO made one generation request per global batch. The global batch is the concatenation of all local (per-device) batches in a single forward/backward pass, without accounting for gradient accumulation. In other words, with 8 gradient accumulation steps, GRPO would make 8 generation requests per optimization step.
Now, GRPO groups these global batches into a single "effective batch" and makes only one generation request per effective batch. Since vLLM applies optimizations that are especially effective for large batches, this new approach leads to significantly faster training overall.
No changes are required in the training script, as this is handled internally by the GRPO trainer.
by @qgallouedec in #3283
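To make the batching arithmetic concrete, here is a small worked example with illustrative numbers (plain Python, no TRL APIs; the values are not TRL defaults):
# Illustrative values, not TRL defaults
per_device_train_batch_size = 4
num_gpus = 8
gradient_accumulation_steps = 8

# Global batch: all per-device batches in one forward/backward pass
global_batch = per_device_train_batch_size * num_gpus  # 32

# Effective batch: global batch accumulated over all accumulation steps
effective_batch = global_batch * gradient_accumulation_steps  # 256

# Before: 8 generation requests of 32 completions per optimization step.
# Now: 1 request of 256 completions, which vLLM batches far more efficiently.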
⏱️ Fix vLLM server to support V1 Engine
vLLM provides two versions of its engine (V0 and V1), with V1 being significantly faster. TRL now supports the V1 engine, which requires vLLM version 0.8.3 or higher.
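There is no TRL-side flag for this: the engine version is controlled by vLLM itself. If your vLLM installation does not already default to V1, it can typically be selected with vLLM's `VLLM_USE_V1` environment variable when launching the server:
VLLM_USE_V1=1 trl vllm-serve --model Qwen/Qwen2.5-14B-Instruct
by @I-l-l-I in #3276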
👎 [GRPO] Adds option to disable dropout
Disabling dropout has been shown to stabilize training. You can now disable dropout in GRPO by setting the `disable_dropout` argument to `True` in the GRPO config.
from trl import GRPOConfig
training_args = GRPOConfig(..., disable_dropout=True)
by @edbeeching in #3234
🩺 Dr. GRPO loss
GRPO now supports the various losses proposed in the recent literature, including the Dr. GRPO loss. The loss type can be set in the GRPO config:
from trl import GRPOConfig
training_args = GRPOConfig(..., loss_type="dr_grpo")
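Besides "dr_grpo", the newly introduced `loss_type` argument also accepted "grpo" (the default) and "bnpo" at the time of this release; check the GRPOConfig documentation of your installed version for the exact set of supported values.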
by @qgallouedec in #3256
🎲 [GRPO] Make training dataset shuffle optional
The GRPO trainer now has an option to disable shuffling of the training dataset. This is useful for curriculum learning, where the order of the training data is important.
from trl import GRPOConfig
training_args = GRPOConfig(..., shuffle_dataset=False)
by @LeonEricsson in #3334
☕ Overlong-filtering for GRPO
Overlong filtering has been shown to significantly stabilize learning and improve performance. You can now use it in TRL! It simply consists of masking out the loss of truncated completions.
from trl import GRPOConfig
training_args = GRPOConfig(..., mask_truncated_completions=True)
by @shirinyamani in #3248
🐯 Integrate Liger GRPO Loss to GRPO Trainer
Liger significantly reduces peak memory usage during the loss computation. You can now use it in TRL with the `use_liger_loss` argument in the GRPO config:
from trl import GRPOConfig
training_args = GRPOConfig(..., use_liger_loss=True)
by @shivam15s in #3184
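Note that this relies on the external `liger-kernel` package (installable with `pip install liger-kernel`).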
Bug fixes
- Fix: Multi gpu hang for ORPO and CPO Trainer by @NanoCode012 in #3069
- 📊 Fix `clip_ratio` logging and better document logged values by @qgallouedec in #3145
- ⏯️ Fix: handle None inputs when resuming GRPO Trainer from checkpoint by @PenutChen in #3148
- 📎 Fix is_clipped to compute the effective clip_ratio by @pandong2011 in #3175
- 😷 Fix SFT masking EOS when equal to PAD by @qgallouedec in #3200
- ⏯️ Fix logging when resuming from checkpoint GRPO by @qgallouedec in #3185
- 💠 Fix multi-gpu padding free by @qgallouedec in #3245
- 🕷 Fix online DPO crash when model is a DataParallel object by @wilrop in #3225
- 🏁 Fix adding special tokens in SFT by @qgallouedec in #3328
- 🍡 Fix using reward model and DeepSpeed ZeRO 3 by @qgallouedec in #3326
What's Changed
- Fix: Multi gpu hang for ORPO and CPO Trainer by @NanoCode012 in #3069
- 📊 Fix `clip_ratio` logging and better document logged values by @qgallouedec in #3145
- BCOTrainer version upgrade fixes by @claralp in #2867
- 🐇 [Research] Layer Skip SFT by @ariG23498 in #3111
- 🤝 Align GRPO equation doc with the implementation by @qgallouedec in #3151
- Enable number of printed completions to be set by @lewtun in #3149
- 🩹 Fix CI by @qgallouedec in #3155
- ⚰️ Remove deprecated by @qgallouedec in #3153
- 🔫 Disable triggering CI when PR is draft by @qgallouedec in #3154
- 👨‍🍳 vLLM serve: destroy process group on exit and pass `worker_cls` as string by @qgallouedec in #3159
- 💰 Richer rich table - log all the rewards by @qgallouedec in #3156
- 💎 Gemma 3 VLM SFT example script for single-image and multi-image by @sergiopaniego in #3131
- [Liger] Liger KTO support by @vaibhavjindal in #2812
- 🏃 Migrate CI to self-hosted runners by @qgallouedec in #3174
- ❤️🩹 [CI] fix transformers dev CI failure by @kashif in #3176
- ⏯️ Fix: handle None inputs when resuming GRPO Trainer from checkpoint by @PenutChen in #3148
- 📎 Fix is_clipped to compute the effective clip_ratio by @pandong2011 in #3175
- Fix breaking typo for flash_attention reducing_memory_usage.md by @burtenshaw in #3190
- Show unique prompts in GRPO WandB tables by @lewtun in #3191
- 🐗 [CI] Fix trufflehog false positives by @lewtun in #3192
- [GRPO] Improve completion length logging by @edbeeching in #3188
- 😷 Fix SFT masking EOS when equal to PAD by @qgallouedec in #3200
- 🗝️ Fix type hint in vLLM client by @qgallouedec in #3205
- 📚 Accumulate completions for logging by @lewtun in #3217
- Group completion metrics by common prefix by @lewtun in #3212
- 🐯 Integrate Liger GRPO Loss to GRPO Trainer by @shivam15s in #3184
- Update ruff to 11.3 and base Python version to 3.9 by @cyyever in #3230
- ⏯️ Fix logging when resuming from checkpoint GRPO by @qgallouedec in #3185
- 📢 Improve GRPO trainer error message for invalid num_generations by @AliBakly in #3199
- 🎀 Simplify logging text by @qgallouedec in #3219
- 🌊 Add error for iterable datasets in GRPOTrainer by @qgallouedec in #3216
- ⏳ PPOTrainer: fix progress bar for num_mini_batches > 1 by @dawidm in #2531
- ☑ Update PULL_REQUEST_TEMPLATE.md by @qgallouedec in #3241
- 🔭 Add support for better KL estimator (k3) in PPOTrainer by @AMindToThink in #3240
- 🏃 Fix and make CI faster by @qgallouedec in #3160
- 🗑️ Deprecate `ConstantLengthDataset` by @qgallouedec in #3242
- 📦 [SFT] Deprecate batched `formatting_func` by @YeFD in #3147
- 💠 Fix multi-gpu padding free by @qgallouedec in #3245
- ☕ Overlong-filtering for GRPO by @shirinyamani in #3248
- 📜 Fix license and copyrights by @qgallouedec in #3264
- ⛏️ Add cli dict parsing for grpo_config by @Tavish9 in #3082
- 🐯 `is_liger_kernel_available` with min version by @qgallouedec in #3266
- 🕷 Fix online DPO crash when model is a DataParallel object by @wilrop in #3225
- 👎 [GRPO] Adds option to disable dropout by @edbeeching in #3234
- 🚧 Temporarily restrict diffusers to <0.33.0 due to ftfy optional dep issue breaking doc builds by @qgallouedec in #3273
- ♾️ [CI] Remove `test_raise_error_not_causallm` by @qgallouedec in #3265
- 🩺 Dr. GRPO loss by @qgallouedec in #3256
- 🔗 Fix Dr. GRPO paper link by @qgallouedec in #3275
- Add Fine-tuning a Multimodal Model Using SFT (Single or Multi-Image Dataset) guide to docs by @sergiopaniego in #3235
- 🕊️ Un-restrict diffusers by @qgallouedec in #3274
- 🦾 Test vLLM client-server by @qgallouedec in #3277
- ⏱️ Fix vLLM server to support V1 Engine by @I-l-l-I in #3276
- Expose EOS token in SFTConfig by @lewtun in #3299
- 🏷️ Fixed naming error in output_dir for Gemma 3 VLM script by @sergiopaniego in #3297
- 🧗 Add Ascend NPU support for vLLM server by @ji-huazhong in #3286
- 🅾️ Fixes typo in SFTTrainer by @taras-sereda in #3282
- [GRPO] Add metrics for low and high clipped token probabilities by @lewtun in #3289
- ☝️ [GRPO] Generate once per effective batch by @qgallouedec in #3283
- 🎲 [GRPO] Make training dataset shuffle optional by @LeonEricsson in #3334
- 🙋 Add Optional Eager Execution Mode for vLLM Serving by @ucalyptus in #3335
- Fix typo in text_environments.md by @sunjin-k in #3305
- ✅ [doc] Update sft_trainer.md in table x->✓ by @HERIUN in #3313
- 🧸 Fix unset tokenizer pad_token by @LeonEricsson in #3290
- 💡 Fix type hint in `_generate_and_score_completions` by @syt-nju in #3336
- 🦄 Add optional uvicorn log level for vLLM serve by @I-l-l-I in #3338
- [CPO] Check that `max_prompt_length < max_length` by @LeonEricsson in #3341
- 🏁 Fix adding special tokens in SFT by @qgallouedec in #3328
- Define default chat template for SFT by @lewtun in #3309
- 🍡 Fix using reward model and DeepSpeed ZeRO 3 by @qgallouedec in #3326
- ⚡ Up to 4x faster: Data Parallel for vLLM server by @qgallouedec in #3310
- Release: v0.17 by @qgallouedec in #3356
New Contributors
- @NanoCode012 made their first contribution in #3069
- @ariG23498 made their first contribution in #3111
- @PenutChen made their first contribution in #3148
- @pandong2011 made their first contribution in #3175
- @shivam15s made their first contribution in #3184
- @cyyever made their first contribution in #3230
- @AMindToThink made their first contribution in #3240
- @YeFD made their first contribution in #3147
- @Tavish9 made their first contribution in #3082
- @wilrop made their first contribution in #3225
- @I-l-l-I made their first contribution in #3276
- @taras-sereda made their first contribution in #3282
- @LeonEricsson made their first contribution in #3334
- @ucalyptus made their first contribution in #3335
- @sunjin-k made their first contribution in #3305
- @HERIUN made their first contribution in #3313
- @syt-nju made their first contribution in #3336
Full Changelog: v0.16.0...v0.17.0