# feat: add thinking budget with early-stopping prompt injection #1196
Thump604 wants to merge 4 commits into ml-explore:main
Conversation
Adds `make_thinking_budget_processor()` to `sample_utils.py`. The stateful logits processor tracks `think_start`/`think_end` tokens in the generated sequence and, once the thinking token count reaches the budget, injects `early_stop_tokens` one at a time by masking all other logits to `-inf`. Includes 7 unit tests covering passthrough, budget enforcement, the full injection sequence, post-injection reset, and natural end-of-thinking.
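For readers skimming the commit, a minimal sketch of what such a processor looks like (illustrative, not the PR's exact code; it assumes single-`int` think markers and mlx-lm's `(tokens, logits) -> logits` processor convention):

```python
import mlx.core as mx

def make_thinking_budget_processor(budget, think_start, think_end, early_stop_tokens):
    # Closure state: are we inside a thinking block, how many thinking tokens
    # have been generated, and how far into the injection sequence we are.
    state = {"thinking": False, "count": 0, "inject": 0}

    def processor(tokens: mx.array, logits: mx.array) -> mx.array:
        if tokens.size > 0:
            last = tokens[-1].item()
            if last == think_start:
                state.update(thinking=True, count=0, inject=0)
            elif last == think_end:
                # Natural end of thinking: reset and pass logits through.
                state.update(thinking=False, count=0, inject=0)
            elif state["thinking"]:
                state["count"] += 1

        if state["thinking"] and state["count"] >= budget:
            # Over budget: force the next injection token by masking every
            # other logit to -inf. The forced token still runs through the
            # model's normal forward pass on the following step.
            forced = early_stop_tokens[state["inject"]]
            state["inject"] += 1
            if state["inject"] >= len(early_stop_tokens):
                state.update(thinking=False, count=0, inject=0)
            masked = mx.full(logits.shape, float("-inf"), dtype=logits.dtype)
            masked[..., forced] = 0.0
            return masked
        return logits

    return processor
```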
Add `thinking_budget`, `think_start_tokens`, `think_end_tokens`, and `early_stop_tokens` parameters to `make_logits_processors()`. When all four are provided, the function appends a thinking budget processor to the list. Tests confirm the processor is created when the params are set and skipped when they are None.
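A hypothetical call once this lands (the token IDs below are stand-ins; real values come from the model's tokenizer):

```python
from mlx_lm.sample_utils import make_logits_processors

# Illustrative values only -- real think-token IDs come from the tokenizer
# and the early-stop sequence from build_early_stop_tokens().
processors = make_logits_processors(
    thinking_budget=256,
    think_start_tokens=[151667],  # assumed <think> token ID
    think_end_tokens=[151668],    # assumed </think> token ID
    early_stop_tokens=[151668],   # minimal stand-in injection sequence
)

# If any of the four is None, no thinking-budget processor is appended.
no_budget = make_logits_processors(thinking_budget=None)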
Add `build_early_stop_tokens()` to `sample_utils` for constructing the injection token sequence from a tokenizer. Wire it into the server: `_make_logits_processors()` now accepts a tokenizer and populates the thinking-budget kwargs when the tokenizer's `has_thinking` is set. Add `thinking_budget` and `thinking_budget_message` fields to `GenerationArguments` and parse them from the request body. Both the batched and sequential generation paths pass the tokenizer through.
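On the wire, this would look roughly like the request below (field names are from this commit; the endpoint, port, and values are illustrative, with mlx_lm.server's default of 8080 assumed):

```python
import json
import urllib.request

body = {
    "model": "Qwen3-1.7B-MLX-8bit",
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    # New fields parsed by GenerationArguments in this PR:
    "thinking_budget": 256,
    "thinking_budget_message": "Considering the limited time, I should give the answer now.",
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```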
Chat templates with `enable_thinking=True` add `<think>` to the prompt, not the generated output, so the processor never saw the opening tag and never entered the thinking state. Add an `initial_thinking` parameter so callers can signal that the prompt already opened a thinking block. The server detects this by scanning prompt tokens for `think_start`/`think_end` positions.
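The detection amounts to checking whether the last `think_start` in the prompt comes after the last `think_end`; a sketch (names hypothetical):

```python
def detect_initial_thinking(prompt_tokens, think_start, think_end):
    # The prompt opened a thinking block iff the most recent <think>
    # has no matching </think> after it.
    last_start = max((i for i, t in enumerate(prompt_tokens) if t == think_start), default=-1)
    last_end = max((i for i, t in enumerate(prompt_tokens) if t == think_end), default=-1)
    return last_start > last_end
```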
Also, as already said there: the verify pass advances the processor state for `bonus_tok` even when it is rejected.
Noted on the drift. The thinking budget counter would over-count on rejections, since the verify pass advances it for `bonus_tok` even when rejected. With the current interface, the only workarounds are to make the processor stateless (recompute the count from `prev_tokens` on each call) or to accept the drift as bounded noise. A `save_state`/`restore_state` protocol on the processor interface would fix this properly for all stateful processors. Will track separately.
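One possible shape for that protocol, sketched here for discussion (not part of this PR): the speculative path snapshots state before the verify pass and rolls back on rejection.

```python
class StatefulLogitsProcessor:
    def save_state(self):
        """Return an opaque snapshot of internal state."""
        raise NotImplementedError

    def restore_state(self, snapshot):
        """Roll back to a snapshot taken before rejected draft tokens."""
        raise NotImplementedError

# Hypothetical verify-pass usage:
#   snap = processor.save_state()
#   ... advance processor over draft + bonus tokens ...
#   if rejected:
#       processor.restore_state(snap)
```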
## Summary

Adds `thinking_budget` support using Qwen's recommended early-stopping prompt injection. When thinking tokens exceed the budget, the processor forces a transition prompt token-by-token through the model's forward pass, so the model reads the injection through full attention and transitions to content from a coherent hidden state.

- `make_thinking_budget_processor()` in `sample_utils.py` -- stateful logits processor
- `build_early_stop_tokens()` -- tokenizes the transition message + think-end tokens
- `make_logits_processors()` -- wired with the `thinking_budget` params
- `thinking_budget` and `thinking_budget_message` from chat completion requests
- `initial_thinking` detects when the prompt already opened a `<think>` block

## Why this approach
The existing approaches in vLLM and llama.cpp force `</think>` via logit manipulation, but the model's hidden state never processes the transition -- it generates content from an incomplete thought state. Qwen's official recommendation for open-source inference is to inject a natural-language prompt that the model reads through attention, producing a coherent transition.

Each forced token goes through `generate_step`'s normal `_step()` call, which runs the full model forward pass and updates the KV cache. The model "reads" the injection the same way it reads any other token.

## Validated on
- Qwen3-1.7B-MLX-8bit (budget=50)
- Qwen3.5-2B-OptiQ-4bit (budget=50, MoE architecture)

Both produce well-structured content after the injection. The `initial_thinking=True` fix handles chat templates that place `<think>` in the prompt.

## API
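A sketch of how the pieces fit together at the Python level, assuming the signatures implied by the commits above (the exact argument list of `build_early_stop_tokens` is not shown in this excerpt):

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import build_early_stop_tokens, make_logits_processors

# Model name as used in the validation above; any thinking model should work.
model, tokenizer = load("Qwen3-1.7B-MLX-8bit")

# Transition message + think-end tokens, built from the tokenizer.
early_stop = build_early_stop_tokens(tokenizer)

processors = make_logits_processors(
    thinking_budget=50,
    think_start_tokens=tokenizer.encode("<think>", add_special_tokens=False),
    think_end_tokens=tokenizer.encode("</think>", add_special_tokens=False),
    early_stop_tokens=early_stop,
)

# Note: if the chat template opens <think> in the prompt, the
# initial_thinking handling described above applies.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "How many primes are below 100?"}],
    add_generation_prompt=True,
)
text = generate(model, tokenizer, prompt=prompt, logits_processors=processors)
```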
## Test plan

`python -m pytest tests/test_thinking_budget.py -v`

Related: #914 (partial -- adds budget control), #1175 (MTP stale `prev_tokens` affects all stateful logits processors; fix suggested on #990)