
feat: add thinking budget with early-stopping prompt injection#1196

Open
Thump604 wants to merge 4 commits into ml-explore:main from Thump604:feat/thinking-budget

Conversation

@Thump604

Summary

Adds thinking_budget support using Qwen's recommended early-stopping prompt injection. When thinking tokens exceed the budget, the processor forces a transition prompt token-by-token through the model's forward pass, so the model reads the injection through full attention and transitions to content from a coherent hidden state.

  • make_thinking_budget_processor() in sample_utils.py -- stateful logits processor
  • build_early_stop_tokens() -- tokenizes transition message + think-end tokens
  • make_logits_processors() -- wired with thinking_budget params
  • Server accepts thinking_budget and thinking_budget_message from chat completion requests
  • initial_thinking detects when the prompt already opened a <think> block

Why this approach

The existing approaches in vLLM and llama.cpp force </think> via logit manipulation, but the model's hidden state never processes the transition -- it generates content from an incomplete thought state. Qwen's official recommendation for open-source inference is to inject a natural-language prompt that the model reads through attention, producing a coherent transition.

Each forced token goes through generate_step's normal _step() call, which runs the full model forward pass and updates the KV cache. The model "reads" the injection the same way it reads any other token.
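The mechanics described above can be sketched as a plain-Python stateful processor. This is an illustrative reconstruction from the PR description, not the actual mlx-lm implementation: logits are shown as a plain list rather than an mx.array, think_start/think_end are treated as single token ids, and the exact off-by-one budget semantics may differ from the real code.

```python
import math

def make_thinking_budget_processor(think_start, think_end, budget,
                                   early_stop_tokens, initial_thinking=False):
    # Closure-held state, mirroring the stateful design described above.
    state = {"thinking": initial_thinking, "count": 0,
             "injecting": False, "inject_idx": 0}

    def force(logits, token_id):
        # Mask every logit except the forced token to -inf.
        out = [-math.inf] * len(logits)
        out[token_id] = 0.0
        return out

    def processor(prev_tokens, logits):
        last = prev_tokens[-1] if prev_tokens else None

        if state["injecting"]:
            tok = early_stop_tokens[state["inject_idx"]]
            state["inject_idx"] += 1
            if state["inject_idx"] == len(early_stop_tokens):
                # The injection sequence ends with the think-end token,
                # so the processor leaves thinking state afterward.
                state["injecting"] = False
                state["thinking"] = False
            return force(logits, tok)

        if state["thinking"]:
            if last == think_end:
                state["thinking"] = False  # model ended thinking naturally
            else:
                state["count"] += 1
                if state["count"] >= budget:
                    # Budget hit: start forcing the injection token-by-token.
                    state["injecting"] = True
                    state["inject_idx"] = 1
                    return force(logits, early_stop_tokens[0])
        elif last == think_start:
            state["thinking"] = True
            state["count"] = 0

        return logits  # passthrough outside thinking / under budget

    return processor
```

Because each forced token is returned as a one-hot logits mask, the surrounding generation loop still samples it, runs the forward pass, and updates the KV cache as usual.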

Validated on

Qwen3-1.7B-MLX-8bit (budget=50):

  • Thinking: model reasons about Rayleigh scattering
  • Injection: "I need to provide my answer now based on my reasoning so far."
  • Content: 2051 chars, coherent physics explanation with headers and sections

Qwen3.5-2B-OptiQ-4bit (budget=50, MoE architecture):

  • Thinking: 249 chars of analysis and planning
  • Content: 1908 chars, structured explanation with numbered sections

Both produce well-structured content after the injection. The initial_thinking=True fix handles chat templates that place <think> in the prompt.

API

# Direct usage
from mlx_lm.sample_utils import make_thinking_budget_processor, build_early_stop_tokens

early_stop = build_early_stop_tokens(tokenizer, tokenizer.think_end_tokens)
processor = make_thinking_budget_processor(
    think_start_tokens=tokenizer.think_start_tokens,
    think_end_tokens=tokenizer.think_end_tokens,
    budget=100,
    early_stop_tokens=early_stop,
    initial_thinking=True,
)
result = generate(model, tokenizer, prompt, logits_processors=[processor])

# Via server (chat completion request body)
{"thinking_budget": 100, "thinking_budget_message": "Summarize and answer now."}
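For reference, build_early_stop_tokens() presumably assembles the injection sequence along these lines. This is a hedged sketch: the default message is taken from the validation run above, and the add_special_tokens handling is an assumption, not confirmed from the diff.

```python
DEFAULT_TRANSITION = ("I need to provide my answer now "
                      "based on my reasoning so far.")

def build_early_stop_tokens(tokenizer, think_end_tokens,
                            message=DEFAULT_TRANSITION):
    # Tokenize the natural-language transition message and append the
    # think-end tokens so the injection closes the <think> block.
    return (tokenizer.encode(message, add_special_tokens=False)
            + list(think_end_tokens))
```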

Test plan

  • 11 unit tests covering passthrough, budget enforcement, full injection sequence, post-injection reset, natural end-of-thinking, and assembly function
  • Integration on Qwen3 1.7B and Qwen3.5 2B
  • python -m pytest tests/test_thinking_budget.py -v

Related: #914 (partial -- adds budget control), #1175 (MTP stale prev_tokens affects all stateful logits processors; fix suggested on #990)

Adds make_thinking_budget_processor() to sample_utils.py. The stateful logits processor tracks think_start/think_end tokens in the generated sequence and, once the thinking token count reaches the budget, injects early_stop_tokens one at a time by masking all other logits to -inf. Includes 7 unit tests covering passthrough, budget enforcement, full injection sequence, post-injection reset, and natural end-of-thinking.

Add thinking_budget, think_start_tokens, think_end_tokens, and early_stop_tokens parameters to make_logits_processors(). When all four are provided, the function appends a thinking budget processor to the list. Tests confirm the processor is created when the params are set and skipped when they are None.

Add build_early_stop_tokens() to sample_utils for constructing the injection token sequence from a tokenizer. Wire it into the server: _make_logits_processors() now accepts a tokenizer and populates the thinking_budget kwargs when the tokenizer has_thinking. Add thinking_budget and thinking_budget_message fields to GenerationArguments and parse them from the request body. Both the batched and sequential generation paths pass the tokenizer through.

Chat templates with enable_thinking=True add <think> to the prompt, not the generated output, so the processor never saw the opening tag and never entered thinking state.

Add an initial_thinking parameter so callers can signal that the prompt already opened a thinking block. The server detects this by scanning prompt tokens for think_start/think_end positions.
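The prompt-scanning detection described in the last commit could look like this. It is an illustrative sketch: detect_initial_thinking is a hypothetical name, and the real server may compare multi-token tag sequences rather than single token ids.

```python
def detect_initial_thinking(prompt_tokens, think_start, think_end):
    # The prompt left a think block open iff the last think-start token
    # appears after the last think-end token (or a start with no end).
    last_start = last_end = -1
    for i, tok in enumerate(prompt_tokens):
        if tok == think_start:
            last_start = i
        elif tok == think_end:
            last_end = i
    return last_start > last_end
```

The result is passed as initial_thinking so the processor starts counting immediately instead of waiting for an opening tag it will never see.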
@AirRunner

make_thinking_budget_processor was previously receiving stale context on MTP draft calls. prev_tokens is now fixed in #990.

Also, as already said there: the verify pass advances the processor state for bonus_tok even on rejection, and there's no way to roll it back without processors implementing an explicit save_state()/restore_state() protocol. The thinking budget counter would drift on rejections.
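A minimal sketch of the snapshot protocol proposed here, assuming a hypothetical class name and simple scalar state; mlx-lm exposes no such interface today, so this is only what the protocol could look like:

```python
class ThinkingBudgetProcessor:
    """Hypothetical stateful processor exposing the save/restore
    protocol discussed above (not an existing mlx-lm interface)."""

    def __init__(self, budget):
        self.budget = budget
        self.count = 0
        self.thinking = False

    def __call__(self, prev_tokens, logits):
        ...  # budget/injection logic as described in the PR
        return logits

    def save_state(self):
        # Snapshot taken before the speculative verify pass.
        return {"count": self.count, "thinking": self.thinking}

    def restore_state(self, snapshot):
        # Roll back after a rejected bonus token.
        self.count = snapshot["count"]
        self.thinking = snapshot["thinking"]
```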

@Thump604
Author

Noted on the drift. The thinking budget counter would over-count on rejections since the verify pass advances it for bonus_tok even when rejected. With the current interface, the only workaround is to make the processor stateless (recompute count from prev_tokens on each call) or accept the drift as bounded noise.

A save_state/restore_state protocol on the processor interface would fix this properly for all stateful processors. Will track separately.
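The stateless workaround mentioned above would recompute the counter from prev_tokens on every call, for example as below. This is illustrative; correctness still depends on prev_tokens itself being accurate, per #1175.

```python
def thinking_token_count(prev_tokens, think_start, think_end):
    # Recompute the in-think token count from scratch on each call, so
    # rejected speculative tokens never leave stale counter state behind.
    thinking = False
    count = 0
    for tok in prev_tokens:
        if tok == think_start:
            thinking, count = True, 0
        elif tok == think_end:
            thinking = False
        elif thinking:
            count += 1
    return count
```

The trade-off is an O(n) scan per generated token instead of O(1) incremental state.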

