
feat: add thinking budget with early-stopping prompt injection#1196

Open
Thump604 wants to merge 4 commits into ml-explore:main from Thump604:feat/thinking-budget

Conversation

@Thump604

Summary

Adds thinking_budget support using Qwen's recommended early-stopping prompt injection. When thinking tokens exceed the budget, the processor forces a transition prompt token-by-token through the model's forward pass, so the model reads the injection through full attention and transitions to content from a coherent hidden state.

  • make_thinking_budget_processor() in sample_utils.py -- stateful logits processor
  • build_early_stop_tokens() -- tokenizes transition message + think-end tokens
  • make_logits_processors() -- wired with thinking_budget params
  • Server accepts thinking_budget and thinking_budget_message from chat completion requests
  • initial_thinking detects when the prompt already opened a <think> block

Why this approach

The existing approaches in vLLM and llama.cpp force </think> via logit manipulation, but the model's hidden state never processes the transition -- it generates content from an incomplete thought state. Qwen's official recommendation for open-source inference is to inject a natural-language prompt that the model reads through attention, producing a coherent transition.

Each forced token goes through generate_step's normal _step() call, which runs the full model forward pass and updates the KV cache. The model "reads" the injection the same way it reads any other token.
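The mechanics described above can be sketched as a plain-Python stateful processor. This is an illustrative reconstruction from the PR description, not the actual mlx-lm implementation: logits are shown as a plain list rather than an mx.array, think_start/think_end are treated as single token ids, and the exact off-by-one budget semantics may differ from the real code.

```python
import math

def make_thinking_budget_processor(think_start, think_end, budget,
                                   early_stop_tokens, initial_thinking=False):
    # Closure-held state, mirroring the stateful design described above.
    state = {"thinking": initial_thinking, "count": 0,
             "injecting": False, "inject_idx": 0}

    def force(logits, token_id):
        # Mask every logit except the forced token to -inf.
        out = [-math.inf] * len(logits)
        out[token_id] = 0.0
        return out

    def processor(prev_tokens, logits):
        last = prev_tokens[-1] if prev_tokens else None

        if state["injecting"]:
            tok = early_stop_tokens[state["inject_idx"]]
            state["inject_idx"] += 1
            if state["inject_idx"] == len(early_stop_tokens):
                # The injection sequence ends with the think-end token,
                # so the processor leaves thinking state afterward.
                state["injecting"] = False
                state["thinking"] = False
            return force(logits, tok)

        if state["thinking"]:
            if last == think_end:
                state["thinking"] = False  # model ended thinking naturally
            else:
                state["count"] += 1
                if state["count"] >= budget:
                    # Budget hit: start forcing the injection token-by-token.
                    state["injecting"] = True
                    state["inject_idx"] = 1
                    return force(logits, early_stop_tokens[0])
        elif last == think_start:
            state["thinking"] = True
            state["count"] = 0

        return logits  # passthrough outside thinking / under budget

    return processor
```

Because each forced token is returned as a one-hot logits mask, the surrounding generation loop still samples it, runs the forward pass, and updates the KV cache as usual.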

Validated on

Qwen3-1.7B-MLX-8bit (budget=50):

  • Thinking: model reasons about Rayleigh scattering
  • Injection: "I need to provide my answer now based on my reasoning so far."
  • Content: 2051 chars, coherent physics explanation with headers and sections

Qwen3.5-2B-OptiQ-4bit (budget=50, MoE architecture):

  • Thinking: 249 chars of analysis and planning
  • Content: 1908 chars, structured explanation with numbered sections

Both produce well-structured content after the injection. The initial_thinking=True fix handles chat templates that place <think> in the prompt.

API

# Direct usage
from mlx_lm.sample_utils import make_thinking_budget_processor, build_early_stop_tokens

early_stop = build_early_stop_tokens(tokenizer, tokenizer.think_end_tokens)
processor = make_thinking_budget_processor(
    think_start_tokens=tokenizer.think_start_tokens,
    think_end_tokens=tokenizer.think_end_tokens,
    budget=100,
    early_stop_tokens=early_stop,
    initial_thinking=True,
)
result = generate(model, tokenizer, prompt, logits_processors=[processor])

# Via server (chat completion request body)
{"thinking_budget": 100, "thinking_budget_message": "Summarize and answer now."}
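For reference, build_early_stop_tokens() presumably assembles the injection sequence along these lines. This is a hedged sketch: the default message is taken from the validation run above, and the add_special_tokens handling is an assumption, not confirmed from the diff.

```python
DEFAULT_TRANSITION = ("I need to provide my answer now "
                      "based on my reasoning so far.")

def build_early_stop_tokens(tokenizer, think_end_tokens,
                            message=DEFAULT_TRANSITION):
    # Tokenize the natural-language transition message and append the
    # think-end tokens so the injection closes the <think> block.
    return (tokenizer.encode(message, add_special_tokens=False)
            + list(think_end_tokens))
```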

Test plan

  • 11 unit tests covering passthrough, budget enforcement, full injection sequence, post-injection reset, natural end-of-thinking, and assembly function
  • Integration on Qwen3 1.7B and Qwen3.5 2B
  • python -m pytest tests/test_thinking_budget.py -v

Related: #914 (partial -- adds budget control), #1175 (MTP stale prev_tokens affects all stateful logits processors; fix suggested on #990)

Adds make_thinking_budget_processor() to sample_utils.py. The stateful logits processor tracks think_start/think_end tokens in the generated sequence and, once the thinking token count reaches the budget, injects early_stop_tokens one at a time by masking all other logits to -inf. Includes 7 unit tests covering passthrough, budget enforcement, full injection sequence, post-injection reset, and natural end-of-thinking.

Add thinking_budget, think_start_tokens, think_end_tokens, and early_stop_tokens parameters to make_logits_processors(). When all four are provided, the function appends a thinking budget processor to the list. Tests confirm the processor is created when the params are set and skipped when they are None.

Add build_early_stop_tokens() to sample_utils for constructing the injection token sequence from a tokenizer. Wire it into the server: _make_logits_processors() now accepts a tokenizer and populates the thinking_budget kwargs when the tokenizer has_thinking. Add thinking_budget and thinking_budget_message fields to GenerationArguments and parse them from the request body. Both the batched and sequential generation paths pass the tokenizer through.

Chat templates with enable_thinking=True add <think> to the prompt, not the generated output, so the processor never saw the opening tag and never entered thinking state.

Add an initial_thinking parameter so callers can signal that the prompt already opened a thinking block. The server detects this by scanning prompt tokens for think_start/think_end positions.
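The prompt-scanning detection described in the last commit could look like this. It is an illustrative sketch: detect_initial_thinking is a hypothetical name, and the real server may compare multi-token tag sequences rather than single token ids.

```python
def detect_initial_thinking(prompt_tokens, think_start, think_end):
    # The prompt left a think block open iff the last think-start token
    # appears after the last think-end token (or a start with no end).
    last_start = last_end = -1
    for i, tok in enumerate(prompt_tokens):
        if tok == think_start:
            last_start = i
        elif tok == think_end:
            last_end = i
    return last_start > last_end
```

The result is passed as initial_thinking so the processor starts counting immediately instead of waiting for an opening tag it will never see.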
@AirRunner

make_thinking_budget_processor was previously receiving stale context on MTP draft calls. prev_tokens is now fixed in #990.

Also, as already said there: the verify pass advances the processor state for bonus_tok even on rejection, and there's no way to roll it back without processors implementing an explicit save_state()/restore_state() protocol. The thinking budget counter would drift on rejections.
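A minimal sketch of the snapshot protocol proposed here, assuming a hypothetical class name and simple scalar state; mlx-lm exposes no such interface today, so this is only what the protocol could look like:

```python
class ThinkingBudgetProcessor:
    """Hypothetical stateful processor exposing the save/restore
    protocol discussed above (not an existing mlx-lm interface)."""

    def __init__(self, budget):
        self.budget = budget
        self.count = 0
        self.thinking = False

    def __call__(self, prev_tokens, logits):
        ...  # budget/injection logic as described in the PR
        return logits

    def save_state(self):
        # Snapshot taken before the speculative verify pass.
        return {"count": self.count, "thinking": self.thinking}

    def restore_state(self, snapshot):
        # Roll back after a rejected bonus token.
        self.count = snapshot["count"]
        self.thinking = snapshot["thinking"]
```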

@Thump604
Author

Noted on the drift. The thinking budget counter would over-count on rejections since the verify pass advances it for bonus_tok even when rejected. With the current interface, the only workaround is to make the processor stateless (recompute count from prev_tokens on each call) or accept the drift as bounded noise.

A save_state/restore_state protocol on the processor interface would fix this properly for all stateful processors. Will track separately.
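The stateless workaround mentioned above would recompute the counter from prev_tokens on every call, for example as below. This is illustrative; correctness still depends on prev_tokens itself being accurate, per #1175.

```python
def thinking_token_count(prev_tokens, think_start, think_end):
    # Recompute the in-think token count from scratch on each call, so
    # rejected speculative tokens never leave stale counter state behind.
    thinking = False
    count = 0
    for tok in prev_tokens:
        if tok == think_start:
            thinking, count = True, 0
        elif tok == think_end:
            thinking = False
        elif thinking:
            count += 1
    return count
```

The trade-off is an O(n) scan per generated token instead of O(1) incremental state.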

