Libraries/MLXLMCommon/Evaluate.swift
```swift
processor?.didSample(token: token)
y = .init(tokens: token)
mainState = result.state
asyncEval(y.tokens)
```
I think the python code doesn't do this. It isn't incorrect, but it might collide with the prefill of the draft (below).
Good catch! We do batch processing for this token along with the drafted tokens in the verify step, so I think there's no reason to force evaluation here. Removed.
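The verify step referred to here can be illustrated with a small stand-alone sketch: the last sampled token and the drafted tokens are scored by the main model in one batch, and the accepted output is the longest prefix of draft tokens the main model agrees with, plus one correction (or bonus) token. The function below is an illustrative greedy-sampling version, not the actual `SpeculativeTokenIterator` code:

```swift
/// Given the main model's sampled token at each position of the verify
/// batch (draft count + 1 outputs) and the draft tokens, return the
/// tokens that are actually emitted. Greedy (temperature-0) variant.
func acceptTokens(mainTokens: [Int], draftTokens: [Int]) -> [Int] {
    precondition(mainTokens.count == draftTokens.count + 1)
    var accepted: [Int] = []
    for (i, draft) in draftTokens.enumerated() {
        if mainTokens[i] == draft {
            // draft agreed with the main model at this position
            accepted.append(draft)
        } else {
            // disagreement: emit the main model's correction and stop
            accepted.append(mainTokens[i])
            return accepted
        }
    }
    // all drafts accepted; the main model's extra token comes for free
    accepted.append(mainTokens.last!)
    return accepted
}
```

With greedy sampling this equality check is exact; acceptance under stochastic sampling is more involved, but the batch structure is the same.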
```swift
parameters: GenerateParameters,
context: ModelContext,
draftModel: any LanguageModel,
draftCache: [KVCache]? = nil,
```
Does this need the main KVCache? or is there no need to keep that around?
Oh, it's actually something I missed. Added parameters for the main cache 👍
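For context, a hypothetical shape of the widened entry point after this change: both the main-model cache and the draft cache can be supplied by the caller, with fresh caches created when `nil` is passed. All names here (`ToyCache`, `speculativeGenerate`) are illustrative assumptions, not the actual MLXLMCommon API:

```swift
/// Illustrative stand-in for a per-layer KV cache.
struct ToyCache { var offset = 0 }

/// Sketch: callers can reuse both caches across turns; nil means
/// "create a fresh cache for me".
func speculativeGenerate(
    prompt: [Int],
    cache: [ToyCache]? = nil,      // main model KV cache (the added parameter)
    draftCache: [ToyCache]? = nil  // draft model KV cache
) -> (mainLayers: Int, draftLayers: Int) {
    let main = cache ?? [ToyCache(), ToyCache()]
    let draft = draftCache ?? [ToyCache()]
    // ... generation would run here, reusing both caches ...
    return (main.count, draft.count)
}
```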
```swift
@discardableResult
public func trimPromptCache(_ cache: [KVCache], numTokens: Int) -> Int {
    guard canTrimPromptCache(cache), !cache.isEmpty else { return 0 }
    cache.dropFirst().forEach { $0.trim(numTokens) }
```
This is curious -- perhaps we were not using cache trimming in the past but this looks important.
Yeah, I think it went unnoticed because this function was never used internally, but since it's public it's definitely important.
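A minimal sketch of the fixed behavior, using a toy cache type in place of the real `KVCache` protocol: every layer is trimmed, and the returned count is taken from the first layer, mirroring mlx-lm's `trim_prompt_cache`. The `ToyKVCache` type and the exact return convention are assumptions for illustration:

```swift
/// Simplified stand-in for a trimmable KV cache layer.
final class ToyKVCache {
    private(set) var offset: Int
    let isTrimmable: Bool

    init(offset: Int, isTrimmable: Bool = true) {
        self.offset = offset
        self.isTrimmable = isTrimmable
    }

    /// Trim up to `n` tokens; returns how many were actually trimmed.
    @discardableResult
    func trim(_ n: Int) -> Int {
        guard isTrimmable else { return 0 }
        let trimmed = min(n, offset)
        offset -= trimmed
        return trimmed
    }
}

/// A cache is trimmable only if every layer supports trimming
/// (a Mamba-style state cache, for example, would not).
func canTrimPromptCache(_ cache: [ToyKVCache]) -> Bool {
    cache.allSatisfy { $0.isTrimmable }
}

/// Trim ALL layers (the bug was trimming only some of them), returning
/// the number of tokens trimmed from the first layer.
@discardableResult
func trimPromptCache(_ cache: [ToyKVCache], numTokens: Int) -> Int {
    guard canTrimPromptCache(cache), !cache.isEmpty else { return 0 }
    let trimmed = cache[0].trim(numTokens)
    cache.dropFirst().forEach { $0.trim(numTokens) }
    return trimmed
}
```

Trimming every layer is what lets speculative decoding rewind rejected draft tokens: after a partial acceptance, all layers must roll back by the same amount or the cache layers fall out of sync.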
davidkoski left a comment
This looks good. I filed #181 to track adding support in ChatSession.
* Add speculative decoding
Proposed changes
Add speculative decoding support, based on the approach in mlx-lm.
A new `SpeculativeTokenIterator` uses a smaller draft model to propose tokens that are verified in batch by the main model, yielding accepted tokens one at a time. Both `generate()` and `generateTokens()` get overloads accepting a draft model. This also extracts a `TokenIteratorProtocol` so the shared generation loop works with both iterator types.

Bug fix: `trimPromptCache` previously only trimmed the first cache layer. This PR fixes it to trim all layers, which is required for speculative decoding to correctly rewind rejected tokens.

Limitations
Speculative decoding requires trimmable KV caches to rewind rejected tokens. Models that use `MambaCache` are not supported and will throw an error. A possible solution would be to snapshot the cache state before drafting and restore it on rejection, but this adds memory overhead and complexity that felt out of scope for this PR.

Benchmarks
Benchmarked on an M3 Max using a short text translation prompt (~150 tokens) generating ~130 tokens, with 2 draft tokens for speculative generation. Using more than 2 draft tokens makes results worse in most cases: the draft model generates further ahead, divergence from the main model lowers the acceptance rate, and the drafting overhead ends up making generation slower.
The larger the gap between the main and draft model, the greater the benefit. The 32B model sees the biggest speedup (+79%) since its baseline generation is slow and there is more room for the draft model to help. Prompt processing speed stays largely unaffected, and the memory overhead from the 0.6B draft model is minimal in all cases (under 400MB).
Checklist
Put an `x` in the boxes that apply.

- [x] I have run `pre-commit run --all-files` to format my code / installed pre-commit prior to committing changes