Support KV cache quantization with continuous batching#941

Open
ochafik wants to merge 1 commit into ml-explore:main from ochafik:batch-kv-quantization
Conversation

@ochafik ochafik commented Mar 2, 2026

Summary

  • Add QuantizedKVCache.merge() — dequantizes to float, delegates to BatchKVCache.merge(), enabling quantized caches to participate in batch generation
  • Add quantize_config parameter to BatchKVCache.extract() — extracted caches can be re-quantized as QuantizedKVCache for memory-efficient LRU storage
  • Wire kv_bits/kv_group_size through BatchGenerator so callers can opt into quantized cache storage
  • Add QuantizedKVCache.size() for consistency with other cache types

Currently QuantizedKVCache and BatchGenerator are mutually exclusive (the is_batchable guard rejects kv_bits). This PR adds the missing merge() support so they can work together. The approach keeps batch computation in float (short-lived, needed for padding/concatenation) and only quantizes on extract (long-lived LRU storage), which is where memory savings matter most.
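The dequantize-then-delegate pattern described above can be sketched in plain Python. This is a toy stand-in, not the mlx-lm implementation: the per-group affine quantizer below is written from scratch for illustration, and the `QuantizedKVCache` / `BatchKVCache` classes are minimal mock-ups of the real cache types.

```python
# Toy sketch of the "dequantize to float, delegate to the batch cache" merge
# described in the PR summary. The quantizer is a simple per-group affine
# int quantizer, NOT the mlx quantization kernel, and the cache classes are
# illustrative stand-ins for the real mlx-lm types.

def quantize(values, bits=4, group_size=4):
    """Affine-quantize a flat list of floats, one (scale, zero) per group."""
    levels = (1 << bits) - 1
    groups = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = [round((v - lo) / scale) for v in group]
        groups.append((q, scale, lo))
    return groups

def dequantize(groups):
    """Reconstruct approximate floats from quantized groups."""
    out = []
    for q, scale, lo in groups:
        out.extend(v * scale + lo for v in q)
    return out

class BatchKVCache:
    """Float batch cache; merge appends one float entry per sequence."""
    def __init__(self):
        self.entries = []  # one float list per merged sequence

    def merge(self, float_values):
        self.entries.append(list(float_values))

class QuantizedKVCache:
    """Stores quantized groups; merge dequantizes, then delegates."""
    def __init__(self, float_values, bits=4, group_size=4):
        self.groups = quantize(float_values, bits, group_size)

    def merge(self, batch_cache):
        # Dequantize to float and hand off to the float batch cache,
        # mirroring QuantizedKVCache.merge() -> BatchKVCache.merge().
        batch_cache.merge(dequantize(self.groups))
```

The key property is that the batch cache only ever sees floats, so padding and concatenation logic stays unchanged; quantization error is bounded by half a quantization step per value.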

Test plan

  • 5 new tests covering merge, extract with quantization, roundtrip, empty cache merge, and size()
  • All 25 tests pass (tests/test_prompt_cache.py)
  • No changes to existing test behavior

🤖 Generated with Claude Code

Add QuantizedKVCache.merge() which dequantizes to float and delegates
to BatchKVCache.merge(), enabling quantized caches to participate in
batch generation. Add quantize_config parameter to BatchKVCache.extract()
so extracted caches can be re-quantized for memory-efficient LRU storage.

Wire kv_bits/kv_group_size through BatchGenerator so callers can opt into
quantized cache storage while using float batch computation.
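The extract-with-requantization flow from the commit message can also be sketched with toy classes. The `quantize_config` shape, the `extract()` signature, and the 8-bit quantizer here are illustrative assumptions, not the actual mlx-lm API: the point is only that batch computation stays in float, and quantization happens once, at extraction time, for long-lived storage.

```python
# Hypothetical sketch: float batch state during decode, optional
# re-quantization on extract() for LRU storage. All names and the
# quantize_config dict shape are stand-ins, not the real mlx-lm API.

class FloatCache:
    """Default extraction result: plain float values."""
    def __init__(self, values):
        self.values = list(values)

class ToyQuantizedCache:
    """Per-tensor affine int quantization; stores ints plus (scale, zero)."""
    def __init__(self, values, bits):
        levels = (1 << bits) - 1
        lo, hi = min(values), max(values)
        self.scale = (hi - lo) / levels if hi > lo else 1.0
        self.zero = lo
        self.q = [round((v - lo) / self.scale) for v in values]

    def dequantize(self):
        return [v * self.scale + self.zero for v in self.q]

class ToyBatchCache:
    """Holds float state for several sequences during batch decode."""
    def __init__(self, per_sequence_values):
        self.per_sequence_values = [list(v) for v in per_sequence_values]

    def extract(self, index, quantize_config=None):
        """Pull one sequence out; optionally re-quantize it for storage."""
        values = self.per_sequence_values[index]
        if quantize_config is None:
            return FloatCache(values)  # default path: stay in float
        return ToyQuantizedCache(values, bits=quantize_config["bits"])
```

Quantizing only on extract keeps the hot decode path in float while the cold, long-lived LRU entries pay the (small) dequantization cost only when they are reused.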

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@angeloskath (Member)

This is interesting albeit a bit weird.

Basically, all the computation is done in fp16 but the storage is done quantized. This could be useful, but generally speaking there should be a BatchQuantizedKVCache that also does the computation quantized: a) so that performance doesn't degrade after every turn, and b) so that long context (largely memory bound) is faster.
