Support KV cache quantization with continuous batching#941

Open
ochafik wants to merge 1 commit into ml-explore:main from ochafik:batch-kv-quantization
Conversation

@ochafik ochafik commented Mar 2, 2026

Summary

  • Add QuantizedKVCache.merge() — dequantizes to float, delegates to BatchKVCache.merge(), enabling quantized caches to participate in batch generation
  • Add quantize_config parameter to BatchKVCache.extract() — extracted caches can be re-quantized as QuantizedKVCache for memory-efficient LRU storage
  • Wire kv_bits/kv_group_size through BatchGenerator so callers can opt into quantized cache storage
  • Add QuantizedKVCache.size() for consistency with other cache types

Currently QuantizedKVCache and BatchGenerator are mutually exclusive (the is_batchable guard rejects kv_bits). This PR adds the missing merge() support so they can work together. The approach keeps batch computation in float (short-lived, needed for padding/concatenation) and only quantizes on extract (long-lived LRU storage), which is where memory savings matter most.
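The dequantize-then-delegate pattern described above can be sketched in plain Python. This is a toy stand-in, not the mlx-lm implementation: the per-group affine quantizer below is written from scratch for illustration, and the `QuantizedKVCache` / `BatchKVCache` classes are minimal mock-ups of the real cache types.

```python
# Toy sketch of the "dequantize to float, delegate to the batch cache" merge
# described in the PR summary. The quantizer is a simple per-group affine
# int quantizer, NOT the mlx quantization kernel, and the cache classes are
# illustrative stand-ins for the real mlx-lm types.

def quantize(values, bits=4, group_size=4):
    """Affine-quantize a flat list of floats, one (scale, zero) per group."""
    levels = (1 << bits) - 1
    groups = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels if hi > lo else 1.0
        q = [round((v - lo) / scale) for v in group]
        groups.append((q, scale, lo))
    return groups

def dequantize(groups):
    """Reconstruct approximate floats from quantized groups."""
    out = []
    for q, scale, lo in groups:
        out.extend(v * scale + lo for v in q)
    return out

class BatchKVCache:
    """Float batch cache; merge appends one float entry per sequence."""
    def __init__(self):
        self.entries = []  # one float list per merged sequence

    def merge(self, float_values):
        self.entries.append(list(float_values))

class QuantizedKVCache:
    """Stores quantized groups; merge dequantizes, then delegates."""
    def __init__(self, float_values, bits=4, group_size=4):
        self.groups = quantize(float_values, bits, group_size)

    def merge(self, batch_cache):
        # Dequantize to float and hand off to the float batch cache,
        # mirroring QuantizedKVCache.merge() -> BatchKVCache.merge().
        batch_cache.merge(dequantize(self.groups))
```

The key property is that the batch cache only ever sees floats, so padding and concatenation logic stays unchanged; quantization error is bounded by half a quantization step per value.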

Test plan

  • 5 new tests covering merge, extract with quantization, roundtrip, empty cache merge, and size()
  • All 25 tests pass (tests/test_prompt_cache.py)
  • No changes to existing test behavior

🤖 Generated with Claude Code

Add QuantizedKVCache.merge() which dequantizes to float and delegates
to BatchKVCache.merge(), enabling quantized caches to participate in
batch generation. Add quantize_config parameter to BatchKVCache.extract()
so extracted caches can be re-quantized for memory-efficient LRU storage.

Wire kv_bits/kv_group_size through BatchGenerator so callers can opt into
quantized cache storage while using float batch computation.
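The extract-with-requantization flow from the commit message can also be sketched with toy classes. The `quantize_config` shape, the `extract()` signature, and the 8-bit quantizer here are illustrative assumptions, not the actual mlx-lm API: the point is only that batch computation stays in float, and quantization happens once, at extraction time, for long-lived storage.

```python
# Hypothetical sketch: float batch state during decode, optional
# re-quantization on extract() for LRU storage. All names and the
# quantize_config dict shape are stand-ins, not the real mlx-lm API.

class FloatCache:
    """Default extraction result: plain float values."""
    def __init__(self, values):
        self.values = list(values)

class ToyQuantizedCache:
    """Per-tensor affine int quantization; stores ints plus (scale, zero)."""
    def __init__(self, values, bits):
        levels = (1 << bits) - 1
        lo, hi = min(values), max(values)
        self.scale = (hi - lo) / levels if hi > lo else 1.0
        self.zero = lo
        self.q = [round((v - lo) / self.scale) for v in values]

    def dequantize(self):
        return [v * self.scale + self.zero for v in self.q]

class ToyBatchCache:
    """Holds float state for several sequences during batch decode."""
    def __init__(self, per_sequence_values):
        self.per_sequence_values = [list(v) for v in per_sequence_values]

    def extract(self, index, quantize_config=None):
        """Pull one sequence out; optionally re-quantize it for storage."""
        values = self.per_sequence_values[index]
        if quantize_config is None:
            return FloatCache(values)  # default path: stay in float
        return ToyQuantizedCache(values, bits=quantize_config["bits"])
```

Quantizing only on extract keeps the hot decode path in float while the cold, long-lived LRU entries pay the (small) dequantization cost only when they are reused.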

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@angeloskath (Member)

This is interesting albeit a bit weird.

Basically, all the computation is done in fp16 but the storage is done quantized. This could be useful, but generally speaking there should be a BatchQuantizedKVCache that also does the computation quantized: a) so that performance doesn't degrade after every turn, and b) so that long context (largely memory bound) is faster.
