
Hybrid cache for Qwen3.5 #923

Open

dexloom wants to merge 5 commits into ml-explore:main from dexloom:hybrid_cache

Conversation


@dexloom dexloom commented Feb 23, 2026

Summary

This pull request adds new features, refactoring, and fixes to `LRUPromptCache`, focused on branching conversation histories and prompt chaining.

Changes Made

  • Checkpointing: Periodic snapshots added to preserve cache states for branching use cases.
  • Boundary Caching: Cache entries can now be stored at message boundaries for shared cache usage.
  • In-Place Updates: LRU strategy refactored to reduce memory overhead with mutable cache objects.
  • Cache Persistence: Fixes to ensure reused prefix prompts remain in the cache for hybrid models.
  • Cache Key Separation: Improved handling of cache keys for accurate prompt chaining.

Checklist

  • Unit tests have been added/updated
  • Documentation has been updated (if applicable)
  • Changes match coding standards and guidelines

Commits

Modify server tokenization logic to distinguish between the prompt used for generation and the key used for cache lookup.

- Add `_tokenize_for_cache_key` which applies chat templates without the
  generation suffix. This ensures KV cache lookups match on message content
  only, fixing prompt chaining issues where the suffix incorrectly altered
  the cache key.
- Update batch and non-batch generation flows to fetch cache using the clean
  key, then append the generation suffix to the remaining tokens.
- Fix `ArraysCache.nbytes` to check for `None` entries before summing bytes,
  preventing potential errors during size calculation.
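
The cache-key separation above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the function names mirror the description, and `GEN_SUFFIX` is a hypothetical stand-in for the tokens the chat template appends when a generation prompt is requested.

```python
# Stand-in ids for the generation suffix (e.g. "<|im_start|>assistant\n").
GEN_SUFFIX = [900, 901]

def tokenize_for_cache_key(message_ids):
    """The cache key depends on message content only, so chained
    prompts that share messages hit the same cache entry."""
    return list(message_ids)

def tokenize_for_generation(message_ids):
    """The generation prompt is the clean key plus the suffix."""
    return list(message_ids) + GEN_SUFFIX

def remaining_tokens(message_ids, cached_len):
    """Fetch the cache with the clean key, then append the generation
    suffix to whatever tokens still need to be processed."""
    key = tokenize_for_cache_key(message_ids)
    return key[cached_len:] + GEN_SUFFIX
```

Because the suffix is appended after the lookup, a follow-up request whose messages extend a previous conversation matches the cached prefix even though its full generation prompt differs.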

Previously, extracting a cache entry removed it from the LRU, preventing multiple requests from reusing the same cached prefix.

Update `LRUPromptCache._extract` to accept a `keep_original` flag. When enabled for shorter prefix matches, the method returns a deep copy of the cache without deleting the original entry. This ensures the cached prompt remains available for subsequent requests, supporting hybrid model prompt chaining.

Add `test_cache_persistence` to verify that cached prefixes persist and are reused across multiple requests.
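
A minimal sketch of the `keep_original` behavior, assuming a dict-backed LRU keyed by token tuples (the class body here is illustrative, not the PR's implementation):

```python
import copy
from collections import OrderedDict

class LRUPromptCache:
    def __init__(self):
        self._entries = OrderedDict()  # key -> cache object

    def insert(self, key, cache):
        self._entries[key] = cache
        self._entries.move_to_end(key)  # mark as most recently used

    def _extract(self, key, keep_original=False):
        cache = self._entries[key]
        if keep_original:
            # Return a deep copy and keep the entry, so subsequent
            # requests can still reuse the cached prefix.
            self._entries.move_to_end(key)
            return copy.deepcopy(cache)
        # Default: remove the entry and hand over the original.
        return self._entries.pop(key)
```

With `keep_original=True`, mutating the returned copy leaves the stored entry untouched, which is what allows shorter prefix matches to be served repeatedly.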

Update LRUPromptCache to store cache entries at message boundaries (e.g., after system or user messages) in addition to the full prompt sequence. This allows the cache to be shared when conversations branch, improving efficiency for multi-turn dialogs.

- Modify `insert_cache` to accept optional `boundary_positions` list.
- Add `_insert_boundary_cache` helper to store references to shared cache objects at specific token indices.
- Add `_find_cache_boundaries` in `ResponseGenerator` to detect message delimiters like `<|im_end|>` across different tokenizers.
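
Boundary detection can be sketched as a scan for the delimiter token; `<|im_end|>` is the delimiter Qwen-style chat templates use, and other tokenizers would supply a different id. The function name follows the PR description but the body is illustrative.

```python
def find_cache_boundaries(token_ids, im_end_id):
    """Return the positions just past each message delimiter.

    The cache can store shared entries at these indices, so two
    conversations that branch after the same message reuse the
    common prefix instead of recomputing it."""
    return [i + 1 for i, t in enumerate(token_ids) if t == im_end_id]
```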

Modify the LRU cache strategy to return references instead of copies, reducing memory overhead for hybrid models and prompt chaining.

- Remove `deepcopy` in `_extract` to allow cache objects to be mutated in place.
- Update `fetch_nearest_cache` to return the matched token position, enabling cache migration.
- Extend `insert_cache` with `old_position` to move cache entries rather than duplicating them.
- Dynamically update `nbytes` when overwriting existing cache entries.
- Add debug logging for cache operations.
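
The reference-returning strategy might look roughly like this. Names follow the PR description (`fetch_nearest_cache`, `insert_cache`, `old_position`, `nbytes`); the bodies are a simplified sketch over tuple keys, not the actual implementation.

```python
from collections import OrderedDict

class LRUPromptCache:
    def __init__(self):
        self._entries = OrderedDict()  # key -> (cache, nbytes)
        self.nbytes = 0  # running total of cached bytes

    def fetch_nearest_cache(self, key):
        """Return (cache, matched_position) for the longest cached
        prefix of `key`, or (None, 0) on a miss. The cache object is
        a reference, so generation can mutate it in place."""
        for n in range(len(key), 0, -1):
            hit = self._entries.get(key[:n])
            if hit is not None:
                return hit[0], n
        return None, 0

    def insert_cache(self, key, cache, nbytes, old_position=None):
        if old_position is not None and key[:old_position] in self._entries:
            # Migrate the entry to the longer key instead of
            # duplicating the underlying cache object.
            _, old_bytes = self._entries.pop(key[:old_position])
            self.nbytes -= old_bytes
        if key in self._entries:
            # Overwriting: keep the running byte count accurate.
            self.nbytes -= self._entries[key][1]
        self._entries[key] = (cache, nbytes)
        self.nbytes += nbytes
```

Dropping the deep copy trades isolation for memory: a hit returns the live object, so the snapshot mechanism in the next commit is what protects shared state when conversations branch.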

Add support for creating periodic snapshots of the prompt cache to facilitate branching conversation histories.

- Introduced `is_snapshot` attribute to `CacheEntry` to distinguish mutable cache entries from immutable snapshots.
- Added `checkpoint_interval` (default 8192) to `__init__` to specify snapshot frequency.
- Implemented `_find_checkpoint_positions` to place snapshots at logical message boundaries near the interval.
- Modified lookup logic to extract copies from snapshots (preventing shared state mutation) while preserving in-place updates for linear extensions.
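
Placing checkpoints at message boundaries near each interval could be sketched as below. The function name mirrors `_find_checkpoint_positions` from the description; the boundary-snapping rule (first boundary at or past each multiple of the interval) is an assumption for illustration.

```python
def find_checkpoint_positions(boundaries, interval=8192):
    """Pick one message boundary at or past each multiple of
    `interval`, so snapshots land on logical message breaks rather
    than mid-message token positions."""
    positions, target = [], interval
    for b in boundaries:  # boundaries assumed sorted ascending
        if b >= target:
            positions.append(b)
            target += interval
    return positions
```

Entries at these positions would be marked `is_snapshot=True` and treated as immutable: a lookup that lands on one extracts a copy, while the live head of a linear conversation keeps being extended in place.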
