feat(server): opt-in disk-backed L2 prompt cache (--prompt-cache-disk-dir) #1218
Open
freddyhaddad wants to merge 30 commits into ml-explore:main from
Conversation
Adds stable 16-hex-char content hash for token sequences (hash_tokens) and a per-model identity hash over weights manifest + tokenizer files (compute_model_id), with full TDD coverage (8 tests, all passing). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
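For reference, a minimal sketch of what a stable 16-hex-char token hash could look like (assuming SHA-256 truncated to 64 bits; the PR's exact byte encoding may differ):

```python
import hashlib
from typing import Sequence

def hash_tokens_sketch(tokens: Sequence[int]) -> str:
    """Stable 16-hex-char digest of a token sequence (64 bits of SHA-256)."""
    h = hashlib.sha256()
    for t in tokens:
        # Fixed-width little-endian encoding keeps the digest stable across runs.
        h.update(int(t).to_bytes(8, "little", signed=False))
    return h.hexdigest()[:16]
```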
…cement Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the WriteJob dataclass for queuing pending disk writes, plus atomic write/read-metadata helpers backed by save_prompt_cache and mx.load. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
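A rough sketch of the queuing structure described here; field names beyond those mentioned in the commits are illustrative assumptions, not the PR's exact definition:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, List

@dataclass
class WriteJobSketch:
    """Illustrative shape of a pending disk write."""
    token_hash: str                 # content hash of the cached prefix
    tokens: List[int]               # prompt tokens covered by this entry
    kv_state: Any                   # per-layer KV arrays to persist
    dest_path: Path                 # final .safetensors destination
    parents_to_evict: List[Path] = field(default_factory=list)  # shorter redundant entries
```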
…trie reconstruction) Adds DiskPromptCache class with __init__ that acquires dir flock, validates/writes format-version, computes model_id, creates per-model subdir + entries/, writes info.json, and reconstructs in-memory trie via reconstruct_disk_trie. Includes minimal shutdown() stub that releases the lock (full impl deferred to Task 14). 4 new tests, 24 total passing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
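A minimal sketch of the directory-lock step, assuming an exclusive non-blocking `fcntl.flock` on a lock file inside the cache dir (the lock-file name and error handling are assumptions):

```python
import fcntl
from pathlib import Path

def acquire_dir_lock(cache_dir: Path):
    """Take an exclusive, non-blocking flock so only one server owns the cache dir."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    lock_handle = open(cache_dir / ".lock", "w")
    try:
        fcntl.flock(lock_handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_handle.close()
        raise RuntimeError(f"another process already owns the cache dir: {cache_dir}")
    return lock_handle  # must stay open for the lifetime of the process
```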
Add DiskPromptCache.start() (spawns daemon writer + touch threads), touch_async() (sync in-memory mtime update + async os.utime), _touch_loop() (background utime worker), and _writer_loop() placeholder (real impl in Task 12). 3 new tests in TestDiskPromptCacheTouch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
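A simplified sketch of the touch path, assuming a plain `queue.Queue` feeding the background thread (the PR's internals may differ):

```python
import os
import queue
import threading
import time

class TouchSketch:
    """Sync in-memory mtime update, async os.utime in a daemon thread."""

    def __init__(self):
        self._q: "queue.Queue[str]" = queue.Queue()
        threading.Thread(target=self._touch_loop, daemon=True).start()

    def touch_async(self, leaf, path: str) -> None:
        leaf.mtime = time.time()   # cheap update on the request thread
        self._q.put(path)          # defer the utime syscall

    def _touch_loop(self) -> None:
        while True:
            path = self._q.get()
            try:
                os.utime(path)     # bump on-disk mtime used for LRU eviction
            except OSError:
                pass               # file may have been evicted meanwhile
```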
Replace placeholder _writer_loop with real implementation: atomic write, trie/counters update under lock, dominated-parent deletion, and eviction stub trigger. Add enqueue_write, _handle_write_job, _run_eviction_pass stub, and _handle_write_error stub. 3 new TestWriterLoop tests (42 total). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the _handle_write_error stub with full logic:
- ENOSPC: temporarily caps max_bytes at current usage and sets headroom to 0.80, runs _run_eviction_pass to free space, then retries the write once; if retry also fails, logs and drops the entry.
- EACCES / PermissionError: sets _writes_disabled=True so the session degrades silently to RAM-only cache.
- Other OSError: logs and drops the single entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
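The error policy, sketched with standard `errno` checks (attribute and helper names here are illustrative):

```python
import errno
import logging

def handle_write_error_sketch(exc: OSError, cache, retry_write, run_eviction_pass) -> None:
    """Illustrative version of the policy described in the commit message."""
    if isinstance(exc, PermissionError) or exc.errno == errno.EACCES:
        cache.writes_disabled = True           # degrade to RAM-only for this session
    elif exc.errno == errno.ENOSPC:
        cache.max_bytes = cache.current_bytes  # cap the budget at current usage
        cache.eviction_headroom = 0.80
        run_eviction_pass()                    # free space, then retry once
        if not retry_write():
            logging.warning("disk cache: write failed again after ENOSPC; dropping entry")
    else:
        logging.warning("disk cache: dropping entry after OSError: %s", exc)
```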
Hook insert_cache to write through to the disk tier when self.disk is set; computes path-deepest dominator prefixes for trimmable caches and passes parents_to_evict so shorter redundant entries are deleted. Disk-tier failures are caught and logged — never propagated to callers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds 6 new argparse arguments for the disk-backed prompt cache tier: --prompt-cache-disk-dir, --prompt-cache-disk-bytes, --prompt-cache-disk-fsync, --prompt-cache-disk-write-queue-size, --prompt-cache-disk-warm, and --prompt-cache-disk-eviction-headroom. Purely additive; feature is opt-in via --prompt-cache-disk-dir. No wiring to cache instantiation yet (Task 20). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
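A sketch of how these six flags might be declared; the defaults and help strings below are placeholders, not necessarily the PR's values:

```python
import argparse

# Placeholder defaults and help strings for the six flags named above.
parser = argparse.ArgumentParser()
parser.add_argument("--prompt-cache-disk-dir", default=None,
                    help="directory for the disk-backed L2 prompt cache (opt-in)")
parser.add_argument("--prompt-cache-disk-bytes", default=None,
                    help="maximum on-disk cache size")
parser.add_argument("--prompt-cache-disk-fsync", action="store_true",
                    help="fsync cache files after writing")
parser.add_argument("--prompt-cache-disk-write-queue-size", type=int, default=None,
                    help="capacity of the background write queue")
parser.add_argument("--prompt-cache-disk-warm", default="lazy",
                    help="startup warm strategy")
parser.add_argument("--prompt-cache-disk-eviction-headroom", type=float, default=None,
                    help="eviction headroom fraction")
```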
Install signal handlers inside run() after disk_cache.start() so that SIGTERM and SIGINT trigger disk_cache.shutdown(timeout=30.0) before the process exits, releasing the flock and draining the write queue cleanly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
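A minimal sketch of that wiring, assuming `disk_cache.shutdown(timeout=...)` as described:

```python
import signal
import sys

def install_shutdown_handlers(disk_cache) -> None:
    """Drain the write queue and release the flock on SIGTERM/SIGINT."""
    def _handler(signum, frame):
        disk_cache.shutdown(timeout=30.0)  # drain pending writes, release the lock
        sys.exit(0)

    signal.signal(signal.SIGTERM, _handler)
    signal.signal(signal.SIGINT, _handler)
```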
Force mx.eval on every layer's KV state in insert_cache before handing the WriteJob to the background writer thread. The writer thread lacks a GPU stream context, causing RuntimeError: There is no Stream(gpu, 0) in current thread when it triggers lazy evaluation via save_safetensors. Evaluation in the request-handling thread (which has a stream context) avoids the crash. Adds regression test that verifies un-evaluated arrays survive the write path correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
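An illustrative version of the fix, assuming the per-layer cache objects expose their arrays via a `.state` attribute as in mlx_lm's cache classes:

```python
import mlx.core as mx

def materialize_kv(prompt_cache) -> None:
    """Evaluate each layer's KV arrays on the request thread (which has a stream)."""
    for layer in prompt_cache:
        mx.eval(layer.state)  # avoids lazy GPU evaluation inside the background writer
```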
Add mlx_lm/cache_admin.py skeleton with stats and list subcommands that read disk-cache directories without importing model machinery, plus tests/test_cache_admin.py covering the two commands and the missing-format-version error path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
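A hypothetical core for the `stats` subcommand, relying only on the on-disk layout described earlier (per-model subdir with `info.json` and `entries/`):

```python
from pathlib import Path

def cache_stats_sketch(cache_dir: str) -> dict:
    """Summarize a disk-cache directory without importing any model machinery."""
    root = Path(cache_dir)
    stats = {"models": 0, "entries": 0, "bytes": 0}
    for info in root.glob("*/info.json"):          # one info.json per model namespace
        stats["models"] += 1
        for entry in (info.parent / "entries").glob("*.safetensors"):
            stats["entries"] += 1
            stats["bytes"] += entry.stat().st_size
    return stats
```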
Adds benchmarks/disk_cache_overhead.py which measures fetch_nearest_cache latency with disk=None vs disk=on on repeat exact-RAM-hit traffic. Also fixes three performance regressions discovered during benchmarking:
- Skip disk.search() on exact RAM hits (disk can't beat quality 3)
- Add DiskPromptCache.search_and_touch() to combine search + mtime update in one lock acquisition instead of two
- Add _hash_to_leaf: Dict[str, DiskLeaf] for O(1) mtime updates in touch_async, replacing the prior O(N-tokens) trie.get walk
- Pre-compute and store token_hash in LRUPromptCache.CacheEntry so fetch_nearest_cache never recomputes SHA256 on the hot no-miss path
Result: 4.1% overhead (within the <5% budget) vs 58.8% before fixes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
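The `_hash_to_leaf` idea in miniature (the `LeafSketch` stand-in is illustrative, not the PR's `DiskLeaf`):

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class LeafSketch:            # stand-in for the PR's DiskLeaf
    mtime: float = 0.0

_hash_to_leaf: Dict[str, LeafSketch] = {}

def touch_by_hash(token_hash: str, now: float) -> None:
    """O(1) in-memory mtime bump, replacing an O(N-tokens) trie walk."""
    leaf = _hash_to_leaf.get(token_hash)
    if leaf is not None:
        leaf.mtime = now
```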
The 5% threshold was tighter than the noise floor for these timings. Absolute overhead is ~1 µs/op which is dominated by Python interpreter + scheduler noise. Using min-of-runs per side filters noise, and the 10% threshold still catches any real regression (which would be much larger than 1 µs). Real-world runtime is unaffected because decode latency is GPU-bound and dwarfs cache-lookup overhead.
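A sketch of the min-of-runs comparison (function names and iteration counts are illustrative):

```python
import time

def min_of_runs(fn, runs: int = 5, iters: int = 10_000) -> float:
    """Best per-op latency over several runs; taking the min filters scheduler noise."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        best = min(best, (time.perf_counter() - start) / iters)
    return best

def within_budget(baseline_fn, disk_fn, threshold: float = 0.10) -> bool:
    base = min_of_runs(baseline_fn)
    disk = min_of_runs(disk_fn)
    return (disk - base) / base <= threshold
```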
Summary
Adds an opt-in disk-backed L2 prompt cache to `mlx_lm.server`. With `--prompt-cache-disk-dir` set, every entry the in-memory `LRUPromptCache` holds is also persisted via write-through async writes to a per-model namespace on disk. Cached prefixes survive process restarts and runtime LRU evictions. Off by default — no flag means byte-identical behavior to the pre-PR baseline.
The `LRUPromptCache(max_size, max_bytes)` constructor signature is unchanged (the new `disk=None` parameter is keyword-only). All existing tests pass unchanged.
Motivation
Long-lived `mlx_lm.server` deployments (agentic CLIs, OpenAI-compatible API gateways with large system prompts) lose their entire prompt cache on every process exit. For a ~26K-token agentic prompt on Apple Silicon, that's ~200 seconds of full-prefill latency on the first request after every restart.
Real measurements on a Mac mini M-series running kimi-k2.6 (3.3-bit MLX):
End-to-end verified with `mlx-community/Qwen1.5-0.5B-Chat-4bit`: a fresh server process serves 45 of 46 prompt tokens (98%) from disk on the first request after a kill+restart cycle. See `tests/test_disk_prompt_cache_e2e.py::test_restart_serves_from_disk`.
What it does
- New `mlx_lm/disk_prompt_cache.py` with a `DiskPromptCache` class — flock-protected dir, atomic-rename writes, single background writer thread, mtime-based LRU eviction.
- `LRUPromptCache.__init__` accepts an optional `disk: DiskPromptCache`; `insert_cache` and `fetch_nearest_cache` get two-tier hooks (wiring sketched below).
- Six new server flags (`--prompt-cache-disk-dir`, `--prompt-cache-disk-bytes`, `--prompt-cache-disk-fsync`, `--prompt-cache-disk-write-queue-size`, `--prompt-cache-disk-warm`, `--prompt-cache-disk-eviction-headroom`).
- New `mlx_lm/cache_admin.py` — inspection/pruning CLI (`stats`, `list`, `prune`, `verify`, `remove` subcommands).
- Example at `examples/disk_prompt_cache.py`, plus a maintainer-facing design doc at `docs/disk_prompt_cache.md`.
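A minimal sketch of the two-tier wiring; the import paths and constructor arguments shown are assumptions, not the PR's exact API:

```python
# Illustrative only: import paths and DiskPromptCache arguments are assumptions.
from mlx_lm.disk_prompt_cache import DiskPromptCache
from mlx_lm.server import LRUPromptCache

disk = DiskPromptCache("~/mlx-disk-cache", max_bytes=100 * 2**30)
disk.start()                                             # spawn writer + touch threads
cache = LRUPromptCache(max_size=8, max_bytes=4 * 2**30, disk=disk)  # disk is keyword-only
```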
Design highlights
- Model identity: hash over `model.safetensors.index.json` + tokenizer files + chat template + `mlx_lm.__version__` major. Switching weights or tokenizers gives a clean separate namespace.
- Shorter prefixes are served from longer entries via `trim_prompt_cache`. ~10× density vs flat per-leaf storage.
- `Future` dedup — concurrent requests for the same disk entry share one read.
- Atomic writes (sketched below): `.tmp.safetensors` → `os.rename` → `.safetensors`. A crash never leaves a half-written file readable; orphan `.tmp.safetensors` files are cleaned up at next startup.
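The atomic-publish pattern in miniature (the helper name and temp-file naming are illustrative):

```python
import os
from pathlib import Path

def atomic_publish(write_fn, final_path: Path) -> None:
    """Write to a sibling .tmp.safetensors file, then rename into place."""
    tmp_path = final_path.with_name(final_path.stem + ".tmp.safetensors")
    write_fn(str(tmp_path))           # e.g. a save_prompt_cache-style writer
    os.rename(tmp_path, final_path)   # atomic on the same filesystem; readers never see partial data
```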
Tests
- `tests/test_disk_prompt_cache.py` — 57 unit tests, no model load, runs in ~7s
- `tests/test_cache_admin.py` — 7 unit tests
- `tests/test_disk_prompt_cache_e2e.py` — 3 real-server integration tests (skip via `MLX_LM_E2E_SKIP=1`)
- `benchmarks/disk_cache_overhead.py` — overhead vs RAM-only on the no-miss hot path; measured ~5–7% on Apple Silicon, threshold 10%
Non-goals (deliberately out of scope)
- Multi-process sharing of a cache dir (one `mlx_lm.server` per disk dir, enforced)
- `--prompt-cache-disk-warm=eager-top-N` (currently a documented placeholder; lazy is the default)
Known issue
`mlx_lm.server`'s SIGTERM/SIGINT handler does not always exit the process cleanly within the 30s drain timeout when the disk cache is active — `httpd.serve_forever()` blocks the main thread while the signal handler tries to `sys.exit(0)`. The SIGKILL fallback still leaves disk state fully consistent thanks to the atomic-rename invariant (verified by `test_sigkill_leaves_consistent_state`), but the daemon currently needs a hard kill rather than a graceful exit. Fix planned as a small follow-up commit (capture `httpd` in the handler closure, call `httpd.shutdown()` before `disk_cache.shutdown()`).
How to try it
```
mlx_lm.server \
  --model <your-model> \
  --prompt-cache-disk-dir ~/mlx-disk-cache \
  --prompt-cache-disk-bytes 100GB
```
Send a request, kill the server, restart with the same flags, send the same request — the second response will skip prefill on the cached portion.
```
python -m mlx_lm.cache_admin stats ~/mlx-disk-cache
```
🤖 Generated with Claude Code