
feat(server): opt-in disk-backed L2 prompt cache (--prompt-cache-disk-dir) #1218

Open

freddyhaddad wants to merge 30 commits into ml-explore:main from freddyhaddad:disk-prompt-cache

Conversation

@freddyhaddad

Summary

Adds an opt-in disk-backed L2 prompt cache to mlx_lm.server. With --prompt-cache-disk-dir set, every entry the in-memory LRUPromptCache holds is also persisted via write-through async writes to a per-model namespace on disk. Cached prefixes survive process restarts and runtime LRU evictions.

Off by default — no flag = byte-identical behavior to the pre-PR baseline. LRUPromptCache(max_size, max_bytes) constructor signature unchanged (new disk=None is keyword-only). All existing tests pass unchanged.

Motivation

Long-lived mlx_lm.server deployments (agentic CLIs, OpenAI-compatible API gateways with large system prompts) lose their entire prompt cache on every process exit. For a ~26K-token agentic prompt on Apple Silicon, that's ~200 seconds of full-prefill latency on the first request after every restart.

Real measurements on a Mac mini M-series running kimi-k2.6 (3.3-bit MLX):

Scenario                               First-request TTFB
Before this PR (every cold start)      ~200s (full prefill of 26K tokens)
After this PR, second cold start       ~5–15s (>95% cache hit on disk)

End-to-end verified with mlx-community/Qwen1.5-0.5B-Chat-4bit: a fresh server process serves 45 of 46 prompt tokens (98%) from disk on first request after a kill+restart cycle. See tests/test_disk_prompt_cache_e2e.py::test_restart_serves_from_disk.

What it does

  1. New module mlx_lm/disk_prompt_cache.py with DiskPromptCache class — flock-protected dir, atomic-rename writes, single background writer thread, mtime-based LRU eviction.
  2. LRUPromptCache.__init__ accepts optional disk: DiskPromptCache; insert_cache and fetch_nearest_cache get two-tier hooks (see the sketch after this list).
  3. Six new server CLI flags (--prompt-cache-disk-dir, --prompt-cache-disk-bytes, --prompt-cache-disk-fsync, --prompt-cache-disk-write-queue-size, --prompt-cache-disk-warm, --prompt-cache-disk-eviction-headroom).
  4. New module mlx_lm/cache_admin.py — inspection/pruning CLI (stats, list, prune, verify, remove subcommands).
  5. README section, CHANGELOG entry, programmatic example at examples/disk_prompt_cache.py, maintainer-facing design doc at docs/disk_prompt_cache.md.
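
A minimal, self-contained sketch of the two-tier hook in item 2. It is illustrative only: TwoTierCache is a simplified stand-in for LRUPromptCache, and only enqueue_write and search_and_touch are names taken from this PR.

import logging
from collections import OrderedDict

class TwoTierCache:
    def __init__(self, max_size, *, disk=None):
        # disk=None keeps behavior identical to a RAM-only cache.
        self.ram = OrderedDict()
        self.max_size = max_size
        self.disk = disk

    def insert(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        while len(self.ram) > self.max_size:
            self.ram.popitem(last=False)  # RAM LRU eviction (entry may still live on disk)
        if self.disk is not None:
            try:
                self.disk.enqueue_write(key, value)  # async write-through
            except Exception:
                logging.exception("disk tier failed; RAM tier unaffected")

    def fetch(self, key):
        if key in self.ram:  # exact RAM hit: skip the disk tier entirely
            self.ram.move_to_end(key)
            return self.ram[key]
        if self.disk is not None:  # lazy load on RAM miss
            value = self.disk.search_and_touch(key)
            if value is not None:
                self.ram[key] = value  # promote into RAM without rewriting disk
                self.ram.move_to_end(key)
                return value
        return None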

Design highlights

  • Per-model namespace keyed by sha256 of model.safetensors.index.json + tokenizer files + chat template + mlx_lm.__version__ major. Switching weights or tokenizers gives a clean separate namespace.
  • Path-deepest dominator replacement: when inserting a deeper trimmable leaf, shorter trimmable on-disk prefixes on the same trie path are deleted because the deeper file can serve their queries via existing trim_prompt_cache. ~10× density vs flat per-leaf storage.
  • Lazy load on RAM miss with in-flight Future dedup — concurrent requests for the same disk entry share one read.
  • Atomic writes: write to .tmp.safetensors, then os.rename to the final .safetensors. A crash never leaves a half-written file readable; orphan .tmp.safetensors files are cleaned up at next startup. Both mechanisms are sketched after this list.
  • Touch on every fetch (RAM hit OR disk hit) updates the disk entry's mtime, so hot-in-RAM entries don't get disk-evicted just because RAM is satisfying their lookups.
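
A sketch of the atomic-write and in-flight-dedup mechanisms above. The file naming and class names here are assumptions, not the PR's code.

import os
import tempfile
import threading
from concurrent.futures import Future

def atomic_write(path: str, data: bytes) -> None:
    # Write to a sibling .tmp file, fsync, then rename into place. A crash
    # leaves either the old file or the new one, never a half-written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp.safetensors")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)  # atomic within one POSIX filesystem
    except BaseException:
        os.unlink(tmp)  # leave no orphan on failure
        raise

class InFlightReads:
    # Concurrent requests for the same disk entry share one read.
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight: dict[str, Future] = {}

    def read(self, path: str) -> bytes:
        with self._lock:
            fut = self._inflight.get(path)
            owner = fut is None
            if owner:
                fut = self._inflight[path] = Future()
        if not owner:
            return fut.result()  # piggyback on the first reader
        try:
            with open(path, "rb") as f:
                fut.set_result(f.read())
        except BaseException as e:
            fut.set_exception(e)
        finally:
            with self._lock:
                del self._inflight[path]
        return fut.result()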

Tests

  • tests/test_disk_prompt_cache.py — 57 unit tests, no model load, runs in ~7s
  • tests/test_cache_admin.py — 7 unit tests
  • tests/test_disk_prompt_cache_e2e.py — 3 real-server integration tests (skip via MLX_LM_E2E_SKIP=1)
  • benchmarks/disk_cache_overhead.py — overhead vs RAM-only on the no-miss hot path; measured ~5–7% on Apple Silicon, threshold 10%
$ pytest tests/test_disk_prompt_cache.py tests/test_cache_admin.py
================== 64 passed in 6.88s ==================

Non-goals (deliberately out of scope)

  • Multi-process safety beyond a single-writer flock (one mlx_lm.server per disk dir, enforced)
  • Network-shared cache (NFS, SMB, S3) — undefined behavior
  • Encryption at rest (use FileVault / encrypted dmg / LUKS)
  • Eager warm-up beyond opt-in --prompt-cache-disk-warm=eager-top-N (currently a documented placeholder; lazy is the default)
  • Compression of disk entries
  • Delta-from-parent storage for non-trimmable cache types

Known issue

mlx_lm.server's SIGTERM/SIGINT handler does not always exit the process cleanly within the 30s drain timeout when disk cache is active — httpd.serve_forever() blocks the main thread while the signal handler tries to sys.exit(0). SIGKILL fallback still leaves disk state fully consistent thanks to the atomic-rename invariant (verified by test_sigkill_leaves_consistent_state), but the daemon currently needs a hard kill rather than a graceful exit. Fix planned as a small follow-up commit (capture httpd in the handler closure, call httpd.shutdown() before disk_cache.shutdown()).
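
One possible shape for that follow-up (a sketch, not the PR's code). A stdlib caveat applies: BaseServer.shutdown() deadlocks if called on the thread blocked in serve_forever(), so the handler delegates it to a helper thread.

import signal
import threading

def run(httpd, disk_cache):
    def handler(signum, frame):
        # Python runs signal handlers on the main thread, which is the one
        # blocked in serve_forever(), so shutdown() must come from elsewhere.
        threading.Thread(target=httpd.shutdown, daemon=True).start()

    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGINT, handler)
    try:
        httpd.serve_forever()  # returns once shutdown() completes
    finally:
        disk_cache.shutdown(timeout=30.0)  # drain write queue, release flock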

How to try it

mlx_lm.server \
    --model <your-model> \
    --prompt-cache-disk-dir ~/mlx-disk-cache \
    --prompt-cache-disk-bytes 100GB

Send a request, kill the server, restart with the same flags, send the same request — the second response will skip prefill on the cached portion.

python -m mlx_lm.cache_admin stats ~/mlx-disk-cache

🤖 Generated with Claude Code

freddyhaddad and others added 30 commits April 27, 2026 02:15
Adds stable 16-hex-char content hash for token sequences (hash_tokens)
and a per-model identity hash over weights manifest + tokenizer files
(compute_model_id), with full TDD coverage (8 tests, all passing).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
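
In the spirit of hash_tokens, a sketch of a stable 16-hex-char digest over a token sequence (the PR's actual byte encoding may differ):

import hashlib

def hash_tokens(tokens: list[int]) -> str:
    h = hashlib.sha256()
    for t in tokens:
        # Fixed-width encoding keeps the hash stable and unambiguous.
        h.update(t.to_bytes(8, "little", signed=False))
    return h.hexdigest()[:16]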
…cement

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the WriteJob dataclass for queuing pending disk writes, plus atomic
write/read-metadata helpers backed by save_prompt_cache and mx.load.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…trie reconstruction)

Adds DiskPromptCache class with __init__ that acquires dir flock, validates/writes
format-version, computes model_id, creates per-model subdir + entries/, writes info.json,
and reconstructs in-memory trie via reconstruct_disk_trie. Includes minimal shutdown()
stub that releases the lock (full impl deferred to Task 14). 4 new tests, 24 total passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
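
The single-writer lock could look roughly like this (fcntl is POSIX-only, which fits the Apple Silicon target; the lock-file name is an assumption):

import fcntl
import os

def acquire_dir_lock(cache_dir: str) -> int:
    lock_path = os.path.join(cache_dir, ".lock")  # hypothetical file name
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # fail fast if held
    except BlockingIOError:
        os.close(fd)
        raise RuntimeError(f"another mlx_lm.server already owns {cache_dir}")
    return fd  # hold open for the process lifetime; closing releases the flock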
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add DiskPromptCache.start() (spawns daemon writer + touch threads),
touch_async() (sync in-memory mtime update + async os.utime),
_touch_loop() (background utime worker), and _writer_loop() placeholder
(real impl in Task 12). 3 new tests in TestDiskPromptCacheTouch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
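
The async-touch split described above, as a rough sketch (DiskLeaf and the queue plumbing here are illustrative):

import os
import queue
import threading
import time
from dataclasses import dataclass

@dataclass
class DiskLeaf:
    path: str
    mtime: float = 0.0

class TouchWorker:
    def __init__(self):
        self._q: queue.Queue = queue.Queue()
        threading.Thread(target=self._touch_loop, daemon=True).start()

    def touch_async(self, leaf: DiskLeaf):
        leaf.mtime = time.time()  # eviction logic sees the new mtime at once
        self._q.put(leaf.path)    # the real os.utime happens off the hot path

    def _touch_loop(self):
        while True:
            path = self._q.get()
            try:
                os.utime(path, None)  # set on-disk atime/mtime to "now"
            except FileNotFoundError:
                pass  # the entry may have been evicted in the meantime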
Replace placeholder _writer_loop with real implementation: atomic write,
trie/counters update under lock, dominated-parent deletion, and eviction
stub trigger. Add enqueue_write, _handle_write_job, _run_eviction_pass
stub, and _handle_write_error stub. 3 new TestWriterLoop tests (42 total).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the _handle_write_error stub with full logic:
- ENOSPC: temporarily caps max_bytes at current usage and sets headroom
  to 0.80, runs _run_eviction_pass to free space, then retries the
  write once; if retry also fails, logs and drops the entry.
- EACCES / PermissionError: sets _writes_disabled=True so the session
  degrades silently to RAM-only cache.
- Other OSError: logs and drops the single entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
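
Schematically, the policy above might look like this (illustrative names and attributes, not the PR's code):

import errno
import logging

def handle_write_error(cache, job, err: OSError, retried: bool) -> None:
    if err.errno == errno.ENOSPC and not retried:
        cache.max_bytes = cache.current_bytes  # cap at current usage
        cache.eviction_headroom = 0.80
        cache.run_eviction_pass()              # free space
        cache.retry_write(job)                 # one retry, then drop
    elif isinstance(err, PermissionError):     # EACCES and friends
        cache.writes_disabled = True           # degrade silently to RAM-only
    else:
        logging.warning("dropping cache entry after write error: %s", err)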
Hook insert_cache to write through to the disk tier when self.disk is
set; computes path-deepest dominator prefixes for trimmable caches and
passes parents_to_evict so shorter redundant entries are deleted.
Disk-tier failures are caught and logged — never propagated to callers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
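
The dominator computation reduces to a prefix test. A sketch over a flat map (the PR walks a trie, but the invariant is the same):

def parents_to_evict(on_disk: dict[tuple, str], new_tokens: tuple) -> list[tuple]:
    # A strictly shorter on-disk prefix of new_tokens is redundant: queries
    # that would hit it now hit the deeper entry instead and get trimmed down
    # to length via trim_prompt_cache.
    return [
        key for key in on_disk
        if len(key) < len(new_tokens) and new_tokens[: len(key)] == key
    ]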
Adds 6 new argparse arguments for the disk-backed prompt cache tier:
--prompt-cache-disk-dir, --prompt-cache-disk-bytes, --prompt-cache-disk-fsync,
--prompt-cache-disk-write-queue-size, --prompt-cache-disk-warm, and
--prompt-cache-disk-eviction-headroom. Purely additive; feature is opt-in
via --prompt-cache-disk-dir. No wiring to cache instantiation yet (Task 20).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Install signal handlers inside run() after disk_cache.start() so that
SIGTERM and SIGINT trigger disk_cache.shutdown(timeout=30.0) before the
process exits, releasing the flock and draining the write queue cleanly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force mx.eval on every layer's KV state in insert_cache before handing
the WriteJob to the background writer thread. The writer thread lacks a
GPU stream context, causing RuntimeError: There is no Stream(gpu, 0) in
current thread when it triggers lazy evaluation via save_safetensors.
Evaluation in the request-handling thread (which has a stream context)
avoids the crash. Adds regression test that verifies un-evaluated arrays
survive the write path correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
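
The fix amounts to evaluating on the thread that owns a stream. A sketch, with a guessed KV-state layout:

import mlx.core as mx

def prepare_write_job(kv_state):
    # kv_state: per-layer (keys, values) mx.array pairs (layout is a guess).
    for keys, values in kv_state:
        # Force lazy graphs to materialize here, on the request thread, so
        # the stream-less writer thread only ever sees evaluated arrays.
        mx.eval(keys, values)
    return kv_state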
Add mlx_lm/cache_admin.py skeleton with stats and list subcommands
that read disk-cache directories without importing model machinery,
plus tests/test_cache_admin.py covering the two commands and the
missing-format-version error path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds benchmarks/disk_cache_overhead.py which measures fetch_nearest_cache
latency with disk=None vs disk=on on repeat exact-RAM-hit traffic.

Also fixes three performance regressions discovered during benchmarking:
- Skip disk.search() on exact RAM hits (disk can't beat quality 3)
- Add DiskPromptCache.search_and_touch() to combine search + mtime update
  in one lock acquisition instead of two
- Add _hash_to_leaf: Dict[str, DiskLeaf] for O(1) mtime updates in
  touch_async, replacing the prior O(N-tokens) trie.get walk
- Pre-compute and store token_hash in LRUPromptCache.CacheEntry so
  fetch_nearest_cache never recomputes SHA256 on the hot no-miss path

Result: 4.1% overhead (within the <5% budget) vs 58.8% before fixes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 5% threshold was tighter than the noise floor for these timings.
Absolute overhead is ~1 µs/op which is dominated by Python interpreter
+ scheduler noise. Using min-of-runs per side filters noise, and the
10% threshold still catches any real regression (which would be much
larger than 1 µs). Real-world runtime is unaffected because decode
latency is GPU-bound and dwarfs cache-lookup overhead.
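
The min-of-runs comparison can be expressed in a few lines (a sketch of the approach, not the benchmark's code):

import timeit

def overhead_pct(baseline_fn, candidate_fn, runs=5, number=10_000) -> float:
    # min-of-runs per side filters interpreter and scheduler noise; any real
    # regression shows up as a ratio well above the 10% threshold.
    base = min(timeit.repeat(baseline_fn, repeat=runs, number=number))
    cand = min(timeit.repeat(candidate_fn, repeat=runs, number=number))
    return 100.0 * (cand - base) / base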