
feat(server): opt-in disk-backed L2 prompt cache (--prompt-cache-disk-dir) #1218

Open

freddyhaddad wants to merge 30 commits into ml-explore:main from freddyhaddad:disk-prompt-cache

Conversation

@freddyhaddad

Summary

Adds an opt-in disk-backed L2 prompt cache to mlx_lm.server. With --prompt-cache-disk-dir set, every entry the in-memory LRUPromptCache holds is also persisted via write-through async writes to a per-model namespace on disk. Cached prefixes survive process restarts and runtime LRU evictions.

Off by default — no flag = byte-identical behavior to the pre-PR baseline. LRUPromptCache(max_size, max_bytes) constructor signature unchanged (new disk=None is keyword-only). All existing tests pass unchanged.

Motivation

Long-lived mlx_lm.server deployments (agentic CLIs, OpenAI-compatible API gateways with large system prompts) lose their entire prompt cache on every process exit. For a ~26K-token agentic prompt on Apple Silicon, that's ~200 seconds of full-prefill latency on the first request after every restart.

Real measurements on a Mac mini M-series running kimi-k2.6 (3.3-bit MLX):

Scenario                               First-request TTFB
Before this PR (every cold start)      ~200s (full prefill of 26K tokens)
After this PR, second cold start       ~5–15s (>95% cache hit on disk)

End-to-end verified with mlx-community/Qwen1.5-0.5B-Chat-4bit: a fresh server process serves 45 of 46 prompt tokens (98%) from disk on first request after a kill+restart cycle. See tests/test_disk_prompt_cache_e2e.py::test_restart_serves_from_disk.

What it does

  1. New module mlx_lm/disk_prompt_cache.py with DiskPromptCache class — flock-protected dir, atomic-rename writes, single background writer thread, mtime-based LRU eviction.
  2. LRUPromptCache.__init__ accepts optional disk: DiskPromptCache; insert_cache and fetch_nearest_cache get two-tier hooks (see the sketch after this list).
  3. Six new server CLI flags (--prompt-cache-disk-dir, --prompt-cache-disk-bytes, --prompt-cache-disk-fsync, --prompt-cache-disk-write-queue-size, --prompt-cache-disk-warm, --prompt-cache-disk-eviction-headroom).
  4. New module mlx_lm/cache_admin.py — inspection/pruning CLI (stats, list, prune, verify, remove subcommands).
  5. README section, CHANGELOG entry, programmatic example at examples/disk_prompt_cache.py, maintainer-facing design doc at docs/disk_prompt_cache.md.
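
A minimal, self-contained sketch of the two-tier hook in item 2. It is illustrative only: TwoTierCache is a simplified stand-in for LRUPromptCache, and only enqueue_write and search_and_touch are names taken from this PR.

import logging
from collections import OrderedDict

class TwoTierCache:
    def __init__(self, max_size, *, disk=None):
        # disk=None keeps behavior identical to a RAM-only cache.
        self.ram = OrderedDict()
        self.max_size = max_size
        self.disk = disk

    def insert(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        while len(self.ram) > self.max_size:
            self.ram.popitem(last=False)  # RAM LRU eviction (entry may still live on disk)
        if self.disk is not None:
            try:
                self.disk.enqueue_write(key, value)  # async write-through
            except Exception:
                logging.exception("disk tier failed; RAM tier unaffected")

    def fetch(self, key):
        if key in self.ram:  # exact RAM hit: skip the disk tier entirely
            self.ram.move_to_end(key)
            return self.ram[key]
        if self.disk is not None:  # lazy load on RAM miss
            value = self.disk.search_and_touch(key)
            if value is not None:
                self.ram[key] = value  # promote into RAM without rewriting disk
                self.ram.move_to_end(key)
                return value
        return None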

Design highlights

  • Per-model namespace keyed by sha256 of model.safetensors.index.json + tokenizer files + chat template + mlx_lm.__version__ major. Switching weights or tokenizers gives a clean separate namespace.
  • Path-deepest dominator replacement: when inserting a deeper trimmable leaf, shorter trimmable on-disk prefixes on the same trie path are deleted because the deeper file can serve their queries via existing trim_prompt_cache. ~10× density vs flat per-leaf storage.
  • Lazy load on RAM miss with in-flight Future dedup — concurrent requests for the same disk entry share one read.
  • Atomic writes: write to .tmp.safetensors, then os.rename to the final .safetensors. A crash never leaves a half-written file readable; orphan .tmp.safetensors files are cleaned up at next startup. Both mechanisms are sketched after this list.
  • Touch on every fetch (RAM hit OR disk hit) updates the disk entry's mtime, so hot-in-RAM entries don't get disk-evicted just because RAM is satisfying their lookups.
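
A sketch of the atomic-write and in-flight-dedup mechanisms above. The file naming and class names here are assumptions, not the PR's code.

import os
import tempfile
import threading
from concurrent.futures import Future

def atomic_write(path: str, data: bytes) -> None:
    # Write to a sibling .tmp file, fsync, then rename into place. A crash
    # leaves either the old file or the new one, never a half-written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp.safetensors")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)  # atomic within one POSIX filesystem
    except BaseException:
        os.unlink(tmp)  # leave no orphan on failure
        raise

class InFlightReads:
    # Concurrent requests for the same disk entry share one read.
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight: dict[str, Future] = {}

    def read(self, path: str) -> bytes:
        with self._lock:
            fut = self._inflight.get(path)
            owner = fut is None
            if owner:
                fut = self._inflight[path] = Future()
        if not owner:
            return fut.result()  # piggyback on the first reader
        try:
            with open(path, "rb") as f:
                fut.set_result(f.read())
        except BaseException as e:
            fut.set_exception(e)
        finally:
            with self._lock:
                del self._inflight[path]
        return fut.result()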

Tests

  • tests/test_disk_prompt_cache.py — 57 unit tests, no model load, runs in ~7s
  • tests/test_cache_admin.py — 7 unit tests
  • tests/test_disk_prompt_cache_e2e.py — 3 real-server integration tests (skip via MLX_LM_E2E_SKIP=1)
  • benchmarks/disk_cache_overhead.py — overhead vs RAM-only on the no-miss hot path; measured ~5–7% on Apple Silicon, threshold 10%
$ pytest tests/test_disk_prompt_cache.py tests/test_cache_admin.py
================== 64 passed in 6.88s ==================

Non-goals (deliberately out of scope)

  • Multi-process safety beyond a single-writer flock (one mlx_lm.server per disk dir, enforced)
  • Network-shared cache (NFS, SMB, S3) — undefined behavior
  • Encryption at rest (use FileVault / encrypted dmg / LUKS)
  • Eager warm-up beyond opt-in --prompt-cache-disk-warm=eager-top-N (currently a documented placeholder; lazy is the default)
  • Compression of disk entries
  • Delta-from-parent storage for non-trimmable cache types

Known issue

mlx_lm.server's SIGTERM/SIGINT handler does not always exit the process cleanly within the 30s drain timeout when disk cache is active — httpd.serve_forever() blocks the main thread while the signal handler tries to sys.exit(0). SIGKILL fallback still leaves disk state fully consistent thanks to the atomic-rename invariant (verified by test_sigkill_leaves_consistent_state), but the daemon currently needs a hard kill rather than a graceful exit. Fix planned as a small follow-up commit (capture httpd in the handler closure, call httpd.shutdown() before disk_cache.shutdown()).
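
One possible shape for that follow-up (a sketch, not the PR's code). A stdlib caveat applies: BaseServer.shutdown() deadlocks if called on the thread blocked in serve_forever(), so the handler delegates it to a helper thread.

import signal
import threading

def run(httpd, disk_cache):
    def handler(signum, frame):
        # Python runs signal handlers on the main thread, which is the one
        # blocked in serve_forever(), so shutdown() must come from elsewhere.
        threading.Thread(target=httpd.shutdown, daemon=True).start()

    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGINT, handler)
    try:
        httpd.serve_forever()  # returns once shutdown() completes
    finally:
        disk_cache.shutdown(timeout=30.0)  # drain write queue, release flock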

How to try it

mlx_lm.server \
    --model <your-model> \
    --prompt-cache-disk-dir ~/mlx-disk-cache \
    --prompt-cache-disk-bytes 100GB

Send a request, kill the server, restart with the same flags, send the same request — the second response will skip prefill on the cached portion.

python -m mlx_lm.cache_admin stats ~/mlx-disk-cache

🤖 Generated with Claude Code

freddyhaddad and others added 30 commits April 27, 2026 02:15
Adds stable 16-hex-char content hash for token sequences (hash_tokens)
and a per-model identity hash over weights manifest + tokenizer files
(compute_model_id), with full TDD coverage (8 tests, all passing).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
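
In the spirit of hash_tokens, a sketch of a stable 16-hex-char digest over a token sequence (the PR's actual byte encoding may differ):

import hashlib

def hash_tokens(tokens: list[int]) -> str:
    h = hashlib.sha256()
    for t in tokens:
        # Fixed-width encoding keeps the hash stable and unambiguous.
        h.update(t.to_bytes(8, "little", signed=False))
    return h.hexdigest()[:16]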
…cement

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the WriteJob dataclass for queuing pending disk writes, plus atomic
write/read-metadata helpers backed by save_prompt_cache and mx.load.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…trie reconstruction)

Adds DiskPromptCache class with __init__ that acquires dir flock, validates/writes
format-version, computes model_id, creates per-model subdir + entries/, writes info.json,
and reconstructs in-memory trie via reconstruct_disk_trie. Includes minimal shutdown()
stub that releases the lock (full impl deferred to Task 14). 4 new tests, 24 total passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
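
The single-writer lock could look roughly like this (fcntl is POSIX-only, which fits the Apple Silicon target; the lock-file name is an assumption):

import fcntl
import os

def acquire_dir_lock(cache_dir: str) -> int:
    lock_path = os.path.join(cache_dir, ".lock")  # hypothetical file name
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # fail fast if held
    except BlockingIOError:
        os.close(fd)
        raise RuntimeError(f"another mlx_lm.server already owns {cache_dir}")
    return fd  # hold open for the process lifetime; closing releases the flock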
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add DiskPromptCache.start() (spawns daemon writer + touch threads),
touch_async() (sync in-memory mtime update + async os.utime),
_touch_loop() (background utime worker), and _writer_loop() placeholder
(real impl in Task 12). 3 new tests in TestDiskPromptCacheTouch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
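
The async-touch split described above, as a rough sketch (DiskLeaf and the queue plumbing here are illustrative):

import os
import queue
import threading
import time
from dataclasses import dataclass

@dataclass
class DiskLeaf:
    path: str
    mtime: float = 0.0

class TouchWorker:
    def __init__(self):
        self._q: queue.Queue = queue.Queue()
        threading.Thread(target=self._touch_loop, daemon=True).start()

    def touch_async(self, leaf: DiskLeaf):
        leaf.mtime = time.time()  # eviction logic sees the new mtime at once
        self._q.put(leaf.path)    # the real os.utime happens off the hot path

    def _touch_loop(self):
        while True:
            path = self._q.get()
            try:
                os.utime(path, None)  # set on-disk atime/mtime to "now"
            except FileNotFoundError:
                pass  # the entry may have been evicted in the meantime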
Replace placeholder _writer_loop with real implementation: atomic write,
trie/counters update under lock, dominated-parent deletion, and eviction
stub trigger. Add enqueue_write, _handle_write_job, _run_eviction_pass
stub, and _handle_write_error stub. 3 new TestWriterLoop tests (42 total).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the _handle_write_error stub with full logic:
- ENOSPC: temporarily caps max_bytes at current usage and sets headroom
  to 0.80, runs _run_eviction_pass to free space, then retries the
  write once; if retry also fails, logs and drops the entry.
- EACCES / PermissionError: sets _writes_disabled=True so the session
  degrades silently to RAM-only cache.
- Other OSError: logs and drops the single entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
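
Schematically, the policy above might look like this (illustrative names and attributes, not the PR's code):

import errno
import logging

def handle_write_error(cache, job, err: OSError, retried: bool) -> None:
    if err.errno == errno.ENOSPC and not retried:
        cache.max_bytes = cache.current_bytes  # cap at current usage
        cache.eviction_headroom = 0.80
        cache.run_eviction_pass()              # free space
        cache.retry_write(job)                 # one retry, then drop
    elif isinstance(err, PermissionError):     # EACCES and friends
        cache.writes_disabled = True           # degrade silently to RAM-only
    else:
        logging.warning("dropping cache entry after write error: %s", err)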
Hook insert_cache to write through to the disk tier when self.disk is
set; computes path-deepest dominator prefixes for trimmable caches and
passes parents_to_evict so shorter redundant entries are deleted.
Disk-tier failures are caught and logged — never propagated to callers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
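
The dominator computation reduces to a prefix test. A sketch over a flat map (the PR walks a trie, but the invariant is the same):

def parents_to_evict(on_disk: dict[tuple, str], new_tokens: tuple) -> list[tuple]:
    # A strictly shorter on-disk prefix of new_tokens is redundant: queries
    # that would hit it now hit the deeper entry instead and get trimmed down
    # to length via trim_prompt_cache.
    return [
        key for key in on_disk
        if len(key) < len(new_tokens) and new_tokens[: len(key)] == key
    ]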
Adds 6 new argparse arguments for the disk-backed prompt cache tier:
--prompt-cache-disk-dir, --prompt-cache-disk-bytes, --prompt-cache-disk-fsync,
--prompt-cache-disk-write-queue-size, --prompt-cache-disk-warm, and
--prompt-cache-disk-eviction-headroom. Purely additive; feature is opt-in
via --prompt-cache-disk-dir. No wiring to cache instantiation yet (Task 20).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Install signal handlers inside run() after disk_cache.start() so that
SIGTERM and SIGINT trigger disk_cache.shutdown(timeout=30.0) before the
process exits, releasing the flock and draining the write queue cleanly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force mx.eval on every layer's KV state in insert_cache before handing
the WriteJob to the background writer thread. The writer thread lacks a
GPU stream context, causing RuntimeError: There is no Stream(gpu, 0) in
current thread when it triggers lazy evaluation via save_safetensors.
Evaluation in the request-handling thread (which has a stream context)
avoids the crash. Adds regression test that verifies un-evaluated arrays
survive the write path correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
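
The fix amounts to evaluating on the thread that owns a stream. A sketch, with a guessed KV-state layout:

import mlx.core as mx

def prepare_write_job(kv_state):
    # kv_state: per-layer (keys, values) mx.array pairs (layout is a guess).
    for keys, values in kv_state:
        # Force lazy graphs to materialize here, on the request thread, so
        # the stream-less writer thread only ever sees evaluated arrays.
        mx.eval(keys, values)
    return kv_state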
Add mlx_lm/cache_admin.py skeleton with stats and list subcommands
that read disk-cache directories without importing model machinery,
plus tests/test_cache_admin.py covering the two commands and the
missing-format-version error path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds benchmarks/disk_cache_overhead.py which measures fetch_nearest_cache
latency with disk=None vs disk=on on repeat exact-RAM-hit traffic.

Also fixes three performance regressions discovered during benchmarking:
- Skip disk.search() on exact RAM hits (disk can't beat quality 3)
- Add DiskPromptCache.search_and_touch() to combine search + mtime update
  in one lock acquisition instead of two
- Add _hash_to_leaf: Dict[str, DiskLeaf] for O(1) mtime updates in
  touch_async, replacing the prior O(N-tokens) trie.get walk
- Pre-compute and store token_hash in LRUPromptCache.CacheEntry so
  fetch_nearest_cache never recomputes SHA256 on the hot no-miss path

Result: 4.1% overhead (within the <5% budget) vs 58.8% before fixes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 5% threshold was tighter than the noise floor for these timings.
Absolute overhead is ~1 µs/op which is dominated by Python interpreter
+ scheduler noise. Using min-of-runs per side filters noise, and the
10% threshold still catches any real regression (which would be much
larger than 1 µs). Real-world runtime is unaffected because decode
latency is GPU-bound and dwarfs cache-lookup overhead.
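
The min-of-runs comparison can be expressed in a few lines (a sketch of the approach, not the benchmark's code):

import timeit

def overhead_pct(baseline_fn, candidate_fn, runs=5, number=10_000) -> float:
    # min-of-runs per side filters interpreter and scheduler noise; any real
    # regression shows up as a ratio well above the 10% threshold.
    base = min(timeit.repeat(baseline_fn, repeat=runs, number=number))
    cand = min(timeit.repeat(candidate_fn, repeat=runs, number=number))
    return 100.0 * (cand - base) / base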