Improve OOM handling and add memory controls for long-running coding agent sessions by dmitryryabkov · Pull Request #948 · ml-explore/mlx-lm

dmitryryabkov · 2026-03-04T00:03:57Z

This PR improves mlx_lm.server stability under Metal memory pressure by adding memory limit controls, and it improves failure behavior for long-context, multi-turn workloads such as coding agents.

Problem

In long-running chat sessions, active KV memory can grow until Metal OOM is hit. In practice this led to:

process aborts (kIOGPUCommandBufferCallbackErrorOutOfMemory)
unstable behavior near memory limits
a poor user experience when trading memory safety against output quality

The existing prompt cache size controls (--prompt-cache-bytes) help, but do not fully address active in-flight KV growth.

The problem was most obvious when using coding agents with mlx_lm.server.

Improvements in this PR

1) Prompt context limit (primary unbounded memory growth control)

Add --max-prompt-tokens
Add --prompt-overflow-policy with:
- error (reject oversized prompts)
- truncate
Add --prompt-keep-tokens to retain tokens from the beginning of the prompt in truncate mode

This is the key mechanism to stop memory growth over multiple interactions.
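
The two overflow policies can be sketched as follows. This is a minimal illustration, not the PR's implementation: the helper name `apply_prompt_limit`, the exception class, and the choice to fill the remaining budget with the most recent tokens are all assumptions. The description above only states that truncate mode retains tokens from the beginning of the prompt.

```python
from typing import List


class PromptTooLongError(ValueError):
    """Raised when a prompt exceeds the cap and the policy is 'error'."""


def apply_prompt_limit(
    tokens: List[int],
    max_prompt_tokens: int,
    policy: str = "error",
    keep_tokens: int = 0,
) -> List[int]:
    """Enforce a prompt token cap (hypothetical helper, not the PR's code)."""
    if len(tokens) <= max_prompt_tokens:
        return tokens
    if policy == "error":
        raise PromptTooLongError(
            f"prompt has {len(tokens)} tokens, limit is {max_prompt_tokens}"
        )
    if policy != "truncate":
        raise ValueError(f"unknown overflow policy: {policy!r}")
    # Assumption: keep the first `keep_tokens` tokens (for example a system
    # prompt) and spend the rest of the budget on the most recent tokens.
    head = min(keep_tokens, max_prompt_tokens)
    tail = max_prompt_tokens - head
    recent = tokens[len(tokens) - tail:] if tail else []
    return tokens[:head] + recent
```

With a 10-token prompt, a cap of 6, and 2 kept tokens, this sketch returns the first 2 tokens followed by the last 4, so the middle of the conversation is what gets dropped.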

2) Active memory limit controls

Add --max-active-kv-bytes (admission control based on projected KV growth)
Add --max-active-memory-bytes (runtime guard using mx.get_active_memory() during prefill/progress)
Add --prefill-step-size to reduce prefill memory spikes by processing the prompt in smaller chunks
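
A rough sketch of how these three controls could fit together, under stated assumptions: the per-token KV estimate below is the usual formula for an unquantized attention cache (keys plus values, per layer), and all function names are hypothetical. The memory probe is passed in as a callable so the example runs without MLX; in the server it would presumably be `mx.get_active_memory()`, as named above.

```python
from typing import Callable, Iterator, List


def kv_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of K and V stored per token for a plain (unquantized) cache."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def admit(
    active_kv_bytes: int,
    prompt_tokens: int,
    max_new_tokens: int,
    per_token_bytes: int,
    max_active_kv_bytes: int,
) -> bool:
    """Admission control: accept only if projected KV usage stays under the cap."""
    projected = active_kv_bytes + (prompt_tokens + max_new_tokens) * per_token_bytes
    return projected <= max_active_kv_bytes


class MemoryLimitExceeded(RuntimeError):
    """Raised by the runtime guard when active memory passes the ceiling."""


def prefill_chunks(
    tokens: List[int],
    step_size: int,
    get_active_memory: Callable[[], int],
    max_active_memory_bytes: int,
) -> Iterator[List[int]]:
    """Yield the prompt in chunks, checking the memory probe before each one."""
    for start in range(0, len(tokens), step_size):
        used = get_active_memory()
        if used > max_active_memory_bytes:
            raise MemoryLimitExceeded(
                f"active memory {used} exceeds limit {max_active_memory_bytes}"
            )
        yield tokens[start:start + step_size]
```

The admission check rejects work before any memory is allocated, while the chunked guard catches growth that the projection missed. Smaller chunks also mean more frequent checks, at some cost in prefill throughput.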

3) Fixed active KV window option

Add --max-kv-size to the server (both single and batched paths) to enable a rotating cache with a bounded working set where needed.
Document the potential quality tradeoff.
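
To illustrate why a rotating cache bounds memory and where the quality tradeoff comes from, here is a toy model that works on token ids rather than key/value tensors. The class name and the `keep` parameter (entries pinned at the start of the window) are assumptions for this sketch, not a description of the actual cache implementation.

```python
from typing import List


class RotatingWindow:
    """Toy bounded window: pins the first `keep` entries, rotates the rest."""

    def __init__(self, max_size: int, keep: int = 0) -> None:
        if not 0 <= keep < max_size:
            raise ValueError("keep must satisfy 0 <= keep < max_size")
        self.max_size = max_size
        self.keep = keep
        self.items: List[int] = []

    def append(self, item: int) -> None:
        if len(self.items) == self.max_size:
            # Evict the oldest entry that is not pinned.
            del self.items[self.keep]
        self.items.append(item)
```

The window never holds more than `max_size` entries, so memory stays flat however long the session runs. Evicted entries can no longer be attended to, which is the source of the quality tradeoff.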

4) Better OOM failure handling

Detect Metal OOM-like failures and map them to HTTP 503 instead of a generic 404.
Use 500 for non-OOM server-side failures.
Harden batched generation failure propagation so failures are surfaced cleanly per request.
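
The status mapping could look roughly like the sketch below. The function name and the marker list are assumptions: only `kIOGPUCommandBufferCallbackErrorOutOfMemory` comes from the description above, and the other substrings are guesses at what "OOM-like" might cover.

```python
# Substrings treated as out-of-memory signals. Only the first comes from the
# PR description; the rest are illustrative assumptions.
OOM_MARKERS = (
    "kIOGPUCommandBufferCallbackErrorOutOfMemory",
    "out of memory",
    "insufficient memory",
)


def status_for_exception(exc: BaseException) -> int:
    """Map a generation failure to an HTTP status (hypothetical helper)."""
    message = str(exc).lower()
    is_oom = isinstance(exc, MemoryError) or any(
        marker.lower() in message for marker in OOM_MARKERS
    )
    # 503 tells clients the condition is temporary and a retry may succeed;
    # 500 covers every other server-side failure.
    return 503 if is_oom else 500
```

Returning 503 for memory pressure lets a coding agent back off and retry, possibly with a shorter context, rather than treating the failure as a missing resource.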

5) Documentation

SERVER.md updated with:

all server memory-control flags in one section
explanation of how --max-active-memory-bytes, --max-active-kv-bytes, and --max-kv-size interact
practical sizing order and example values (tested on my 36 GB machine, the only environment I have at the moment)

Relationship to other in-flight / recent PRs

Complements cache limiting work from:
- Improve the cache size limits #906 (cache size limits)
- Make the cache limits more friendly #910 (friendlier cache limit UX)
Directly addresses the issues highlighted in:
- mlx_lm.server crashes on Metal OOM instead of returning an HTTP error #854 (server crashes on Metal OOM)
Designed to work with future kv-quantization server work (e.g. feat: add --kv-bits CLI args for server KV cache quantization #934), but keeps that separate due to current model/backend compatibility complexity.

dmitryryabkov added 15 commits February 28, 2026 21:32

server: return 503 on Metal OOM and harden generation failures

cf6ed13

server: add KV limits, admission control, and quantization wiring

dbec8be

docs: add mlx_lm.server memory control options and examples

901af37

cache: import tree_reduce used by quantized cache nbytes

16a6a6e

cache: make quantized nbytes robust for empty state

745f14d

server: reject kv_bits for unsupported MLA-style model caches

c0620fb

server: enforce active memory ceiling during prefill

cfd11ac

server: add max prompt token cap with truncate/reject policies

0130d3b

server: drop KV quantization scope from memory-safety PR

e7fd8d5

docs: explain server memory limits and sizing guidance

d3a63c6

server: honor --prompt-cache-bytes 0 without fallback

28f1f5b

docs: final polish

90be0f3

server: remove unnecessary code

07853af

server: remove unnecessary test code

78ab502

Merge branch 'main' into fix-server-oom-handling-kv-bounds

2fc32b3

dmitryryabkov wants to merge 15 commits into ml-explore:main from dmitryryabkov:fix-server-oom-handling-kv-bounds
