Improve OOM handling and add memory controls for long-running coding agent sessions #948

Open
dmitryryabkov wants to merge 15 commits into ml-explore:main from dmitryryabkov:fix-server-oom-handling-kv-bounds

@dmitryryabkov

This PR improves mlx_lm.server stability under Metal memory pressure by adding memory limit controls and improving failure behavior for long-context, multi-turn workloads (such as coding agents).

Problem

In long-running chat sessions, active KV memory can grow until Metal OOM is hit. In practice this led to:

  • process aborts (kIOGPUCommandBufferCallbackErrorOutOfMemory)
  • unstable behavior near memory limits
  • poor user experience when tuning memory safety against output quality

The existing prompt cache size controls (--prompt-cache-bytes) help, but do not fully address active in-flight KV growth.

The problem was most obvious when using coding agents with mlx_lm.server.

Improvements in this PR

1) Prompt context limit (primary control against unbounded memory growth)

  • Add --max-prompt-tokens
  • Add --prompt-overflow-policy with:
    • error (reject oversized prompts)
    • truncate
  • Add --prompt-keep-tokens to retain tokens from the beginning of the prompt in truncate mode

This is the key mechanism to stop memory growth over multiple interactions.
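
As a rough illustration of how the two overflow policies could behave, here is a minimal sketch. The function name `apply_prompt_overflow_policy` and its parameters are hypothetical, and the truncation semantics (keep the first `keep_tokens` tokens, then fill the remaining budget with the most recent tokens) are an assumption about what truncate mode does, not a copy of the PR's code.

```python
from typing import List


def apply_prompt_overflow_policy(
    tokens: List[int],
    max_prompt_tokens: int,
    policy: str = "error",
    keep_tokens: int = 0,
) -> List[int]:
    """Hypothetical helper: bound a tokenized prompt to max_prompt_tokens."""
    if len(tokens) <= max_prompt_tokens:
        return tokens  # Already within the limit, nothing to do.
    if policy == "error":
        raise ValueError(
            f"prompt has {len(tokens)} tokens, limit is {max_prompt_tokens}"
        )
    if policy != "truncate":
        raise ValueError(f"unknown overflow policy: {policy!r}")
    # Assumed semantics: pin the start of the prompt (for example the system
    # prompt) and spend the rest of the budget on the most recent tokens.
    head = min(max(keep_tokens, 0), max_prompt_tokens)
    tail = max_prompt_tokens - head
    return tokens[:head] + tokens[len(tokens) - tail:]
```

With this shape, the middle of an oversized conversation is dropped, which is where the quality cost of truncate mode would come from.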

2) Active memory limit controls

  • Add --max-active-kv-bytes (admission control based on projected KV growth)
  • Add --max-active-memory-bytes (runtime guard using mx.get_active_memory() during prefill/progress)
  • Add --prefill-step-size to reduce prefill memory spikes by processing prompt in smaller chunks
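
The three controls above can be sketched roughly as follows. All function names are hypothetical, and the KV size formula is the standard estimate for an unquantized cache (keys plus values, per layer, per KV head); the PR may project growth differently. The memory reader is passed in as a callable so the sketch runs without MLX; in the server it would presumably be `mx.get_active_memory`.

```python
from typing import Callable, Iterator, List


def projected_kv_bytes(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,  # Assumes fp16/bf16 cache entries.
) -> int:
    # Two tensors (keys and values) per layer, each holding
    # num_kv_heads * head_dim elements per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * num_tokens


def admit_request(
    prompt_tokens: int,
    max_new_tokens: int,
    max_active_kv_bytes: int,
    **model_dims: int,
) -> bool:
    # Admission control: reject before any work is done if the worst-case
    # KV footprint of this request would exceed the budget.
    total = prompt_tokens + max_new_tokens
    return projected_kv_bytes(total, **model_dims) <= max_active_kv_bytes


def check_active_memory(
    get_active_memory: Callable[[], int], max_active_memory_bytes: int
) -> None:
    # Runtime guard: meant to be called between prefill chunks.
    used = get_active_memory()
    if used > max_active_memory_bytes:
        raise MemoryError(
            f"active memory {used} exceeds limit {max_active_memory_bytes}"
        )


def prefill_chunks(tokens: List[int], step_size: int) -> Iterator[List[int]]:
    # Smaller steps lower the peak memory of each prefill pass.
    for start in range(0, len(tokens), step_size):
        yield tokens[start:start + step_size]
```

The admission check is conservative by design, since it assumes the request generates all `max_new_tokens`; the runtime guard catches whatever the projection misses.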

3) Fixed active KV window option

  • Add --max-kv-size to the server path (both single and batched) to enable a rotating cache with a bounded working set where needed.
  • Document the potential quality tradeoff.
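
To make the bounded working set and its quality tradeoff concrete, here is a toy model of a rotating window. The class `RotatingWindow` is hypothetical and stores plain token ids rather than key/value tensors; it is not the mlx_lm cache implementation, only an illustration of the eviction idea, and the pinned-prefix `keep` parameter is an assumption.

```python
from typing import List


class RotatingWindow:
    """Toy stand-in for a rotating KV cache holding at most max_size entries."""

    def __init__(self, max_size: int, keep: int = 0):
        if not 0 <= keep < max_size:
            raise ValueError("keep must satisfy 0 <= keep < max_size")
        self.max_size = max_size
        self.keep = keep  # Number of leading entries that are never evicted.
        self.entries: List[int] = []

    def append(self, token_id: int) -> None:
        if len(self.entries) == self.max_size:
            # Evict the oldest entry that sits after the pinned prefix.
            del self.entries[self.keep]
        self.entries.append(token_id)
```

Memory stays constant once the window is full, but evicted tokens can no longer be attended to, which is the source of the quality tradeoff.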

4) Better OOM failure handling

  • Detect Metal OOM-like failures and map them to HTTP 503 instead of generic 404.
  • Use 500 for non-OOM server-side failures.
  • Harden batched generation failure propagation so failures are surfaced cleanly per request.
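
The status mapping described above might look something like the sketch below. The function name and the marker list are assumptions: only `kIOGPUCommandBufferCallbackErrorOutOfMemory` comes from this PR description, and the other substrings are illustrative guesses at what an "OOM-like" message could contain.

```python
# Substrings treated as OOM indicators. Only the first one appears in the PR
# description; the rest are illustrative assumptions.
OOM_MARKERS = (
    "kiogpucommandbuffercallbackerroroutofmemory",
    "out of memory",
    "insufficient memory",
)


def http_status_for_failure(exc: BaseException) -> int:
    """Hypothetical mapping from a generation failure to an HTTP status."""
    message = str(exc).lower()
    if isinstance(exc, MemoryError) or any(m in message for m in OOM_MARKERS):
        return 503  # Service unavailable: the client may retry later.
    return 500  # Any other server-side failure.
```

Returning 503 tells a client such as a coding agent that the condition is transient and worth retrying, typically with a smaller context.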

5) Documentation

SERVER.md updated with:

  • all server memory-control flags in one section
  • explanation of how --max-active-memory-bytes, --max-active-kv-bytes, and --max-kv-size interact
  • practical sizing order and example values (tested on my 36GB machine, the only environment I have at the moment)

Relationship to other in-flight / recent PRs
