Improve OOM handling and add memory controls for long-running coding agent sessions #948
Open
dmitryryabkov wants to merge 15 commits into ml-explore:main from
Conversation
This PR improves mlx_lm.server stability under Metal memory pressure by adding memory limit controls and improving failure behavior for long-context, multi-turn workloads (such as coding agents).
Problem
In long-running chat sessions, active KV memory can grow without bound until a Metal OOM is hit. In practice this led to hard out-of-memory failures mid-session.
The existing prompt cache size controls (`--prompt-cache-bytes`) help, but do not fully address active in-flight KV growth. The problem was most obvious when using coding agents with mlx_lm.server.
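To see why in-flight KV memory dominates in long sessions, it helps to estimate the per-token KV cache cost. The sketch below uses illustrative model dimensions (32 layers, 8 KV heads, head dim 128, fp16); these numbers are assumptions for the example, not figures from this PR:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes appended to the KV cache per token:
    2 tensors (K and V) x layers x KV heads x head dim x element size."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


# Illustrative 8B-class model with grouped-query attention (assumed shapes):
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(per_token)                      # 131072 bytes = 128 KiB per token

# A 100k-token agent session then holds roughly 12.2 GiB of active KV:
print(100_000 * per_token / 2**30)
```

At that rate, a coding-agent session that keeps appending context will eventually exhaust Metal memory regardless of how the reusable prompt cache is bounded, which is why the active-KV controls below are needed.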
Improvements in this PR
1) Prompt context limit (the primary control for unbounded memory growth)
- `--max-prompt-tokens`
- `--prompt-overflow-policy` with:
  - `error` (reject oversized prompts)
  - `truncate`
- `--prompt-keep-tokens` to retain tokens from the beginning of the prompt in `truncate` mode

This is the key mechanism to stop memory growth over multiple interactions.
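A minimal sketch of how such an overflow policy might behave. The helper name and the exact head/tail split are hypothetical, not the PR's actual implementation:

```python
def apply_overflow_policy(tokens, max_prompt_tokens, policy="truncate",
                          keep_tokens=0):
    """Sketch of oversized-prompt handling (hypothetical helper).

    'error' rejects the request outright; 'truncate' keeps the first
    `keep_tokens` tokens (e.g. the system prompt) plus the most recent
    tail so the total stays within `max_prompt_tokens`.
    """
    if len(tokens) <= max_prompt_tokens:
        return list(tokens)
    if policy == "error":
        raise ValueError(
            f"prompt has {len(tokens)} tokens, limit is {max_prompt_tokens}"
        )
    head = list(tokens[:keep_tokens])
    tail_len = max(max_prompt_tokens - len(head), 0)
    tail = list(tokens[len(tokens) - tail_len:]) if tail_len else []
    return head + tail
```

Keeping the prompt head matters for agents because system instructions usually sit at the start of the context, while the most recent turns carry the working state.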
2) Active memory limit controls
- `--max-active-kv-bytes` (admission control based on projected KV growth)
- `--max-active-memory-bytes` (runtime guard using `mx.get_active_memory()` during prefill/progress)
- `--prefill-step-size` to reduce prefill memory spikes by processing the prompt in smaller chunks

3) Fixed active KV window option
- `--max-kv-size` in the server path (single and batched paths) to enable a rotating-cache bounded working set where needed

4) Better OOM failure handling
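As a sketch of the intended failure mode: instead of letting an allocation failure take down the server process, the generation step can be wrapped so the request fails cleanly and cached buffers are released. The `OutOfMemoryError` class and the `step_fn`/`cleanup_fn` hooks below are stand-ins, assuming nothing about the PR's actual exception types or cleanup path:

```python
class OutOfMemoryError(RuntimeError):
    """Stand-in for whatever exception the backend raises on Metal OOM."""


def run_generation(step_fn, cleanup_fn):
    """Turn a mid-generation OOM into a clean error result.

    step_fn    -- hypothetical callable running one generation request
    cleanup_fn -- hypothetical callable releasing cached Metal buffers
    """
    try:
        return {"ok": True, "text": step_fn()}
    except OutOfMemoryError as e:
        # Free cached buffers so subsequent requests on this server
        # instance are not doomed by the failed allocation.
        cleanup_fn()
        return {"ok": False, "error": f"out of memory: {e}"}
```

The server would map the `ok=False` result to an error response for that request while the process keeps serving others.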
5) Documentation
SERVER.md updated with:
- How `--max-active-memory-bytes`, `--max-active-kv-bytes`, and `--max-kv-size` interact

Relationship to other in-flight / recent PRs