feat: add --kv-bits CLI args for server KV cache quantization #934
Open
lichengzhe wants to merge 1 commit into ml-explore:main from
Conversation
Expose kv_bits, kv_group_size, and quantized_kv_start as server CLI arguments, passing them through to stream_generate. This lets users trade a slight quality loss for significant memory savings when serving long-context requests. These parameters already exist in generate_step() but were not accessible from the server CLI.
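A rough sketch of the pass-through (simplified for illustration, not the actual server source; the helper name _serve_single_sketch is hypothetical, while stream_generate forwards these keyword arguments on to generate_step):

```python
from mlx_lm import stream_generate

def _serve_single_sketch(model, tokenizer, prompt, args, max_tokens=512):
    # Simplified sketch: forward the new CLI flags to stream_generate,
    # which hands them down to generate_step.
    for response in stream_generate(
        model,
        tokenizer,
        prompt,
        max_tokens=max_tokens,
        kv_bits=args.kv_bits,                        # 4 or 8; None keeps the cache unquantized
        kv_group_size=args.kv_group_size,            # quantization group size (default 64)
        quantized_kv_start=args.quantized_kv_start,  # generation step at which quantization begins
    ):
        yield response
```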
Summary
Expose KV cache quantization parameters as server CLI arguments: --kv-bits, --kv-group-size, and --quantized-kv-start.

Motivation
generate_step() already supports KV cache quantization via the kv_bits, kv_group_size, and quantized_kv_start parameters, and the mlx_lm.generate CLI exposes them. However, mlx_lm.server does not, so users serving models via the OpenAI-compatible API have no way to enable KV cache quantization.

This is particularly useful in long-context serving scenarios where KV cache memory is the bottleneck. For example, serving a 35B MoE model with a 262K context on a 128GB Mac can easily exhaust memory under multiple concurrent long-context requests. KV cache quantization (4-bit or 8-bit) significantly reduces the memory per cache entry, as the rough estimate below illustrates.
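As a rough illustration of why the cache dominates memory at long context (all model dimensions below are hypothetical, not those of the model above; MLX quantization with group size 64 stores a 16-bit scale and bias per group, so 4-bit costs roughly 4.5 bits per element):

```python
# Back-of-the-envelope KV cache size for a single 262K-context request.
n_layers, n_kv_heads, head_dim, ctx_len = 48, 8, 128, 262_144  # assumed dims

elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # keys + values
fp16_gb = elems * 16 / 8 / 1e9                          # unquantized 16-bit cache
q4_gb = elems * (4 + 2 * 16 / 64) / 8 / 1e9             # 4-bit plus per-group scale/bias

print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")  # ~51.5 GB vs ~14.5 GB
```

At those sizes a handful of concurrent unquantized requests would exhaust a 128GB machine, while 4-bit caches leave substantial headroom.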
Changes
- Add --kv-bits (choices: 4, 8), --kv-group-size (default: 64), and --quantized-kv-start (default: 0) to the server argument parser
- Pass them through to stream_generate() in _serve_single

Usage
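A hypothetical invocation (the model path is a placeholder; with the defaults above, setting --kv-bits alone is enough to enable quantization):

```shell
mlx_lm.server --model <model-path> \
  --kv-bits 4 --kv-group-size 64 --quantized-kv-start 0
```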
Benchmark
Tested with Qwen3.5-35B-A3B (mxfp4) on a Mac Mini M4 Pro (64GB):
Speed is comparable, with the slight overhead expected from quantization. The main benefit is memory savings for long-context workloads.