feat: add --kv-bits CLI args for server KV cache quantization #934

Open
lichengzhe wants to merge 1 commit into ml-explore:main from lichengzhe:feat/server-kv-cache-quantization

Summary

Expose KV cache quantization parameters as server CLI arguments: --kv-bits, --kv-group-size, and --quantized-kv-start.

Motivation

generate_step() already supports KV cache quantization via kv_bits, kv_group_size, and quantized_kv_start parameters, and the CLI mlx_lm.generate exposes them. However, mlx_lm.server does not — users serving models via the OpenAI-compatible API have no way to enable KV cache quantization.

This is particularly useful for long-context serving scenarios where KV cache memory is the bottleneck. For example, serving a 35B MoE model with 262K context on a 128GB Mac can easily exhaust memory with multiple concurrent long-context requests. KV cache quantization (4-bit or 8-bit) significantly reduces memory per cache entry.
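To make the memory argument concrete, here is a rough back-of-the-envelope estimate. The model dimensions below (40 attention layers, 8 KV heads, head dimension 128) are hypothetical and are not the dimensions of any model named in this PR. The sketch also assumes that each quantization group stores one 16-bit scale and one 16-bit bias alongside the packed values, so treat the result as an approximation rather than a measurement.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens,
                   bits=16, group_size=None):
    """Approximate KV cache size in bytes for a single sequence."""
    # Two tensors (keys and values) are cached per layer.
    elements = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    bits_per_element = float(bits)
    if group_size is not None:
        # Assumption: one 16-bit scale and one 16-bit bias per group.
        bits_per_element += 2 * 16 / group_size
    return elements * bits_per_element / 8


# Hypothetical dimensions, chosen only for illustration.
dims = dict(num_layers=40, num_kv_heads=8, head_dim=128, num_tokens=262_144)

fp16 = kv_cache_bytes(**dims)
q4 = kv_cache_bytes(**dims, bits=4, group_size=64)

print(f"fp16 cache:  {fp16 / 2**30:.2f} GiB")  # 40.00 GiB
print(f"4-bit cache: {q4 / 2**30:.2f} GiB")    # 11.25 GiB
```

Under these assumptions, a 4-bit cache with group size 64 costs about 4.5 bits per element, or roughly 28% of the 16-bit footprint.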

Changes

  • Added --kv-bits (choices: 4, 8), --kv-group-size (default: 64), --quantized-kv-start (default: 0) to the server argument parser
  • Pass these through to stream_generate() in _serve_single
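The two changes above can be sketched in isolation as follows. This is not the code from the diff: the helper names build_parser and kv_kwargs are hypothetical, the help strings are illustrative, and a default of None for --kv-bits (meaning quantization is off) is an assumption, since the description only lists the allowed choices.

```python
import argparse


def build_parser():
    """Parser exposing only the three KV cache flags described above."""
    parser = argparse.ArgumentParser(description="KV cache flag sketch")
    parser.add_argument(
        "--kv-bits", type=int, choices=[4, 8], default=None,
        help="Bits for KV cache quantization (assumed default: disabled).",
    )
    parser.add_argument(
        "--kv-group-size", type=int, default=64,
        help="Group size for KV cache quantization.",
    )
    parser.add_argument(
        "--quantized-kv-start", type=int, default=0,
        help="Step at which KV cache quantization begins.",
    )
    return parser


def kv_kwargs(args):
    """Collect the keyword arguments that would be forwarded to generation."""
    return {
        "kv_bits": args.kv_bits,
        "kv_group_size": args.kv_group_size,
        "quantized_kv_start": args.quantized_kv_start,
    }


args = build_parser().parse_args(["--kv-bits", "4"])
print(kv_kwargs(args))
```

Keeping the forwarding in one small helper makes it easy to check that the CLI names map onto the kv_bits, kv_group_size, and quantized_kv_start parameters without starting a server.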

Usage

mlx_lm.server --model my-model --kv-bits 4
mlx_lm.server --model my-model --kv-bits 8 --kv-group-size 128

Benchmark

Tested with Qwen3.5-35B-A3B (mxfp4) on Mac Mini M4 Pro 64GB:

| Config                     | Aggregate t/s | Per-request avg t/s |
|----------------------------|---------------|---------------------|
| Baseline (no quantization) | 84.5          | 69.3                |
| KV 4-bit                   | 79.0          | 66.2                |

Speed is comparable (slight overhead expected). The main benefit is memory savings for long-context workloads.
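For reference, the size of that overhead can be worked out directly from the reported numbers. The helper name relative_slowdown is illustrative; the inputs are the tokens-per-second figures from the benchmark above.

```python
def relative_slowdown(baseline_tps, quantized_tps):
    """Fraction of throughput lost relative to the baseline."""
    return (baseline_tps - quantized_tps) / baseline_tps


aggregate = relative_slowdown(84.5, 79.0)
per_request = relative_slowdown(69.3, 66.2)

print(f"aggregate:   {aggregate:.1%}")    # 6.5%
print(f"per request: {per_request:.1%}")  # 4.5%
```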

Expose kv_bits, kv_group_size, and quantized_kv_start as server CLI
arguments, passing them through to stream_generate. This allows users
to trade off slight quality loss for significant memory savings when
serving long-context requests.

These parameters already exist in generate_step() but were not
accessible from the server CLI.