
Add --q-override for per-layer quantization #922

Open

spicyneuron wants to merge 13 commits into ml-explore:main from spicyneuron:override

Conversation

@spicyneuron
Contributor

Adds a repeatable --q-override PATTERN=VALUE flag to override quantization on a per-layer basis. Patterns are regexes matched against module paths, where the first match wins.

This is similar to (and inspired by) llama-quantize. I was applying this via a custom script, but felt it might be a helpful option to upstream.
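
Roughly, the matching behaves like this (a minimal sketch to illustrate the first-match-wins rule; the helper name matches one mentioned below, but the body is illustrative, not this PR's actual code):

```python
import re

# Illustrative sketch of first-match-wins override resolution.
# `overrides` is a list of (pattern, value) pairs in CLI order.
def build_override_predicate(overrides):
    compiled = [(re.compile(pattern), value) for pattern, value in overrides]

    def lookup(module_path):
        for pattern, value in compiled:
            if pattern.search(module_path):  # first matching pattern wins
                return value
        return None  # no override: fall back to the global --q-bits/--q-mode

    return lookup

# e.g. with the flags in the example below,
# lookup("model.layers.3.self_attn.k_proj") -> "mxfp4"
```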

Values can be:

  • a bit width (e.g. 8)
  • a quant mode (mxfp4, nvfp4, mxfp8)
  • a float dtype (float16, bfloat16, float32)
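
A sketch of how those three value kinds might be told apart when parsing PATTERN=VALUE (hypothetical; the parse_overrides helper added in this PR presumably does something along these lines):

```python
QUANT_MODES = {"mxfp4", "nvfp4", "mxfp8"}
FLOAT_DTYPES = {"float16", "bfloat16", "float32"}

# Hypothetical value parser covering the three accepted kinds.
def parse_override_value(raw: str):
    if raw.isdigit():
        return int(raw)   # bit width, e.g. "8" -> 8
    if raw in QUANT_MODES or raw in FLOAT_DTYPES:
        return raw        # quant mode or float dtype, kept as a string
    raise ValueError(f"unrecognized override value: {raw!r}")
```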

Example:

mlx_lm.convert \
  --quantize \
  --q-bits 4 \
  --override "lm_head=6" \
  --override "embed_tokens=4" \
  --override "layers\.(0|1|2|4|5|6|8|9|10|12|13|14|16|17|18|20|21|22|24|25|26|28|29|30|32|33|34|36|37|38|40|41|42|44)\.self_attn\.q_proj=mxfp4" \
  --override "layers\.(45|46)\.self_attn\.q_proj=5" \
  --override "layers\.(0|1|2|4|5|6|8|9|10|12|13|14|16|17|18|20|21|22|24|25|26|28|29|30|32|33|34|36|37|38|40|41|42|44)\.linear_attn\.in_proj_qkvz=mxfp4" \
  --override "layers\.(45|46)\.linear_attn\.in_proj_qkvz=5" \
  --override "layers\.(0|2|3|4|5|7|8|9|11|12|13|15|16|17|18|19|20|21|23|24|25|26|27|28|29|30|32|33|34|35|36|37|38|40|41|44)\.mlp\.switch_mlp\.down_proj=mxfp4" \
  --override "layers\.(1|6|10|14|22|31|39|42|43|45|46|47)\.mlp\.switch_mlp\.down_proj=6" \
  --override "layers\.(0|2|3|4|5|7|8|9|11|12|13|15|16|17|18|19|21|23|24|25|26|27|28|29|30|33|34|35|36|37|38|40|41|44)\.mlp\.shared_experts\.down_proj=6" \
  --override "layers\.(1|6|10|14|20|22|31|32|39|42|43|45|46|47)\.mlp\.shared_experts\.down_proj=8" \
  --override "layers\.(\d+)\.mlp\.switch_mlp\.gate_proj=mxfp4" \
  --override "layers\.(\d+)\.mlp\.gate=float32" \
  --override "layers\.(\d+)\.mlp\.shared_experts\.gate_proj=mxfp4" \
  --override "layers\.(\d+)\.mlp\.switch_mlp\.up_proj=mxfp4" \
  --override "layers\.(\d+)\.mlp\.shared_experts\.up_proj=mxfp4" \
  --override "layers\.(\d+)\.linear_attn\.in_proj_ba=4" \
  --override "layers\.(\d+)\.linear_attn\.conv1d=float32" \
  --override "layers\.(0|1|2|4|5|6|8|9|10|13|14|18|20|21|22|24|25|26|28|29|33|36|37|38|40|41|42)\.linear_attn\.out_proj=5" \
  --override "layers\.(12|16|17|30|32|34|44)\.linear_attn\.out_proj=6" \
  --override "layers\.(45|46)\.linear_attn\.out_proj=8" \
  --override "layers\.(3|7|11|15|19|23|27|31|35|39)\.self_attn\.k_proj=mxfp4" \
  --override "layers\.(43|47)\.self_attn\.k_proj=5" \
  --override "layers\.(3|7|11|15|19|23|27|31|35|39)\.self_attn\.o_proj=mxfp4" \
  --override "layers\.(43|47)\.self_attn\.o_proj=5" \
  --override "layers\.(3|7|11|15|19|23|27|31|35|39)\.self_attn\.q_proj=mxfp4" \
  --override "layers\.(43|47)\.self_attn\.q_proj=5" \
  --override "layers\.(3|7|11|15|19|23|27|31|35|39)\.self_attn\.v_proj=mxfp4" \
  --override "layers\.(43|47)\.self_attn\.v_proj=5" \
  --hf-path Qwen/Qwen3-Coder-Next \
  --mlx-path ....

Also fixes a subtle issue where convert() hardcoded q_group_size=64 and q_bits=4 as signature defaults. The CLI was fine (argparse passes None), but calling convert() from Python with q_mode="mxfp4" would silently use the wrong group size. Now both paths go through the same mode-aware default logic.
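
For illustration, the failure mode and the fix look roughly like this (a sketch under the assumption that mxfp4 wants a 32-element group; the real per-mode defaults live in mlx_lm):

```python
# Before: defaults baked into the signature.
#   def convert(..., q_group_size=64, q_bits=4, q_mode="affine"): ...
# so convert(q_mode="mxfp4") from Python silently kept q_group_size=64.

# After (sketch): accept None and resolve per mode, same as the CLI path.
def resolve_quant_defaults(q_group_size=None, q_bits=None, q_mode="affine"):
    mode_defaults = {
        "affine": (64, 4),
        "mxfp4": (32, 4),  # assumption: mxfp4 packs 32-element groups
    }
    group, bits = mode_defaults.get(q_mode, (64, 4))
    return (q_group_size if q_group_size is not None else group,
            q_bits if q_bits is not None else bits)
```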

Adds unit tests for the three new helpers (parse_overrides, build_override_predicate, apply_float_overrides).
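
The tests might be shaped something like this (illustrative only; the import path and helper signatures are assumptions based on the names above, not the PR's actual tests):

```python
import unittest

# Assumption: the new helpers are importable from mlx_lm's convert module.
from mlx_lm.convert import parse_overrides, build_override_predicate

class TestQuantOverrides(unittest.TestCase):
    def test_first_match_wins(self):
        # parse_overrides takes the raw PATTERN=VALUE strings from the CLI;
        # build_override_predicate returns a module-path -> value lookup.
        overrides = parse_overrides(["lm_head=6", r"layers\.\d+=mxfp4"])
        lookup = build_override_predicate(overrides)
        self.assertEqual(lookup("lm_head"), 6)
        self.assertEqual(lookup("model.layers.3.mlp.up_proj"), "mxfp4")
        self.assertIsNone(lookup("model.norm"))
```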

@ivanfioravanti
Contributor

This could be interesting for creating custom quantizations 🤔

@spicyneuron
Contributor Author

@ivanfioravanti Exactly! Nothing groundbreaking in the example above, but it gives a small perplexity improvement at a small cost in speed: https://huggingface.co/spicyneuron/Qwen3-Next-Coder-MLX-mixed-4.5-bit

I saw a bigger difference in more aggressive quants, where uniform 3-bit or mixed_2_4 isn't stable, but a mixed quant is: https://huggingface.co/spicyneuron/Kimi-K2.5-MLX-mixed-2.8-bit

@spicyneuron spicyneuron marked this pull request as draft February 27, 2026 08:33
@spicyneuron
Contributor Author

Discovered some edge cases while inspecting quant results. Moved to draft until I'm confident those are settled.

@spicyneuron spicyneuron marked this pull request as ready for review March 1, 2026 04:42
@spicyneuron
Contributor Author

spicyneuron commented Mar 1, 2026

The issue was that the convert() bug I mentioned above was only partially addressed here. I've extracted the fix into a separate PR so it's easier to review: #935
