
Add --q-override for per-layer quantization #922

Open

spicyneuron wants to merge 13 commits into ml-explore:main from spicyneuron:override

Conversation

@spicyneuron
Contributor

Adds a repeatable --q-override PATTERN=VALUE flag to override quantization on a per-layer basis. Patterns are regexes matched against module paths, where the first match wins.

This is similar to (and inspired by) llama-quantize. I was applying this via a custom script, but felt it might be a helpful option to upstream.
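
Roughly, the matching behaves like this (a minimal sketch to illustrate the first-match-wins rule; the helper name matches one mentioned below, but the body is illustrative, not this PR's actual code):

```python
import re

# Illustrative sketch of first-match-wins override resolution.
# `overrides` is a list of (pattern, value) pairs in CLI order.
def build_override_predicate(overrides):
    compiled = [(re.compile(pattern), value) for pattern, value in overrides]

    def lookup(module_path):
        for pattern, value in compiled:
            if pattern.search(module_path):  # first matching pattern wins
                return value
        return None  # no override: fall back to the global --q-bits/--q-mode

    return lookup

# e.g. with the flags in the example below,
# lookup("model.layers.3.self_attn.k_proj") -> "mxfp4"
```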

Values can be:

  • a bit width (e.g. 8)
  • a quant mode (mxfp4, nvfp4, mxfp8)
  • a float dtype (float16, bfloat16, float32)
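
A sketch of how those three value kinds might be told apart when parsing PATTERN=VALUE (hypothetical; the parse_overrides helper added in this PR presumably does something along these lines):

```python
QUANT_MODES = {"mxfp4", "nvfp4", "mxfp8"}
FLOAT_DTYPES = {"float16", "bfloat16", "float32"}

# Hypothetical value parser covering the three accepted kinds.
def parse_override_value(raw: str):
    if raw.isdigit():
        return int(raw)   # bit width, e.g. "8" -> 8
    if raw in QUANT_MODES or raw in FLOAT_DTYPES:
        return raw        # quant mode or float dtype, kept as a string
    raise ValueError(f"unrecognized override value: {raw!r}")
```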

Example:

mlx_lm.convert \
  --quantize \
  --q-bits 4 \
  --override "lm_head=6" \
  --override "embed_tokens=4" \
  --override "layers\.(0|1|2|4|5|6|8|9|10|12|13|14|16|17|18|20|21|22|24|25|26|28|29|30|32|33|34|36|37|38|40|41|42|44)\.self_attn\.q_proj=mxfp4" \
  --override "layers\.(45|46)\.self_attn\.q_proj=5" \
  --override "layers\.(0|1|2|4|5|6|8|9|10|12|13|14|16|17|18|20|21|22|24|25|26|28|29|30|32|33|34|36|37|38|40|41|42|44)\.linear_attn\.in_proj_qkvz=mxfp4" \
  --override "layers\.(45|46)\.linear_attn\.in_proj_qkvz=5" \
  --override "layers\.(0|2|3|4|5|7|8|9|11|12|13|15|16|17|18|19|20|21|23|24|25|26|27|28|29|30|32|33|34|35|36|37|38|40|41|44)\.mlp\.switch_mlp\.down_proj=mxfp4" \
  --override "layers\.(1|6|10|14|22|31|39|42|43|45|46|47)\.mlp\.switch_mlp\.down_proj=6" \
  --override "layers\.(0|2|3|4|5|7|8|9|11|12|13|15|16|17|18|19|21|23|24|25|26|27|28|29|30|33|34|35|36|37|38|40|41|44)\.mlp\.shared_experts\.down_proj=6" \
  --override "layers\.(1|6|10|14|20|22|31|32|39|42|43|45|46|47)\.mlp\.shared_experts\.down_proj=8" \
  --override "layers\.(\d+)\.mlp\.switch_mlp\.gate_proj=mxfp4" \
  --override "layers\.(\d+)\.mlp\.gate=float32" \
  --override "layers\.(\d+)\.mlp\.shared_experts\.gate_proj=mxfp4" \
  --override "layers\.(\d+)\.mlp\.switch_mlp\.up_proj=mxfp4" \
  --override "layers\.(\d+)\.mlp\.shared_experts\.up_proj=mxfp4" \
  --override "layers\.(\d+)\.linear_attn\.in_proj_ba=4" \
  --override "layers\.(\d+)\.linear_attn\.conv1d=float32" \
  --override "layers\.(0|1|2|4|5|6|8|9|10|13|14|18|20|21|22|24|25|26|28|29|33|36|37|38|40|41|42)\.linear_attn\.out_proj=5" \
  --override "layers\.(12|16|17|30|32|34|44)\.linear_attn\.out_proj=6" \
  --override "layers\.(45|46)\.linear_attn\.out_proj=8" \
  --override "layers\.(3|7|11|15|19|23|27|31|35|39)\.self_attn\.k_proj=mxfp4" \
  --override "layers\.(43|47)\.self_attn\.k_proj=5" \
  --override "layers\.(3|7|11|15|19|23|27|31|35|39)\.self_attn\.o_proj=mxfp4" \
  --override "layers\.(43|47)\.self_attn\.o_proj=5" \
  --override "layers\.(3|7|11|15|19|23|27|31|35|39)\.self_attn\.q_proj=mxfp4" \
  --override "layers\.(43|47)\.self_attn\.q_proj=5" \
  --override "layers\.(3|7|11|15|19|23|27|31|35|39)\.self_attn\.v_proj=mxfp4" \
  --override "layers\.(43|47)\.self_attn\.v_proj=5" \
  --hf-path Qwen/Qwen3-Coder-Next \
  --mlx-path ....

Also fixes a subtle issue where convert() hardcoded q_group_size=64 and q_bits=4 as signature defaults. The CLI was fine (argparse passes None), but calling convert() from Python with q_mode="mxfp4" would silently use the wrong group size. Now both paths go through the same mode-aware default logic.
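
For illustration, the failure mode and the fix look roughly like this (a sketch under the assumption that mxfp4 wants a 32-element group; the real per-mode defaults live in mlx_lm):

```python
# Before: defaults baked into the signature.
#   def convert(..., q_group_size=64, q_bits=4, q_mode="affine"): ...
# so convert(q_mode="mxfp4") from Python silently kept q_group_size=64.

# After (sketch): accept None and resolve per mode, same as the CLI path.
def resolve_quant_defaults(q_group_size=None, q_bits=None, q_mode="affine"):
    mode_defaults = {
        "affine": (64, 4),
        "mxfp4": (32, 4),  # assumption: mxfp4 packs 32-element groups
    }
    group, bits = mode_defaults.get(q_mode, (64, 4))
    return (q_group_size if q_group_size is not None else group,
            q_bits if q_bits is not None else bits)
```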

Adds unit tests for the three new helpers (parse_overrides, build_override_predicate, apply_float_overrides).
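
The tests might be shaped something like this (illustrative only; the import path and helper signatures are assumptions based on the names above, not the PR's actual tests):

```python
import unittest

# Assumption: the new helpers are importable from mlx_lm's convert module.
from mlx_lm.convert import parse_overrides, build_override_predicate

class TestQuantOverrides(unittest.TestCase):
    def test_first_match_wins(self):
        # parse_overrides takes the raw PATTERN=VALUE strings from the CLI;
        # build_override_predicate returns a module-path -> value lookup.
        overrides = parse_overrides(["lm_head=6", r"layers\.\d+=mxfp4"])
        lookup = build_override_predicate(overrides)
        self.assertEqual(lookup("lm_head"), 6)
        self.assertEqual(lookup("model.layers.3.mlp.up_proj"), "mxfp4")
        self.assertIsNone(lookup("model.norm"))
```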

@ivanfioravanti
Contributor

This could be interesting for creating custom quantizations 🤔

@spicyneuron
Contributor Author

@ivanfioravanti Exactly! Nothing groundbreaking in the example above, but it gives a small perplexity improvement at a small cost in speed: https://huggingface.co/spicyneuron/Qwen3-Next-Coder-MLX-mixed-4.5-bit

I saw a bigger difference in more aggressive quants, where uniform 3-bit or mixed_2_4 isn't stable, but a mixed quant is: https://huggingface.co/spicyneuron/Kimi-K2.5-MLX-mixed-2.8-bit

@spicyneuron spicyneuron marked this pull request as draft February 27, 2026 08:33
@spicyneuron
Contributor Author

Discovered some edge cases while inspecting quant results. Moved to draft until I'm confident those are settled.

@spicyneuron spicyneuron marked this pull request as ready for review March 1, 2026 04:42
@spicyneuron
Contributor Author

spicyneuron commented Mar 1, 2026

The issue was that the convert() bug I mentioned above was only partially addressed here. I've extracted the fix into a separate PR so it's easier to review: #935
