
fix(utils): skip already-quantized layers in load_model._quantize predicate #1216

Open
adurham wants to merge 1 commit into ml-explore:main from adurham:mlx-lm-quantize-skip-already-quantized

Conversation

Contributor

adurham commented Apr 27, 2026

Summary

Models that pre-quantize specific layers in their __init__ (for example DeepSeek V4's DeepseekV4MoE calling SwitchLinear.to_quantized(..., mode="mxfp4") on its expert projections so the experts have a non-default quantization mode) trip load_model._quantize's walker if those same layer paths also appear in config["quantization"] as per-layer overrides. The predicate returns the override dict for the path, so nn.quantize tries to re-quantize the already-Quantized* module and raises:

ValueError: Unable to quantize model of type
<class 'mlx_lm.models.switch_layers.QuantizedSwitchLinear'>

The existing not hasattr(m, \"to_quantized\") clause already encodes the "module is already quantized, skip" intent — moving it ahead of the per-layer-override check makes that intent take effect even when the override map covers a pre-quantized path. End state for non-pre-quantized layers is unchanged.
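
To make the new decision order concrete, here is a minimal, self-contained sketch of the predicate exercised against stand-in modules; DummyQuantized, DummyLinear, and the config/weights dicts below are hypothetical illustrations, not mlx-lm code:

# Stand-in modules (hypothetical, for illustration only): a pre-quantized module
# has no `to_quantized`, a regular one does.
class DummyQuantized:
    pass

class DummyLinear:
    def to_quantized(self, **kwargs):
        return self

config = {"quantization": {"ffn.gate_proj": {"mode": "mxfp4", "bits": 4, "group_size": 32}}}
weights = {"mlp.up_proj.scales": None}

def class_predicate(p, m):
    # Skip layers already quantized at construction time...
    if not hasattr(m, "to_quantized"):
        return False
    # Handle custom per layer quantizations
    if p in config["quantization"]:
        return config["quantization"][p]
    return f"{p}.scales" in weights

print(class_predicate("ffn.gate_proj", DummyQuantized()))  # False: pre-quantized path is skipped despite the override
print(class_predicate("ffn.gate_proj", DummyLinear()))     # override dict: per-layer settings still apply to regular modules
print(class_predicate("mlp.up_proj", DummyLinear()))       # True: quantize with the global settings

Before the change, the first call would have returned the override dict, which is what sends nn.quantize after the already-quantized module.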

Repro

from mlx_lm import load

model, tok = load("mlx-community/DeepSeek-V4-Flash-6bit")
# ValueError: Unable to quantize model of type <class 'mlx_lm.models.switch_layers.QuantizedSwitchLinear'>

The 6bit checkpoint declares model.layers.<i>.ffn.switch_mlp.{gate,up,down}_proj overrides at mode="mxfp4", bits=4, group_size=32, matching what the model code pre-applies, and that is what triggers the re-quantize attempt.
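
For reference, such per-layer overrides live in the checkpoint config's quantization section keyed by module path, which is what the p in config["quantization"] lookup matches against. A hypothetical excerpt with illustrative values (not copied from the actual checkpoint):

config = {
    "quantization": {
        "group_size": 64,  # global default (illustrative)
        "bits": 6,         # global default (illustrative)
        # per-layer override duplicating what the model pre-applies in __init__:
        "model.layers.0.ffn.switch_mlp.gate_proj": {
            "mode": "mxfp4",
            "bits": 4,
            "group_size": 32,
        },
    }
}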

Diff

A two-line clause moved ahead of the override check, plus a one-line comment:

 def class_predicate(p, m):
+    # Skip layers already quantized at construction time...
+    if not hasattr(m, "to_quantized"):
+        return False
     # Handle custom per layer quantizations
     if p in config["quantization"]:
         return config["quantization"][p]
-    if not hasattr(m, "to_quantized"):
-        return False
     return f"{p}.scales" in weights
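
For context, this predicate is what load_model._quantize hands to mlx.nn.quantize. A simplified, self-contained sketch of that call site follows; the real mlx-lm code does more, and the surrounding variable names here are assumptions:

import mlx.nn as nn

def _quantize(model, config, weights):
    # Simplified sketch, not the full mlx-lm implementation.
    quantization = config["quantization"]

    def class_predicate(p, m):
        # Skip layers already quantized at construction time...
        if not hasattr(m, "to_quantized"):
            return False
        # Handle custom per layer quantizations
        if p in config["quantization"]:
            return config["quantization"][p]
        return f"{p}.scales" in weights

    # nn.quantize walks the model, calls class_predicate(path, module) for each
    # leaf module, skips it on False, and otherwise quantizes it (a returned
    # dict overrides the global parameters for that layer).
    nn.quantize(
        model,
        group_size=quantization["group_size"],
        bits=quantization["bits"],
        class_predicate=class_predicate,
    )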

🤖 Generated with Claude Code
