
Add dense qwen3_5 support for learned quantization#1200

Open
iamwavecut wants to merge 1 commit into ml-explore:main from iamwavecut:research/qwen3-5-quant-lab

Conversation

@iamwavecut

Summary

This PR adds learned quantization support for dense qwen3_5 text models and hardens DWQ failure handling.

Changes:

  • Enable AWQ traversal for dense qwen3_5 hybrid layers:
    • full attention blocks with self_attn.{q,k,v,o}_proj
    • linear-attention blocks with linear_attn.in_proj_*
    • dense MLP projections
  • Avoid the qwen3_5 inference-only gated-delta CustomKernel during dynamic quantization sensitivity estimation.
  • Add dynamic_quant calibration controls for small local runs:
    • --num-samples
    • --sequence-length
  • Add DWQ KL loss implementation selection:
    • --kl-loss-impl auto|metal|mlx
  • Add DWQ target metadata validation before training.
  • Add DWQ shape and finite-value checks so NaN/Inf losses or weights fail before saving a corrupted checkpoint.
  • Add unit tests and an opt-in slow qwen3_5 quantization smoke test.
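The shape and finite-value guards can be pictured with a minimal sketch. This is illustrative only: names like `validate_finite` are assumptions, and the actual DWQ guard in this PR operates on MLX loss and weight arrays, not Python lists.

```python
import math

def validate_finite(name, values):
    """Raise before checkpoint save if any value is NaN or Inf.

    Hypothetical helper: the real guard checks MLX arrays (losses and
    weights) so an unstable run fails before writing a corrupted
    checkpoint, rather than after.
    """
    bad = [v for v in values if not math.isfinite(v)]
    if bad:
        raise ValueError(f"non-finite values in {name}: {bad}")

validate_finite("loss_history", [0.52, 0.31, 0.28])  # ok, returns None
try:
    validate_finite("loss_history", [0.52, float("nan")])
except ValueError as exc:
    print(f"refusing to save checkpoint: {exc}")
```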

This intentionally focuses on dense qwen3_5 models. MoE support is left out of scope.
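For reference, the new calibration controls could be wired up roughly as below. The defaults and help strings here are assumptions, not the values used in mlx_lm's dynamic_quant CLI.

```python
import argparse

# Hypothetical wiring of the calibration flags named above; defaults
# and help text are illustrative, not taken from the PR.
parser = argparse.ArgumentParser(prog="dynamic_quant")
parser.add_argument("--num-samples", type=int, default=32,
                    help="number of calibration samples to draw")
parser.add_argument("--sequence-length", type=int, default=512,
                    help="token length of each calibration sample")

# A small local run overrides both flags:
args = parser.parse_args(["--num-samples", "8", "--sequence-length", "256"])
print(args.num_samples, args.sequence_length)  # 8 256
```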

Motivation

Plain conversion and GPTQ already work for qwen3_5 models, but other learned quantization paths had gaps:

  • AWQ rejected qwen3_5 as unsupported.
  • Dynamic quantization could hit [Primitive::vjp] Not implemented for CustomKernel on qwen3_5 linear-attention paths.
  • DWQ could fail on Apple Silicon devices where the Metal KL-loss kernel configuration is not usable, and unstable runs could continue far enough to save invalid checkpoints.
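The --kl-loss-impl fallback described above can be sketched as follows; `select_kl_loss_impl` and `metal_available` are hypothetical names, not the PR's actual API.

```python
def select_kl_loss_impl(requested: str, metal_available: bool) -> str:
    """Resolve --kl-loss-impl auto|metal|mlx.

    Hypothetical sketch: 'auto' prefers the Metal kernel when it is
    usable and otherwise falls back to the pure-MLX implementation;
    an explicit 'metal' request fails fast on unsupported devices.
    """
    if requested == "auto":
        return "metal" if metal_available else "mlx"
    if requested == "metal" and not metal_available:
        raise RuntimeError("Metal KL-loss kernel is not usable on this device")
    return requested

print(select_kl_loss_impl("auto", metal_available=False))  # mlx
```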

Testing

uv run python -m unittest tests.test_losses tests.test_quantization -v
uv run python -m unittest discover tests -v
uv run pre-commit run --files \
  mlx_lm/LEARNED_QUANTS.md \
  mlx_lm/quant/awq.py \
  mlx_lm/quant/dwq.py \
  mlx_lm/quant/dynamic_quant.py \
  mlx_lm/tuner/losses.py \
  tests/test_losses.py \
  tests/test_quantization.py \
  tests/test_qwen3_5_quantization_slow.py

Opt-in slow smoke test on Qwen/Qwen3.5-0.8B:

RUN_SLOW_QWEN35_QUANT_TESTS=1 \
QWEN35_QUANT_TEST_MODEL=/path/to/Qwen3.5-0.8B \
uv run python -m unittest tests.test_qwen3_5_quantization_slow -v
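The opt-in gating can follow the standard unittest skip pattern. Only the environment-variable names below come from the PR; the test class and body are illustrative stand-ins for the real smoke test.

```python
import os
import unittest

# Gate on the opt-in variable; unset or != "1" means the suite is skipped.
RUN_SLOW = os.environ.get("RUN_SLOW_QWEN35_QUANT_TESTS") == "1"

@unittest.skipUnless(RUN_SLOW, "set RUN_SLOW_QWEN35_QUANT_TESTS=1 to run")
class TestQwen35QuantizationSlow(unittest.TestCase):
    def test_quantization_smoke(self):
        # Illustrative body: the real test quantizes the model found at
        # QWEN35_QUANT_TEST_MODEL and checks the saved artifacts load.
        model_path = os.environ.get("QWEN35_QUANT_TEST_MODEL")
        self.assertIsNotNone(model_path, "QWEN35_QUANT_TEST_MODEL must be set")
```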

Local smoke checks also verified that AWQ, dynamic quantization, and DWQ start and save loadable artifacts for the downstream finetuned model LakoMoor/QClaw-4B. As an existing baseline, GPTQ was verified to reach Hessian collection and begin quantization.

Enable AWQ traversal for dense qwen3_5 hybrid blocks, including
linear-attention, full-attention, and dense MLP projections.

Avoid qwen3_5 inference-only CustomKernel VJP during dynamic quantization
sensitivity estimation, and expose calibration controls for smaller
local smoke runs.

Harden DWQ with KL loss fallback selection, target metadata validation,
shape checks, and non-finite loss/weight guards so corrupted checkpoints
fail before save.

Add unit coverage and opt-in slow qwen3_5 quantization smoke tests.
