
Add dense qwen3_5 support for learned quantization#1200

Open
iamwavecut wants to merge 1 commit into ml-explore:main from iamwavecut:research/qwen3-5-quant-lab

Conversation

@iamwavecut

Summary

This PR adds learned quantization support for dense qwen3_5 text models and hardens DWQ failure handling.

Changes:

  • Enable AWQ traversal for dense qwen3_5 hybrid layers:
    • full attention blocks with self_attn.{q,k,v,o}_proj
    • linear-attention blocks with linear_attn.in_proj_*
    • dense MLP projections
  • Avoid the qwen3_5 inference-only gated-delta CustomKernel during dynamic quantization sensitivity estimation.
  • Add dynamic_quant calibration controls for small local runs:
    • --num-samples
    • --sequence-length
  • Add DWQ KL loss implementation selection:
    • --kl-loss-impl auto|metal|mlx
  • Add DWQ target metadata validation before training.
  • Add DWQ shape and finite-value checks so NaN/Inf losses or weights fail before saving a corrupted checkpoint.
  • Add unit tests and an opt-in slow qwen3_5 quantization smoke test.
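The shape and finite-value guards can be pictured with a minimal sketch. This is illustrative only: names like `validate_finite` are assumptions, and the actual DWQ guard in this PR operates on MLX loss and weight arrays, not Python lists.

```python
import math

def validate_finite(name, values):
    """Raise before checkpoint save if any value is NaN or Inf.

    Hypothetical helper: the real guard checks MLX arrays (losses and
    weights) so an unstable run fails before writing a corrupted
    checkpoint, rather than after.
    """
    bad = [v for v in values if not math.isfinite(v)]
    if bad:
        raise ValueError(f"non-finite values in {name}: {bad}")

validate_finite("loss_history", [0.52, 0.31, 0.28])  # ok, returns None
try:
    validate_finite("loss_history", [0.52, float("nan")])
except ValueError as exc:
    print(f"refusing to save checkpoint: {exc}")
```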

This intentionally focuses on dense qwen3_5 models. MoE support is left out of scope.
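For reference, the new calibration controls could be wired up roughly as below. The defaults and help strings here are assumptions, not the values used in mlx_lm's dynamic_quant CLI.

```python
import argparse

# Hypothetical wiring of the calibration flags named above; defaults
# and help text are illustrative, not taken from the PR.
parser = argparse.ArgumentParser(prog="dynamic_quant")
parser.add_argument("--num-samples", type=int, default=32,
                    help="number of calibration samples to draw")
parser.add_argument("--sequence-length", type=int, default=512,
                    help="token length of each calibration sample")

# A small local run overrides both flags:
args = parser.parse_args(["--num-samples", "8", "--sequence-length", "256"])
print(args.num_samples, args.sequence_length)  # 8 256
```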

Motivation

Plain conversion and GPTQ already work for qwen3_5 models, but other learned quantization paths had gaps:

  • AWQ rejected qwen3_5 as unsupported.
  • Dynamic quantization could hit [Primitive::vjp] Not implemented for CustomKernel on qwen3_5 linear-attention paths.
  • DWQ could fail on Apple Silicon devices where the Metal KL-loss kernel configuration is not usable, and unstable runs could continue far enough to save invalid checkpoints.
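The --kl-loss-impl fallback described above can be sketched as follows; `select_kl_loss_impl` and `metal_available` are hypothetical names, not the PR's actual API.

```python
def select_kl_loss_impl(requested: str, metal_available: bool) -> str:
    """Resolve --kl-loss-impl auto|metal|mlx.

    Hypothetical sketch: 'auto' prefers the Metal kernel when it is
    usable and otherwise falls back to the pure-MLX implementation;
    an explicit 'metal' request fails fast on unsupported devices.
    """
    if requested == "auto":
        return "metal" if metal_available else "mlx"
    if requested == "metal" and not metal_available:
        raise RuntimeError("Metal KL-loss kernel is not usable on this device")
    return requested

print(select_kl_loss_impl("auto", metal_available=False))  # mlx
```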

Testing

uv run python -m unittest tests.test_losses tests.test_quantization -v
uv run python -m unittest discover tests -v
uv run pre-commit run --files \
  mlx_lm/LEARNED_QUANTS.md \
  mlx_lm/quant/awq.py \
  mlx_lm/quant/dwq.py \
  mlx_lm/quant/dynamic_quant.py \
  mlx_lm/tuner/losses.py \
  tests/test_losses.py \
  tests/test_quantization.py \
  tests/test_qwen3_5_quantization_slow.py

Opt-in slow smoke test on Qwen/Qwen3.5-0.8B:

RUN_SLOW_QWEN35_QUANT_TESTS=1 \
QWEN35_QUANT_TEST_MODEL=/path/to/Qwen3.5-0.8B \
uv run python -m unittest tests.test_qwen3_5_quantization_slow -v
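The opt-in gating can follow the standard unittest skip pattern. Only the environment-variable names below come from the PR; the test class and body are illustrative stand-ins for the real smoke test.

```python
import os
import unittest

# Gate on the opt-in variable; unset or != "1" means the suite is skipped.
RUN_SLOW = os.environ.get("RUN_SLOW_QWEN35_QUANT_TESTS") == "1"

@unittest.skipUnless(RUN_SLOW, "set RUN_SLOW_QWEN35_QUANT_TESTS=1 to run")
class TestQwen35QuantizationSlow(unittest.TestCase):
    def test_quantization_smoke(self):
        # Illustrative body: the real test quantizes the model found at
        # QWEN35_QUANT_TEST_MODEL and checks the saved artifacts load.
        model_path = os.environ.get("QWEN35_QUANT_TEST_MODEL")
        self.assertIsNotNone(model_path, "QWEN35_QUANT_TEST_MODEL must be set")
```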

Local smoke checks also verified that AWQ, dynamic quantization, and DWQ start and save loadable artifacts for the downstream finetuned model LakoMoor/QClaw-4B. As an existing baseline, GPTQ was verified to reach Hessian collection and begin quantization.

Enable AWQ traversal for dense qwen3_5 hybrid blocks, including
linear-attention, full-attention, and dense MLP projections.

Avoid qwen3_5 inference-only CustomKernel VJP during dynamic quantization
sensitivity estimation, and expose calibration controls for smaller
local smoke runs.

Harden DWQ with KL loss fallback selection, target metadata validation,
shape checks, and non-finite loss/weight guards so corrupted checkpoints
fail before save.

Add unit coverage and opt-in slow qwen3_5 quantization smoke tests.
