
Add DeepSeek-V4 (Flash) model support #1201

Draft
akashgoswami wants to merge 2 commits into ml-explore:main from akashgoswami:add-deepseek-v4

Conversation

@akashgoswami

Summary

Adds DeepseekV4ForCausalLM architecture support to mlx-lm. Verified end-to-end on mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit: generation produces coherent text.

Architecture pieces implemented (minimal, hedged sketches of the attention, mHC, routing, and expert-stacking pieces follow the list):

  • MLA attention with low-rank Q (wq_a → q_norm → wq_b), single shared K=V head, grouped low-rank output (wo_a × 8 + wo_b).
  • Compressor (compress_ratio ∈ {0, 4, 128}): learned gated pooling builds compressed KV rows during prefill, concatenated to the attention K/V stream. Overlapping windows for ratio==4 (matches reference).
  • Indexer module loaded for ratio==4 layers (weights load; topk gather is a follow-up — current path uses the full compressed pool).
  • Manifold-constrained Hyper-Connections (mHC): hc_pre Sinkhorn-projects mixing weights and reduces hc_mult parallel hidden states to 1; hc_post expands back via post · f_out + comb @ residual. Final-layer head uses the simpler sigmoid-weighted reduction.
  • Hash-routed MoE for the first num_hash_layers (tid2eid lookup); score-routed thereafter with sqrtsoftplus / sigmoid / softmax. Shared experts have no swiglu_limit (matches reference).
  • YaRN-scaled RoPE with compress_rope_theta on compress layers, vanilla rope_theta on non-compress layers. Inverse RoPE applied to output rope dims since K==V means V carries position into the attention output.
  • sanitize() stacks per-expert weights to SwitchLinear layout when raw HF naming is detected, and drops MTP weights (training-only).
  • cast_predicate keeps fp32 mHC params, attn_sink, and gate.bias unconverted.
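
For the MLA bullet, a minimal sketch of the low-rank Q path attending over a single shared K=V head, written against mlx.core / mlx.nn. The dimensions are placeholders and the grouped wo_a/wo_b output projection is collapsed into one linear for brevity; this is not the checkpoint's exact layout.

```python
# Hedged sketch: low-rank Q (wq_a -> q_norm -> wq_b) attending over a single
# shared K=V head. Dimensions are illustrative, not the real config values.
import mlx.core as mx
import mlx.nn as nn


class MLASketch(nn.Module):
    def __init__(self, dim=1024, q_lora_rank=256, num_heads=8, head_dim=128):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.wq_a = nn.Linear(dim, q_lora_rank, bias=False)    # low-rank down-projection
        self.q_norm = nn.RMSNorm(q_lora_rank)                  # norm between the two factors
        self.wq_b = nn.Linear(q_lora_rank, num_heads * head_dim, bias=False)
        self.wkv = nn.Linear(dim, head_dim, bias=False)        # one shared head, K == V
        self.wo = nn.Linear(num_heads * head_dim, dim, bias=False)  # stand-in for wo_a.* / wo_b

    def __call__(self, x, mask=None):
        B, L, _ = x.shape
        q = self.wq_b(self.q_norm(self.wq_a(x)))
        q = q.reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        kv = self.wkv(x).reshape(B, L, 1, self.head_dim).transpose(0, 2, 1, 3)
        # The single latent row serves as both keys and values.
        out = mx.fast.scaled_dot_product_attention(
            q, kv, kv, scale=self.head_dim**-0.5, mask=mask
        )
        return self.wo(out.transpose(0, 2, 1, 3).reshape(B, L, -1))
```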
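
For the mHC bullet, a sketch of the hc_pre / hc_post flow over hc_mult parallel residual streams. The Sinkhorn iteration count, the uniform reduction in hc_pre, and all parameter names here are assumptions for illustration.

```python
# Hedged sketch of manifold-constrained Hyper-Connections.
# streams: (B, L, hc_mult, D) parallel hidden states; all shapes illustrative.
import mlx.core as mx


def sinkhorn(logits, n_iters=3):
    # Push a square mixing matrix toward doubly-stochastic by alternating
    # row and column normalization in log space.
    log_p = logits
    for _ in range(n_iters):
        log_p = log_p - mx.logsumexp(log_p, axis=-1, keepdims=True)  # rows sum to 1
        log_p = log_p - mx.logsumexp(log_p, axis=-2, keepdims=True)  # cols sum to 1
    return mx.exp(log_p)


def hc_pre(streams, pre_logits):
    # pre_logits: (hc_mult, hc_mult). Mix the streams with Sinkhorn-projected
    # weights, then reduce hc_mult streams to one (uniform mean here for brevity).
    mix = sinkhorn(pre_logits)
    mixed = mx.einsum("ij,bljd->blid", mix, streams)
    return mixed.mean(axis=2)                                    # (B, L, D)


def hc_post(f_out, streams, post, comb):
    # post: (hc_mult,), comb: (hc_mult, hc_mult). Expand the block output back
    # to hc_mult streams: post * f_out + comb @ residual streams.
    expanded = post[None, None, :, None] * f_out[:, :, None, :]
    return expanded + mx.einsum("ij,bljd->blid", comb, streams)  # (B, L, hc_mult, D)
```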
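
For the MoE routing bullet, a sketch of the two modes: a tid2eid table lookup for hash-routed layers and a gate with the three score functions for score-routed layers. The top-k selection and the renormalization over selected experts are simplified placeholders.

```python
# Hedged sketch of hash routing vs. score routing. Shapes and the exact
# top-k / normalization behavior are assumptions, not the PR's code.
import mlx.core as mx


def hash_route(token_ids, tid2eid):
    # token_ids: (B, L) int32; tid2eid: (vocab_size,) int32 lookup table.
    return mx.take(tid2eid, token_ids)                    # expert id per token


def score_route(gate_logits, score_func="sigmoid", top_k=4):
    # gate_logits: (B, L, num_experts)
    if score_func == "softmax":
        scores = mx.softmax(gate_logits, axis=-1)
    elif score_func == "sigmoid":
        scores = mx.sigmoid(gate_logits)
    elif score_func == "sqrtsoftplus":
        scores = mx.sqrt(mx.logaddexp(gate_logits, 0.0))  # sqrt(softplus(x))
    else:
        raise ValueError(f"unknown score_func: {score_func}")
    idx = mx.argpartition(-scores, kth=top_k - 1, axis=-1)[..., :top_k]
    weights = mx.take_along_axis(scores, idx, axis=-1)
    return idx, weights / weights.sum(axis=-1, keepdims=True)
```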
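
For the sanitize() bullet, a sketch of the expert-stacking step, assuming raw HF per-expert key names and mlx-lm's stacked switch_mlp layout; the key patterns shown are illustrative, not the checkpoint's exact names.

```python
# Hedged sketch: fold per-expert HF weights into one stacked tensor per
# projection, as SwitchLinear expects. Key patterns are assumptions.
import mlx.core as mx


def stack_experts(weights, prefix, num_experts, proj="gate_proj"):
    keys = [f"{prefix}.experts.{i}.{proj}.weight" for i in range(num_experts)]
    if keys[0] not in weights:
        return weights                                   # already in stacked layout
    stacked = mx.stack([weights.pop(k) for k in keys])   # (num_experts, out, in)
    weights[f"{prefix}.switch_mlp.{proj}.weight"] = stacked
    return weights
```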

Test plan

  • Loads mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit without weight key errors (load/generate repro sketch after this list)
  • Generates coherent text (256GB Mac Studio, 161GB peak, ~13 tok/s greedy)
    • Prompt: Once upon a time, in a forest far away, there lived a
    • Output: little girl named Red Riding Hood. She was known for her bright red cloak that she wore everywhere she went...
  • Factual recall: The capital of France is Paris
  • Verify on Pro variant (mlx-community/deepseek-ai-DeepSeek-V4-Pro-*) — no access yet
  • Sliding-window cache + indexer topk gather (follow-up PR for long-context efficiency)
  • MTP block (training-only — currently dropped in sanitize)
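
The load/generate smoke test above, reproduced with mlx_lm's standard entry points; the generation length is arbitrary here.

```python
# Repro of the generation smoke test via mlx_lm's load/generate API.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Once upon a time, in a forest far away, there lived a",
    max_tokens=256,   # arbitrary length for the smoke test
    verbose=True,
)
```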

Notes

There are a handful of in-flight PRs for the same architecture (#1189, #1190, #1192, #1195). This is a clean-room implementation that loads the mlx-community/deepseek-ai-DeepSeek-V4-Flash-*bit checkpoint family (short-prefix naming with split wo_a.0..7) without remap, and produces correct outputs end-to-end.

🤖 Generated with Claude Code

akashgoswami and others added 2 commits April 26, 2026 11:15
Implements DeepseekV4ForCausalLM architecture for mlx-lm:

- Multi-head Latent Attention (MLA): low-rank Q (wq_a -> q_norm -> wq_b),
  single shared K=V head, grouped low-rank output projection (wo_a x 8 -> wo_b).
- Per-layer compressor (compress_ratio in {0, 4, 128}): learned gated pooling
  produces compressed KV rows, concatenated to the attention key/value stream
  during prefill.
- Indexer module loaded for ratio==4 layers (weights present; topk gather is
  a follow-up).
- Manifold-constrained Hyper-Connections (mHC) replacing residuals: hc_pre
  reduces hc_mult parallel hidden states to 1 via Sinkhorn-projected weights;
  hc_post expands back via post * f_out + comb @ residual.
- Hash-routed MoE for the first num_hash_layers (tid2eid lookup); score-routed
  thereafter with sqrtsoftplus / sigmoid / softmax. Shared experts have no
  swiglu_limit (matches reference).
- YaRN-scaled RoPE with compress_rope_theta on compress layers, vanilla
  rope_theta on non-compress layers; inverse RoPE on output rope dims since
  K==V means V carries position into the attention output.
- sanitize() stacks per-expert weights to SwitchLinear layout when needed and
  drops MTP weights (training-only).
- cast_predicate keeps fp32 mHC params, attn_sink, and gate.bias unconverted.

Verified end-to-end on mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit
(256GB Mac Studio, 161GB peak, ~13 tok/s greedy decode):
  Prompt: "Once upon a time, in a forest far away, there lived a"
  Output: "little girl named Red Riding Hood. She was known for her bright
  red cloak..."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bool attention masks (returned by BatchKVCache.make_mask) use True=visible.
The compressed-KV pad branch was using mx.zeros which evaluates to False on
bool masks, blocking visibility of compressed positions instead of allowing
them. Use the appropriate fill value for the mask dtype.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
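
A sketch of the fix in the second commit: pad the mask with the "visible" value for its dtype rather than always using mx.zeros. The function and variable names are illustrative.

```python
# Hedged sketch: append always-visible columns for the compressed positions.
# True means visible for bool masks; 0.0 means no additive penalty otherwise.
import mlx.core as mx


def pad_mask_for_compressed(mask, n_compressed):
    # mask: (..., L_q, L_kv) -> (..., L_q, L_kv + n_compressed)
    pad_shape = mask.shape[:-1] + (n_compressed,)
    if mask.dtype == mx.bool_:
        pad = mx.ones(pad_shape, dtype=mx.bool_)
    else:
        pad = mx.zeros(pad_shape, dtype=mask.dtype)
    return mx.concatenate([mask, pad], axis=-1)
```
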
@ivanfioravanti
Contributor

Why open another one if there are 4 already open?

@akashgoswami akashgoswami marked this pull request as draft April 26, 2026 20:50