
Add DeepSeek-V4 (Flash) model support #1201

Draft
akashgoswami wants to merge 2 commits into ml-explore:main from akashgoswami:add-deepseek-v4

Conversation

@akashgoswami

Summary

Adds DeepseekV4ForCausalLM architecture support to mlx-lm. Verified end-to-end on mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit: generation produces coherent text.

Architecture pieces implemented (minimal, hedged sketches of the attention, mHC, routing, and expert-stacking pieces follow the list):

  • MLA attention with low-rank Q (wq_a → q_norm → wq_b), single shared K=V head, grouped low-rank output (wo_a × 8 + wo_b).
  • Compressor (compress_ratio ∈ {0, 4, 128}): learned gated pooling builds compressed KV rows during prefill, concatenated to the attention K/V stream. Overlapping windows for ratio==4 (matches reference).
  • Indexer module loaded for ratio==4 layers (weights load; topk gather is a follow-up — current path uses the full compressed pool).
  • Manifold-constrained Hyper-Connections (mHC): hc_pre Sinkhorn-projects mixing weights and reduces hc_mult parallel hidden states to 1; hc_post expands back via post · f_out + comb @ residual. Final-layer head uses the simpler sigmoid-weighted reduction.
  • Hash-routed MoE for the first num_hash_layers (tid2eid lookup); score-routed thereafter with sqrtsoftplus / sigmoid / softmax. Shared experts have no swiglu_limit (matches reference).
  • YaRN-scaled RoPE with compress_rope_theta on compress layers, vanilla rope_theta on non-compress layers. Inverse RoPE applied to output rope dims since K==V means V carries position into the attention output.
  • sanitize() stacks per-expert weights to SwitchLinear layout when raw HF naming is detected, and drops MTP weights (training-only).
  • cast_predicate keeps fp32 mHC params, attn_sink, and gate.bias unconverted.
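
For the MLA bullet, a minimal sketch of the low-rank Q path attending over a single shared K=V head, written against mlx.core / mlx.nn. The dimensions are placeholders and the grouped wo_a/wo_b output projection is collapsed into one linear for brevity; this is not the checkpoint's exact layout.

```python
# Hedged sketch: low-rank Q (wq_a -> q_norm -> wq_b) attending over a single
# shared K=V head. Dimensions are illustrative, not the real config values.
import mlx.core as mx
import mlx.nn as nn


class MLASketch(nn.Module):
    def __init__(self, dim=1024, q_lora_rank=256, num_heads=8, head_dim=128):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.wq_a = nn.Linear(dim, q_lora_rank, bias=False)    # low-rank down-projection
        self.q_norm = nn.RMSNorm(q_lora_rank)                  # norm between the two factors
        self.wq_b = nn.Linear(q_lora_rank, num_heads * head_dim, bias=False)
        self.wkv = nn.Linear(dim, head_dim, bias=False)        # one shared head, K == V
        self.wo = nn.Linear(num_heads * head_dim, dim, bias=False)  # stand-in for wo_a.* / wo_b

    def __call__(self, x, mask=None):
        B, L, _ = x.shape
        q = self.wq_b(self.q_norm(self.wq_a(x)))
        q = q.reshape(B, L, self.num_heads, self.head_dim).transpose(0, 2, 1, 3)
        kv = self.wkv(x).reshape(B, L, 1, self.head_dim).transpose(0, 2, 1, 3)
        # The single latent row serves as both keys and values.
        out = mx.fast.scaled_dot_product_attention(
            q, kv, kv, scale=self.head_dim**-0.5, mask=mask
        )
        return self.wo(out.transpose(0, 2, 1, 3).reshape(B, L, -1))
```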
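
For the mHC bullet, a sketch of the hc_pre / hc_post flow over hc_mult parallel residual streams. The Sinkhorn iteration count, the uniform reduction in hc_pre, and all parameter names here are assumptions for illustration.

```python
# Hedged sketch of manifold-constrained Hyper-Connections.
# streams: (B, L, hc_mult, D) parallel hidden states; all shapes illustrative.
import mlx.core as mx


def sinkhorn(logits, n_iters=3):
    # Push a square mixing matrix toward doubly-stochastic by alternating
    # row and column normalization in log space.
    log_p = logits
    for _ in range(n_iters):
        log_p = log_p - mx.logsumexp(log_p, axis=-1, keepdims=True)  # rows sum to 1
        log_p = log_p - mx.logsumexp(log_p, axis=-2, keepdims=True)  # cols sum to 1
    return mx.exp(log_p)


def hc_pre(streams, pre_logits):
    # pre_logits: (hc_mult, hc_mult). Mix the streams with Sinkhorn-projected
    # weights, then reduce hc_mult streams to one (uniform mean here for brevity).
    mix = sinkhorn(pre_logits)
    mixed = mx.einsum("ij,bljd->blid", mix, streams)
    return mixed.mean(axis=2)                                    # (B, L, D)


def hc_post(f_out, streams, post, comb):
    # post: (hc_mult,), comb: (hc_mult, hc_mult). Expand the block output back
    # to hc_mult streams: post * f_out + comb @ residual streams.
    expanded = post[None, None, :, None] * f_out[:, :, None, :]
    return expanded + mx.einsum("ij,bljd->blid", comb, streams)  # (B, L, hc_mult, D)
```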
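
For the MoE routing bullet, a sketch of the two modes: a tid2eid table lookup for hash-routed layers and a gate with the three score functions for score-routed layers. The top-k selection and the renormalization over selected experts are simplified placeholders.

```python
# Hedged sketch of hash routing vs. score routing. Shapes and the exact
# top-k / normalization behavior are assumptions, not the PR's code.
import mlx.core as mx


def hash_route(token_ids, tid2eid):
    # token_ids: (B, L) int32; tid2eid: (vocab_size,) int32 lookup table.
    return mx.take(tid2eid, token_ids)                    # expert id per token


def score_route(gate_logits, score_func="sigmoid", top_k=4):
    # gate_logits: (B, L, num_experts)
    if score_func == "softmax":
        scores = mx.softmax(gate_logits, axis=-1)
    elif score_func == "sigmoid":
        scores = mx.sigmoid(gate_logits)
    elif score_func == "sqrtsoftplus":
        scores = mx.sqrt(mx.logaddexp(gate_logits, 0.0))  # sqrt(softplus(x))
    else:
        raise ValueError(f"unknown score_func: {score_func}")
    idx = mx.argpartition(-scores, kth=top_k - 1, axis=-1)[..., :top_k]
    weights = mx.take_along_axis(scores, idx, axis=-1)
    return idx, weights / weights.sum(axis=-1, keepdims=True)
```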
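
For the sanitize() bullet, a sketch of the expert-stacking step, assuming raw HF per-expert key names and mlx-lm's stacked switch_mlp layout; the key patterns shown are illustrative, not the checkpoint's exact names.

```python
# Hedged sketch: fold per-expert HF weights into one stacked tensor per
# projection, as SwitchLinear expects. Key patterns are assumptions.
import mlx.core as mx


def stack_experts(weights, prefix, num_experts, proj="gate_proj"):
    keys = [f"{prefix}.experts.{i}.{proj}.weight" for i in range(num_experts)]
    if keys[0] not in weights:
        return weights                                   # already in stacked layout
    stacked = mx.stack([weights.pop(k) for k in keys])   # (num_experts, out, in)
    weights[f"{prefix}.switch_mlp.{proj}.weight"] = stacked
    return weights
```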

Test plan

  • Loads mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit without weight key errors (load/generate repro sketch after this list)
  • Generates coherent text (256GB Mac Studio, 161GB peak, ~13 tok/s greedy)
    • Prompt: Once upon a time, in a forest far away, there lived a
    • Output: little girl named Red Riding Hood. She was known for her bright red cloak that she wore everywhere she went...
  • Factual recall: The capital of France is Paris
  • Verify on Pro variant (mlx-community/deepseek-ai-DeepSeek-V4-Pro-*) — no access yet
  • Sliding-window cache + indexer topk gather (follow-up PR for long-context efficiency)
  • MTP block (training-only — currently dropped in sanitize)
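
The load/generate smoke test above, reproduced with mlx_lm's standard entry points; the generation length is arbitrary here.

```python
# Repro of the generation smoke test via mlx_lm's load/generate API.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Once upon a time, in a forest far away, there lived a",
    max_tokens=256,   # arbitrary length for the smoke test
    verbose=True,
)
```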

Notes

There are a handful of in-flight PRs for the same architecture (#1189, #1190, #1192, #1195). This is a clean-room implementation that loads the mlx-community/deepseek-ai-DeepSeek-V4-Flash-*bit checkpoint family (short-prefix naming with split wo_a.0..7) without remap, and produces correct outputs end-to-end.

🤖 Generated with Claude Code

akashgoswami and others added 2 commits April 26, 2026 11:15
Implements DeepseekV4ForCausalLM architecture for mlx-lm:

- Multi-head Latent Attention (MLA): low-rank Q (wq_a -> q_norm -> wq_b),
  single shared K=V head, grouped low-rank output projection (wo_a x 8 -> wo_b).
- Per-layer compressor (compress_ratio in {0, 4, 128}): learned gated pooling
  produces compressed KV rows, concatenated to the attention key/value stream
  during prefill.
- Indexer module loaded for ratio==4 layers (weights present; topk gather is
  a follow-up).
- Manifold-constrained Hyper-Connections (mHC) replacing residuals: hc_pre
  reduces hc_mult parallel hidden states to 1 via Sinkhorn-projected weights;
  hc_post expands back via post * f_out + comb @ residual.
- Hash-routed MoE for the first num_hash_layers (tid2eid lookup); score-routed
  thereafter with sqrtsoftplus / sigmoid / softmax. Shared experts have no
  swiglu_limit (matches reference).
- YaRN-scaled RoPE with compress_rope_theta on compress layers, vanilla
  rope_theta on non-compress layers; inverse RoPE on output rope dims since
  K==V means V carries position into the attention output.
- sanitize() stacks per-expert weights to SwitchLinear layout when needed and
  drops MTP weights (training-only).
- cast_predicate keeps fp32 mHC params, attn_sink, and gate.bias unconverted.

Verified end-to-end on mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit
(256GB Mac Studio, 161GB peak, ~13 tok/s greedy decode):
  Prompt: "Once upon a time, in a forest far away, there lived a"
  Output: "little girl named Red Riding Hood. She was known for her bright
  red cloak..."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bool attention masks (returned by BatchKVCache.make_mask) use True=visible.
The compressed-KV pad branch was using mx.zeros which evaluates to False on
bool masks, blocking visibility of compressed positions instead of allowing
them. Use the appropriate fill value for the mask dtype.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
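
A sketch of the fix in the second commit: pad the mask with the "visible" value for its dtype rather than always using mx.zeros. The function and variable names are illustrative.

```python
# Hedged sketch: append always-visible columns for the compressed positions.
# True means visible for bool masks; 0.0 means no additive penalty otherwise.
import mlx.core as mx


def pad_mask_for_compressed(mask, n_compressed):
    # mask: (..., L_q, L_kv) -> (..., L_q, L_kv + n_compressed)
    pad_shape = mask.shape[:-1] + (n_compressed,)
    if mask.dtype == mx.bool_:
        pad = mx.ones(pad_shape, dtype=mx.bool_)
    else:
        pad = mx.zeros(pad_shape, dtype=mask.dtype)
    return mx.concatenate([mask, pad], axis=-1)
```
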
@ivanfioravanti
Contributor

Why open another one if there are 4 already open?

@akashgoswami akashgoswami marked this pull request as draft April 26, 2026 20:50