Add DeepSeek-V4 (Flash) model support #1201

Draft

akashgoswami wants to merge 2 commits into ml-explore:main from
Conversation
Implements DeepseekV4ForCausalLM architecture for mlx-lm (illustrative sketches of the main pieces follow this list):
- Multi-head Latent Attention (MLA): low-rank Q (wq_a -> q_norm -> wq_b),
single shared K=V head, grouped low-rank output projection (wo_a x 8 -> wo_b).
- Per-layer compressor (compress_ratio in {0, 4, 128}): learned gated pooling
produces compressed KV rows, concatenated to the attention key/value stream
during prefill.
- Indexer module loaded for ratio==4 layers (weights present; topk gather is
a follow-up).
- Manifold-constrained Hyper-Connections (mHC) replacing residuals: hc_pre
reduces hc_mult parallel hidden states to 1 via Sinkhorn-projected weights;
hc_post expands back via post * f_out + comb @ residual.
- Hash-routed MoE for the first num_hash_layers (tid2eid lookup); score-routed
thereafter with sqrtsoftplus / sigmoid / softmax. Shared experts have no
swiglu_limit (matches reference).
- YaRN-scaled RoPE with compress_rope_theta on compress layers, vanilla
rope_theta on non-compress layers; inverse RoPE on output rope dims since
K==V means V carries position into the attention output.
- sanitize() stacks per-expert weights to SwitchLinear layout when needed and
drops MTP weights (training-only).
- cast_predicate keeps fp32 mHC params, attn_sink, and gate.bias unconverted.
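
A minimal sketch of the MLA query and grouped output path described above. Only `wq_a`, `q_norm`, `wq_b`, `wo_a` (× 8), and `wo_b` come from the description; the class name, dimensions, and method split are illustrative assumptions:

```python
import mlx.nn as nn

class MLASketch(nn.Module):
    def __init__(self, dim=1024, q_lora_rank=256, n_heads=8, head_dim=128):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        # Low-rank Q: project down, normalize, project up to all heads.
        self.wq_a = nn.Linear(dim, q_lora_rank, bias=False)
        self.q_norm = nn.RMSNorm(q_lora_rank)
        self.wq_b = nn.Linear(q_lora_rank, n_heads * head_dim, bias=False)
        # Grouped low-rank output: one wo_a per head, one shared wo_b.
        self.wo_a = [nn.Linear(head_dim, q_lora_rank, bias=False) for _ in range(n_heads)]
        self.wo_b = nn.Linear(q_lora_rank, dim, bias=False)

    def queries(self, x):
        B, L, _ = x.shape
        q = self.wq_b(self.q_norm(self.wq_a(x)))
        return q.reshape(B, L, self.n_heads, self.head_dim)

    def output(self, attn_out):
        # attn_out: (B, L, n_heads, head_dim); sum per-head low-rank pieces.
        parts = [self.wo_a[h](attn_out[..., h, :]) for h in range(self.n_heads)]
        return self.wo_b(sum(parts))
```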
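A sketch of the learned gated pooling used by the compressor, assuming non-overlapping windows and a single-logit gate per row (the PR uses overlapping windows for ratio==4, not shown here):

```python
import mlx.core as mx
import mlx.nn as nn

def gated_pool(kv: mx.array, gate: nn.Linear, ratio: int) -> mx.array:
    # kv: (B, L, D). Fold every `ratio` rows into one compressed row,
    # weighting rows within each window by a learned softmax gate.
    B, L, D = kv.shape
    L_trim = (L // ratio) * ratio
    windows = kv[:, :L_trim].reshape(B, L_trim // ratio, ratio, D)
    w = mx.softmax(gate(windows).squeeze(-1), axis=-1)   # (B, n_windows, ratio)
    return (windows * w[..., None]).sum(axis=2)          # (B, n_windows, D)
```

The compressed rows would then be concatenated onto the attention K/V stream during prefill, as the bullet above describes.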
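A sketch of the mHC mixing step. The Sinkhorn iteration count and the exact `hc_pre`/`hc_post` parameter shapes are assumptions; only the reduce/expand structure and the `post * f_out + comb @ residual` form come from the description:

```python
import mlx.core as mx

def sinkhorn(logits: mx.array, n_iters: int = 5) -> mx.array:
    # Alternate row/column normalization in log space so the mixing
    # matrix approaches doubly stochastic.
    z = logits
    for _ in range(n_iters):
        z = z - mx.logsumexp(z, axis=-1, keepdims=True)
        z = z - mx.logsumexp(z, axis=-2, keepdims=True)
    return mx.exp(z)

def hc_pre(hidden: mx.array, pre_logits: mx.array) -> mx.array:
    # hidden: (hc_mult, B, L, D) parallel streams; mix down to one.
    w = sinkhorn(pre_logits)                    # (hc_mult, hc_mult)
    return mx.einsum("mn,nbld->mbld", w, hidden)[0]

def hc_post(f_out: mx.array, residual: mx.array, post: mx.array, comb: mx.array) -> mx.array:
    # Expand back across streams: post * f_out + comb @ residual.
    return post[:, None, None, None] * f_out[None] + mx.einsum("mn,nbld->mbld", comb, residual)
```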
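A sketch of the two routing modes. `sqrtsoftplus` is assumed here to mean `sqrt(softplus(x))`; the real scoring function, gate shapes, and top-k handling may differ:

```python
import mlx.core as mx
import mlx.nn as nn

def hash_route(token_ids: mx.array, tid2eid: mx.array) -> mx.array:
    # Deterministic routing: fixed token-id -> expert-id table lookup.
    return tid2eid[token_ids]

def score_route(x: mx.array, gate: nn.Linear, top_k: int, score_fn: str = "softmax"):
    logits = gate(x)                                    # (B, L, n_experts)
    if score_fn == "softmax":
        scores = mx.softmax(logits, axis=-1)
    elif score_fn == "sigmoid":
        scores = mx.sigmoid(logits)
    else:                                               # "sqrtsoftplus" (assumed form)
        scores = mx.sqrt(nn.softplus(logits))
    idx = mx.argpartition(-scores, kth=top_k - 1, axis=-1)[..., :top_k]
    return idx, mx.take_along_axis(scores, idx, axis=-1)
```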
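A sketch of "inverse RoPE": applying the same rotation with a negated angle to undo position encoding on the output rope dims. A split-half pair layout is assumed; the actual layout and YaRN scaling are not shown:

```python
import mlx.core as mx

def rope_rotate(x: mx.array, positions: mx.array, theta: float, inverse: bool = False) -> mx.array:
    # x: (..., L, dims) with even dims; pairs are (x[..., :half], x[..., half:]).
    half = x.shape[-1] // 2
    freqs = mx.power(theta, -mx.arange(half, dtype=mx.float32) / half)
    angles = positions[:, None] * freqs[None, :]        # (L, half)
    if inverse:
        angles = -angles                                # undo a prior rotation
    cos, sin = mx.cos(angles), mx.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return mx.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```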
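A sketch of the `sanitize()` behavior: stacking raw per-expert HF weights into a fused `SwitchLinear`-style tensor and dropping MTP weights. The key patterns (`.experts.<i>.`, `.mtp.`, `.switch_mlp.`) are illustrative, not the exact checkpoint naming:

```python
import mlx.core as mx

def sanitize(weights: dict, n_experts: int) -> dict:
    out = {}
    for k, v in weights.items():
        if ".mtp." in k:                  # drop training-only MTP weights
            continue
        if ".experts.0." in k:            # stack experts 0..n-1 into one tensor
            out[k.replace(".experts.0.", ".switch_mlp.")] = mx.stack(
                [weights[k.replace(".experts.0.", f".experts.{e}.")] for e in range(n_experts)]
            )
        elif ".experts." not in k:
            out[k] = v
    return out
```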
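And a sketch of the cast predicate, assuming the convention that returning True means a parameter is safe to cast; the `hc_pre`/`hc_post` name tags are assumptions:

```python
def cast_predicate(name: str) -> bool:
    # True = safe to cast/quantize; keep the listed params in fp32.
    keep_fp32 = ("hc_pre", "hc_post", "attn_sink", "gate.bias")
    return not any(tag in name for tag in keep_fp32)
```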
Verified end-to-end on mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit
(256GB Mac Studio, 161GB peak, ~13 tok/s greedy decode):
Prompt: "Once upon a time, in a forest far away, there lived a"
Output: "little girl named Red Riding Hood. She was known for her bright
red cloak..."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bool attention masks (returned by BatchKVCache.make_mask) use True = visible. The compressed-KV pad branch was using mx.zeros, which evaluates to False on bool masks, blocking visibility of compressed positions instead of allowing them. Use the appropriate fill value for the mask dtype.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
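A sketch of the fix: choose a pad fill that means "visible" for the mask's dtype. The helper name and padding position are illustrative:

```python
import mlx.core as mx

def pad_mask_for_compressed(mask: mx.array, n_compressed: int) -> mx.array:
    # Bool masks use True = visible; additive float masks use 0.0 = visible.
    fill = True if mask.dtype == mx.bool_ else 0.0
    pad = mx.full((*mask.shape[:-1], n_compressed), fill, dtype=mask.dtype)
    return mx.concatenate([pad, mask], axis=-1)
```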
Contributor

Why open another one if there are 4 opened already?
Summary

Adds `DeepseekV4ForCausalLM` architecture support to mlx-lm. Verified end-to-end on `mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit` — produces coherent text generation.

Architecture pieces implemented:
- Multi-head Latent Attention (MLA): low-rank Q (`wq_a` → `q_norm` → `wq_b`), single shared K=V head, grouped low-rank output (`wo_a` × 8 + `wo_b`).
- Per-layer compressor (`compress_ratio ∈ {0, 4, 128}`): learned gated pooling builds compressed KV rows during prefill, concatenated to the attention K/V stream. Overlapping windows for ratio==4 (matches reference).
- Manifold-constrained Hyper-Connections (mHC): `hc_pre` Sinkhorn-projects mixing weights and reduces `hc_mult` parallel hidden states to 1; `hc_post` expands back via `post · f_out + comb @ residual`. Final-layer head uses the simpler sigmoid-weighted reduction.
- Hash-routed MoE for the first `num_hash_layers` (`tid2eid` lookup); score-routed thereafter with `sqrtsoftplus`/`sigmoid`/`softmax`. Shared experts have no `swiglu_limit` (matches reference).
- YaRN-scaled RoPE with `compress_rope_theta` on compress layers, vanilla `rope_theta` on non-compress layers. Inverse RoPE applied to output rope dims since K==V means V carries position into the attention output.
- `sanitize()` stacks per-expert weights to `SwitchLinear` layout when raw HF naming is detected, and drops MTP weights (training-only).
- `cast_predicate` keeps fp32 mHC params, `attn_sink`, and `gate.bias` unconverted.

Test plan
- Loads `mlx-community/deepseek-ai-DeepSeek-V4-Flash-4bit` without weight key errors
- "Once upon a time, in a forest far away, there lived a" → "little girl named Red Riding Hood. She was known for her bright red cloak that she wore everywhere she went..."
- "The capital of France is" → "Paris"
- Larger variants (`mlx-community/deepseek-ai-DeepSeek-V4-Pro-*`) — no access yet
- … (`sanitize`)
Notes

There are a handful of in-flight PRs for the same architecture (#1189, #1190, #1192, #1195). This is a clean-room implementation that loads the `mlx-community/deepseek-ai-DeepSeek-V4-Flash-*bit` checkpoint family (short-prefix naming with split `wo_a.0..7`) without remap, and produces correct outputs end-to-end.

🤖 Generated with Claude Code