feat(nemotron_h): add Multi-Token Prediction (MTP) module #1161
Open
Thump604 wants to merge 1 commit into ml-explore:main
Conversation
Nemotron-3-Super-120B ships MTP prediction heads (1,040 keys covering attention + 512-expert MoE), but neither HF transformers nor mlx-lm currently uses them — both explicitly strip `mtp.*` weights during load. This commit adds native MTP support to the Nemotron-H model:

- NemotronHMTPModule: dual-norm embedding/hidden fusion via eh_proj, followed by attention + MoE layers matching the mtp_hybrid_override_pattern
- NemotronHMTPBlock: supports attention (*), MoE (E), and MLP (-) types
- Model gains mtp_forward(), make_mtp_cache(), and return_hidden on __call__
- sanitize() remaps HF mtp.layers.0.* keys and stacks 512-expert weights
- Weight stripping of mtp.* keys removed

Tested: 38% MTP acceptance rate on Nemotron-3-Super-120B-A12B-5bit with extracted FP16 MTP weights. Coherent generation confirmed.

The generate-level mtp_generate_step() integration is a separate concern and will follow in a subsequent PR.
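For illustration, here is a minimal sketch of the kind of key remapping and expert stacking the commit message describes. The HF key layout shown (`mtp.layers.0.*` with per-expert `experts.<i>.*` shards) is an assumption for the example, not necessarily the exact checkpoint structure:

```python
import mlx.core as mx

def sanitize_mtp_sketch(weights: dict) -> dict:
    """Illustrative only: flatten `mtp.layers.0.*` keys to `mtp.*` and stack
    per-expert shards into one tensor, instead of dropping `mtp.*` keys."""
    out = {}
    expert_shards = {}  # (prefix, suffix) -> {expert index: tensor}
    for key, value in weights.items():
        if not key.startswith("mtp."):
            out[key] = value
            continue
        # mtp.layers.0.<rest> -> mtp.<rest> (the checkpoint has a single MTP block)
        key = key.replace("mtp.layers.0.", "mtp.", 1)
        if ".experts." in key:
            # e.g. mtp.mixer.experts.17.up_proj.weight (hypothetical layout)
            prefix, rest = key.split(".experts.", 1)
            index, suffix = rest.split(".", 1)
            expert_shards.setdefault((prefix, suffix), {})[int(index)] = value
        else:
            out[key] = value
    # Stack the 512 per-expert tensors along a new leading axis.
    for (prefix, suffix), shards in expert_shards.items():
        out[f"{prefix}.experts.{suffix}"] = mx.stack(
            [shards[i] for i in sorted(shards)]
        )
    return out
```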
Summary
Nemotron-3-Super-120B ships MTP prediction heads (1,040 weight keys covering attention + 512-expert MoE) in its HuggingFace checkpoint, but neither HF transformers nor mlx-lm currently uses them — `sanitize()` explicitly strips `mtp.*` keys.

This PR adds native MTP support to the Nemotron-H model definition:

- `NemotronHMTPModule`: dual-norm embedding/hidden fusion via `eh_proj`, followed by attention + MoE layers matching `mtp_hybrid_override_pattern` (`*E` = 1 attention + 1 MoE layer); see the sketch after this list
- `NemotronHMTPBlock`: supports attention (`*`), MoE (`E`), and MLP (`-`) block types
- `mtp_forward()`, `make_mtp_cache()`, and a `return_hidden` parameter on `__call__` — the standard MTP model contract already used by Qwen3.5 models
- `sanitize()` remaps HF `mtp.layers.0.*` → flat `mtp.*` keys and stacks 512-expert weight shards
- Removed `mtp.*` weight stripping so MTP weights are loaded when present
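A rough sketch of the dual-norm fusion from the first bullet follows; the class, attribute, and parameter names here are assumptions for illustration rather than the exact code added in this PR:

```python
import mlx.core as mx
import mlx.nn as nn

class MTPFusionSketch(nn.Module):
    """Illustrative MTP head entry point: normalize the token embedding and
    the prior hidden state separately, concatenate, and project back to the
    hidden size before the MTP transformer blocks (not shown here)."""

    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.enorm = nn.RMSNorm(hidden_size, eps=eps)   # norm for token embeddings
        self.hnorm = nn.RMSNorm(hidden_size, eps=eps)   # norm for prior hidden states
        self.eh_proj = nn.Linear(2 * hidden_size, hidden_size, bias=False)

    def __call__(self, embeddings: mx.array, hidden: mx.array) -> mx.array:
        fused = mx.concatenate(
            [self.enorm(embeddings), self.hnorm(hidden)], axis=-1
        )
        # The real module would then run this through the attention + MoE
        # blocks given by mtp_hybrid_override_pattern ("*E") and a final norm.
        return self.eh_proj(fused)
```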
Test results

38% MTP acceptance rate on `Nemotron-3-Super-120B-A12B-5bit` with extracted FP16 MTP weights (5.5 GB from BF16 shards 49-50). Coherent generation confirmed.
Scope

This is model-only — it adds the MTP architecture and weight loading to `nemotron_h.py`. The generate-level `mtp_generate_step()` integration (which calls `mtp_forward()` during decoding) is a separate concern for a follow-up PR.

The model-level interface (`mtp_forward`, `make_mtp_cache`, `return_hidden`) matches the existing Qwen3.5 MTP contract, so existing MTP generate infrastructure can use it without modification.
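For context, a hedged sketch of how a decode step could consume this model-level contract once `mtp_generate_step()` lands; the exact call signatures and return values below are assumptions based on the description, not the final generate-side API:

```python
import mlx.core as mx

def draft_next_token_sketch(model, tokens: mx.array, cache, mtp_cache):
    """Illustrative decode step: get logits plus hidden states from the main
    model, then ask the MTP head for a draft of the following token."""
    # return_hidden=True is assumed to expose hidden states alongside logits.
    logits, hidden = model(tokens, cache=cache, return_hidden=True)
    next_token = mx.argmax(logits[:, -1, :], axis=-1)

    # The MTP head fuses the new token's embedding with the hidden state to
    # draft the token after it; in speculative decoding the draft is later
    # verified by the main model and either accepted or discarded.
    mtp_logits = model.mtp_forward(next_token[:, None], hidden, cache=mtp_cache)
    draft_token = mx.argmax(mtp_logits[:, -1, :], axis=-1)
    return next_token, draft_token
```

In a speculative-decoding setup, the 38% acceptance figure above presumably corresponds to the share of MTP-drafted tokens that the main model later confirms.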
MTP weight availability

The MTP weights are present in the original NVIDIA checkpoint but need to be extracted separately since they span the last 2 of 50 BF16 safetensors shards. A conversion script for extracting and stacking the MTP weights is available separately.
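A minimal sketch of that extraction step, assuming the checkpoint uses standard safetensors shards that `mx.load` can read; the shard paths and output filename here are placeholders, not the actual conversion script:

```python
import mlx.core as mx

def extract_mtp_weights_sketch(shard_paths, out_path="mtp_weights.safetensors"):
    """Illustrative only: pull the `mtp.*` tensors out of the final BF16
    shards, cast them to FP16, and save them as a standalone safetensors file."""
    mtp = {}
    for path in shard_paths:  # e.g. the last two of the 50 checkpoint shards
        for key, value in mx.load(path).items():
            if key.startswith("mtp."):
                mtp[key] = value.astype(mx.float16)
    mx.save_safetensors(out_path, mtp)
    return out_path
```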