[None][fix] Enable MiniMax M3 piecewise CUDA graphs #15923

Open
liji-nv wants to merge 1 commit into NVIDIA:main from liji-nv:fix/minimax-m3-piecewise-cudagraph

@liji-nv liji-nv commented Jul 3, 2026

Collaborator

Wrap the MiniMax M3 metadata- and cache-dependent attention core in an inplace custom op so torch.compile can split it out of piecewise CUDA graphs. Keep QKV/index projections, QK normalization, RoPE, and the output projection visible to the compiled graph.

Write dense and sparse attention results into the custom-op output buffer. Preserve FP32 sparse GQA accumulation until the final copy/cast, and expose the output buffer through MiniMaxM3SparseRuntimeBackend.forward.

Register attention boundaries and mutation metadata through optional TRT-LLM op lookup, matching the latest GDN registration pattern from PR #15594. This avoids depending on model-specific custom ops being imported when compilation utilities initialize.
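The optional-lookup idea can be sketched with the standard library alone. Everything below is illustrative: `get_optional_op`, `build_inplace_map`, the stand-in namespace `fake_ops`, and the shape of the map are assumptions, not the TRT-LLM helpers themselves.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List, Optional


def get_optional_op(namespace: Any, name: str) -> Optional[Callable]:
    """Return namespace.<name> if it is registered, else None instead of raising."""
    return getattr(namespace, name, None)


def build_inplace_map(namespace: Any,
                      specs: Dict[str, List[str]]) -> Dict[Callable, List[str]]:
    """Map each available op to the argument names it mutates; skip missing ops."""
    inplace_map: Dict[Callable, List[str]] = {}
    for op_name, mutated_args in specs.items():
        op = get_optional_op(namespace, op_name)
        if op is not None:
            inplace_map[op] = mutated_args
    return inplace_map


# Stand-in for an op namespace in which only one model-specific op was imported.
fake_ops = SimpleNamespace(attn_inplace=lambda *args: None)

# Hypothetical op names; "gdn_inplace" is deliberately not registered above.
specs = {"attn_inplace": ["output"], "gdn_inplace": ["output"]}
inplace_map = build_inplace_map(fake_ops, specs)
```

Here the map ends up with a single entry, and the missing op is skipped silently rather than failing at initialization time.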

Track piecewise runners owned by the compile backend and reset their CUDA graphs, captured addresses, outputs, and warmup state when phase-1 KV-cache estimation is released. Phase 2 then recaptures against the final allocations instead of replaying stale graph pointers.

Add an 8-GPU MiniMax-M3-MXFP8 torch.compile E2E variant covering TP8/EP8, attention DP, TRTLLM MoE, padding CUDA graphs, multi-stream piecewise capture, and phase-2 recapture.

Summary by CodeRabbit

  • New Features

    • Added support for writing attention results into preallocated output buffers, improving flexibility for MiniMax-M3 attention paths.
    • Expanded compile-time piecewise CUDA graph handling so captured runners can be tracked and cleared later.
    • Introduced broader support for MiniMax-M3 dense and sparse attention execution under compile-aware workflows.
  • Bug Fixes

    • Improved cleanup of CUDA graph resources during model shutdown.
    • Added coverage for a new MiniMax-M3 MXFP8 piecewise CUDA graph scenario.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@liji-nv liji-nv requested review from a team as code owners July 3, 2026 12:28
@liji-nv liji-nv force-pushed the fix/minimax-m3-piecewise-cudagraph branch from bfdfb57 to 31496ff Compare July 3, 2026 12:28
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
@liji-nv liji-nv force-pushed the fix/minimax-m3-piecewise-cudagraph branch from 31496ff to c333655 Compare July 3, 2026 12:29
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Contributor

📝 Walkthrough

Adds an optional preallocated output tensor parameter threaded through MiniMax-M3 sparse attention functions, runtime backend, and dense/sparse attention cores, with a compiled custom-op boundary for the dense path. Also adds tracking and cleanup of piecewise CUDA graph runners in the torch.compile backend, invoked from model engine cleanup.

Changes

MiniMax-M3 output tensor plumbing

| Layer | File(s) | Summary |
|---|---|---|
| Sparse GQA masked output buffer support | tensorrt_llm/_torch/attention_backend/sparse/minimax_m3/backend.py | _sparse_gqa_masked and its decode/prefill entry points accept an optional output tensor and write results directly into it after FP32 accumulation, validating shape when provided. |
| Runtime backend output resolution and forwarding | tensorrt_llm/_torch/attention_backend/sparse/minimax_m3/backend.py | MiniMaxM3SparseRuntimeBackend.forward_sparse and forward accept/merge output from AttentionForwardArgs and kwargs, raise on conflicting values, and pass the resolved buffer into decode/prefill calls. |
| MiniMax-M3 attention core writes into output buffer | tensorrt_llm/_torch/models/modeling_minimaxm3.py | Dense and sparse paths in MiniMaxM3Attention add _forward_attention_core/_attention_core/_sparse_attention_core, validate metadata earlier, and write results directly into a caller-provided output buffer instead of building intermediate tensors. |
| Compiled custom-op boundary and optional op lookup | tensorrt_llm/_torch/models/modeling_minimaxm3.py, tensorrt_llm/_torch/compilation/utils.py | Adds minimax_m3_attn_custom_op_inplace registered via torch.library.custom_op, a get_optional_trtllm_op helper, and conditional inplace_map entries exposing available ops to the piecewise compiler. |
| MXFP8 piecewise CUDA graph test coverage | tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/test_lists/qa/llm_function_core.txt | Adds an integration test running MiniMax-M3-MXFP8 with piecewise CUDA graph and torch.compile settings, plus a QA test-list entry. |
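The "accept/merge output and raise on conflicting values" behavior described in the table can be sketched as follows. This is a guess at the shape of the logic, not the backend code: `ForwardArgs` is a hypothetical stand-in for AttentionForwardArgs, `resolve_output` is an invented name, and plain lists stand in for tensors.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ForwardArgs:
    """Hypothetical stand-in for AttentionForwardArgs with only an output field."""
    output: Optional[Any] = None


def resolve_output(forward_args: Optional[ForwardArgs],
                   kwargs: Dict[str, Any]) -> Optional[Any]:
    """Pick the single output buffer; refuse two different buffers."""
    from_args = forward_args.output if forward_args is not None else None
    from_kwargs = kwargs.get("output")
    if (from_args is not None and from_kwargs is not None
            and from_args is not from_kwargs):
        raise ValueError("conflicting output buffers in forward_args and kwargs")
    return from_args if from_args is not None else from_kwargs


buf = [0.0] * 4                     # a list stands in for a preallocated tensor
only_args = resolve_output(ForwardArgs(output=buf), {})
only_kwargs = resolve_output(None, {"output": buf})
same_both = resolve_output(ForwardArgs(output=buf), {"output": buf})

try:
    resolve_output(ForwardArgs(output=buf), {"output": [1.0] * 4})
    conflict_raised = False
except ValueError:
    conflict_raised = True
```

The sketch compares buffers by identity, since two distinct buffers with equal contents would still be a conflict about where to write.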

Piecewise CUDA graph runner tracking and cleanup

| Layer | File(s) | Summary |
|---|---|---|
| PiecewiseRunner collection and clear_cuda_graphs | tensorrt_llm/_torch/compilation/piecewise_optimizer.py | PiecewiseInterpreter collects created PiecewiseRunner instances and piecewise_optimizer returns them; PiecewiseRunner.clear_cuda_graphs() resets captured graphs and cached state. |
| Backend tracking and clear_piecewise_cuda_graphs | tensorrt_llm/_torch/compilation/backend.py | Backend stores returned runners in a WeakSet and exposes clear_piecewise_cuda_graphs() to invoke clear_cuda_graphs() on each tracked runner. |
| Model engine invokes piecewise cleanup | tensorrt_llm/_torch/pyexecutor/model_engine.py | _release_cuda_graphs calls clear_piecewise_cuda_graphs() on _torch_compile_backend before clearing other CUDA graph runners. |
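The tracking-and-reset flow in this table can be sketched with the standard library. The classes below are simplified stand-ins, not the real PiecewiseRunner or Backend: captured graphs are represented by strings, addresses by integers, and the `capture` and `track` methods are invented for illustration.

```python
import weakref
from typing import Dict, Iterable


class Runner:
    """Hypothetical stand-in for a piecewise runner holding captured state."""

    def __init__(self) -> None:
        self.graphs: Dict[int, str] = {}    # batch size -> captured graph handle
        self.outputs: Dict[int, int] = {}   # batch size -> captured output address
        self.warmed_up = False

    def capture(self, batch_size: int, address: int) -> None:
        # Pretend to capture a graph that bakes in a device address.
        self.graphs[batch_size] = f"graph@{address:#x}"
        self.outputs[batch_size] = address
        self.warmed_up = True

    def clear_cuda_graphs(self) -> None:
        # Drop everything tied to the old allocations.
        self.graphs.clear()
        self.outputs.clear()
        self.warmed_up = False


class Backend:
    """Tracks runners weakly so tracking alone never keeps a runner alive."""

    def __init__(self) -> None:
        self._runners: "weakref.WeakSet[Runner]" = weakref.WeakSet()

    def track(self, runners: Iterable[Runner]) -> None:
        for runner in runners:
            self._runners.add(runner)

    def clear_piecewise_cuda_graphs(self) -> None:
        # Copy to a list first: the set may shrink if a runner is collected.
        for runner in list(self._runners):
            runner.clear_cuda_graphs()


backend = Backend()
runner = Runner()
backend.track([runner])

runner.capture(batch_size=8, address=0x1000)   # phase 1: temporary allocations
backend.clear_piecewise_cuda_graphs()          # phase-1 estimation released
cleared = (not runner.graphs) and (not runner.outputs) and (not runner.warmed_up)
runner.capture(batch_size=8, address=0x2000)   # phase 2: final allocations
```

After the reset, the second capture records the new address, so nothing replays a graph that points at the released phase-1 memory.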

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant RuntimeBackend as MiniMaxM3SparseRuntimeBackend
  participant Decode as minimax_m3_sparse_decode/prefill
  participant Masked as _sparse_gqa_masked

  Caller->>RuntimeBackend: forward(forward_args, output, kwargs)
  RuntimeBackend->>RuntimeBackend: merge forward_args/kwargs, resolve output
  RuntimeBackend->>Decode: forward_sparse(output=resolved_output)
  Decode->>Masked: _sparse_gqa_masked(output=resolved_output)
  Masked-->>Decode: writes result into output tensor
sequenceDiagram
  participant ModelEngine as PyTorchModelEngine
  participant Backend
  participant Runner as PiecewiseRunner

  ModelEngine->>Backend: clear_piecewise_cuda_graphs()
  Backend->>Runner: clear_cuda_graphs() (for each tracked runner)
  Runner-->>Backend: resets captured graphs and cached state

Suggested reviewers: QiJune, fredricz-20070104, Tabrizian, jieli-matrix

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title is concise and accurately summarizes the main change: enabling MiniMax M3 piecewise CUDA graphs. |
| Description check | ✅ Passed | The description explains the motivation, solution, and added test coverage, though it could be more explicitly structured per the template. |
Comment @coderabbitai help to get the list of available commands.

@liji-nv

liji-nv commented Jul 3, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #57464 [ run ] triggered by Bot. Commit: c333655 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #57464 [ run ] completed with state SUCCESS. Commit: c333655
/LLM/main/L0_MergeRequest_PR pipeline #46201 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation
