[None][fix] Enable MiniMax M3 piecewise CUDA graphs by liji-nv · Pull Request #15923 · NVIDIA/TensorRT-LLM

liji-nv · 2026-07-03T12:28:00Z

Wrap the MiniMax M3 metadata- and cache-dependent attention core in an inplace custom op so torch.compile can split it out of piecewise CUDA graphs. Keep QKV/index projections, QK normalization, RoPE, and the output projection visible to the compiled graph.

Write dense and sparse attention results into the custom-op output buffer. Preserve FP32 sparse GQA accumulation until the final copy/cast, and expose the output buffer through MiniMaxM3SparseRuntimeBackend.forward.
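
The accumulate-in-FP32, cast-once pattern can be sketched as follows. This is a simplified single-head masked attention, not the PR's sparse GQA kernel; the function name `masked_attention_into` and the tensor shapes are assumptions for illustration.

```python
import torch


def masked_attention_into(q, k, v, mask, output):
    # Accumulate scores and the weighted sum in FP32 regardless of input dtype.
    scores = (q.float() @ k.float().transpose(-1, -2)) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    acc = torch.softmax(scores, dim=-1) @ v.float()
    if output.shape != acc.shape:
        raise ValueError(
            f"output shape {tuple(output.shape)} != {tuple(acc.shape)}")
    # Single cast at the end: copy_ converts FP32 to the buffer's dtype.
    output.copy_(acc)
    return output


q = torch.randn(4, 8).to(torch.bfloat16)
k = torch.randn(6, 8).to(torch.bfloat16)
v = torch.randn(6, 8).to(torch.bfloat16)
mask = torch.ones(4, 6, dtype=torch.bool)
mask[:, 4:] = False  # pretend the last two keys were not selected
out = torch.empty(4, 8, dtype=torch.bfloat16)
masked_attention_into(q, k, v, mask, out)
```

Keeping the accumulator in FP32 until the final `copy_` avoids rounding intermediate sums to the lower-precision output dtype.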

Register attention boundaries and mutation metadata through optional TRT-LLM op lookup, matching the latest GDN registration pattern from PR #15594. This avoids depending on model-specific custom ops being imported when compilation utilities initialize.
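
A minimal sketch of the optional lookup idea, assuming a namespace object on which unregistered ops are simply missing attributes. The helper names `get_optional_op` and `build_inplace_map`, the op names, and the metadata shape are hypothetical; the PR's own helper and registration tables are not reproduced here.

```python
from types import SimpleNamespace


def get_optional_op(namespace, name):
    """Return namespace.<name>, or None when the op was never registered."""
    return getattr(namespace, name, None)


def build_inplace_map(namespace, candidates):
    # candidates: op name -> metadata describing which arguments are mutated.
    inplace_map = {}
    for name, mutated_args in candidates.items():
        op = get_optional_op(namespace, name)
        if op is not None:  # skip ops whose defining module is not imported
            inplace_map[op] = mutated_args
    return inplace_map


# Stand-in for an op namespace in which only one model's op is registered.
ops = SimpleNamespace(model_a_attn_inplace=lambda *args: None)
inplace_map = build_inplace_map(
    ops,
    {"model_a_attn_inplace": {"output": 3},
     "model_b_attn_inplace": {"output": 3}},
)
```

With this shape, the compilation utilities can be imported before any model module, and an op only appears in the map once its model has registered it.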

Track piecewise runners owned by the compile backend and reset their CUDA graphs, captured addresses, outputs, and warmup state when phase-1 KV-cache estimation is released. Phase 2 then recaptures against the final allocations instead of replaying stale graph pointers.
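
The stale-pointer problem and the reset can be illustrated with a toy runner that uses plain Python objects in place of CUDA graphs. The class `ToyPiecewiseRunner`, its fields, and the integer "addresses" are assumptions for this sketch; only the `clear_cuda_graphs` method name comes from the PR.

```python
class ToyPiecewiseRunner:
    """Stand-in for a runner caching one captured graph per batch size."""

    def __init__(self):
        self.graphs = {}        # batch size -> captured "graph"
        self.addresses = {}     # batch size -> input address seen at capture
        self.outputs = {}       # batch size -> cached output handle
        self.warmed_up = set()  # batch sizes that finished warmup

    def run(self, batch_size, input_address):
        if batch_size not in self.warmed_up:
            self.warmed_up.add(batch_size)  # first call: eager warmup only
            return "eager"
        if batch_size not in self.graphs:
            self.graphs[batch_size] = object()  # second call: capture
            self.addresses[batch_size] = input_address
            self.outputs[batch_size] = f"out@{input_address:#x}"
            return "captured"
        # A replay is only valid against the addresses baked in at capture.
        assert self.addresses[batch_size] == input_address, "stale graph pointer"
        return "replayed"

    def clear_cuda_graphs(self):
        self.graphs.clear()
        self.addresses.clear()
        self.outputs.clear()
        self.warmed_up.clear()


runner = ToyPiecewiseRunner()
phase1 = [runner.run(8, 0x1000) for _ in range(3)]
runner.clear_cuda_graphs()  # phase-1 KV-cache estimation released
phase2 = [runner.run(8, 0x2000) for _ in range(3)]
```

Without the `clear_cuda_graphs()` call, the first phase-2 run would try to replay a graph captured against the phase-1 address and trip the assertion.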

Add an 8-GPU MiniMax-M3-MXFP8 torch.compile E2E variant covering TP8/EP8, attention DP, TRTLLM MoE, padding CUDA graphs, multi-stream piecewise capture, and phase-2 recapture.

Summary by CodeRabbit

New Features
- Added support for writing attention results into preallocated output buffers, improving flexibility for MiniMax-M3 attention paths.
- Expanded compile-time piecewise CUDA graph handling so captured runners can be tracked and cleared later.
- Introduced broader support for MiniMax-M3 dense and sparse attention execution under compile-aware workflows.
Bug Fixes
- Improved cleanup of CUDA graph resources when CUDA graphs are released, so piecewise runners recapture against the final allocations.
Tests
- Added coverage for a new MiniMax-M3 MXFP8 piecewise CUDA graph scenario.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>

coderabbitai · 2026-07-03T12:30:21Z

📝 Walkthrough

Walkthrough

Adds an optional preallocated output tensor parameter threaded through MiniMax-M3 sparse attention functions, runtime backend, and dense/sparse attention cores, with a compiled custom-op boundary for the dense path. Also adds tracking and cleanup of piecewise CUDA graph runners in the torch.compile backend, invoked from model engine cleanup.

Changes

MiniMax-M3 output tensor plumbing

| Layer / File(s) | Summary |
| --- | --- |
| Sparse GQA masked output buffer support `tensorrt_llm/_torch/attention_backend/sparse/minimax_m3/backend.py` | `_sparse_gqa_masked` and its decode/prefill entry points accept an optional `output` tensor and write results directly into it after FP32 accumulation, validating shape when provided. |
| Runtime backend output resolution and forwarding `tensorrt_llm/_torch/attention_backend/sparse/minimax_m3/backend.py` | `MiniMaxM3SparseRuntimeBackend.forward_sparse` and `forward` accept/merge `output` from `AttentionForwardArgs` and kwargs, raise on conflicting values, and pass the resolved buffer into decode/prefill calls. |
| MiniMax-M3 attention core writes into output buffer `tensorrt_llm/_torch/models/modeling_minimaxm3.py` | Dense and sparse paths in `MiniMaxM3Attention` add `_forward_attention_core`/`_attention_core`/`_sparse_attention_core`, validate metadata earlier, and write results directly into a caller-provided output buffer instead of building intermediate tensors. |
| Compiled custom-op boundary and optional op lookup `tensorrt_llm/_torch/models/modeling_minimaxm3.py`, `tensorrt_llm/_torch/compilation/utils.py` | Adds `minimax_m3_attn_custom_op_inplace` registered via `torch.library.custom_op`, a `get_optional_trtllm_op` helper, and conditional `inplace_map` entries exposing available ops to the piecewise compiler. |
| MXFP8 piecewise CUDA graph test coverage `tests/integration/defs/accuracy/test_llm_api_pytorch.py`, `tests/integration/test_lists/qa/llm_function_core.txt` | Adds an integration test running MiniMax-M3-MXFP8 with piecewise CUDA graph and torch.compile settings, plus a QA test-list entry. |
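
The output-resolution rule summarized in the table (merge `output` from the forward arguments and from kwargs, raise on conflict) can be sketched roughly as below. `ForwardArgs` is a hypothetical stand-in for `AttentionForwardArgs`, and `resolve_output` is an illustrative name rather than a function from the PR.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ForwardArgs:  # hypothetical stand-in for AttentionForwardArgs
    output: Optional[Any] = None


def resolve_output(forward_args: Optional[ForwardArgs] = None, **kwargs):
    from_args = forward_args.output if forward_args is not None else None
    from_kwargs = kwargs.get("output")
    # Two different buffers would make the write target ambiguous.
    if (from_args is not None and from_kwargs is not None
            and from_args is not from_kwargs):
        raise ValueError("conflicting `output` buffers in forward_args and kwargs")
    return from_args if from_args is not None else from_kwargs


buf = bytearray(16)
same = resolve_output(ForwardArgs(output=buf), output=buf)
only_kwargs = resolve_output(None, output=buf)
missing = resolve_output(ForwardArgs())
```

Comparing by identity rather than by value is a design choice of this sketch: two equal-valued but distinct buffers still count as a conflict, because only one of them can receive the result.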

Piecewise CUDA graph runner tracking and cleanup

| Layer / File(s) | Summary |
| --- | --- |
| PiecewiseRunner collection and clear_cuda_graphs `tensorrt_llm/_torch/compilation/piecewise_optimizer.py` | `PiecewiseInterpreter` collects created `PiecewiseRunner` instances and `piecewise_optimizer` returns them; `PiecewiseRunner.clear_cuda_graphs()` resets captured graphs and cached state. |
| Backend tracking and clear_piecewise_cuda_graphs `tensorrt_llm/_torch/compilation/backend.py` | `Backend` stores returned runners in a `WeakSet` and exposes `clear_piecewise_cuda_graphs()` to invoke `clear_cuda_graphs()` on each tracked runner. |
| Model engine invokes piecewise cleanup `tensorrt_llm/_torch/pyexecutor/model_engine.py` | `_release_cuda_graphs` calls `clear_piecewise_cuda_graphs()` on `_torch_compile_backend` before clearing other CUDA graph runners. |
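
A small sketch of the `WeakSet`-based tracking summarized above, using toy classes. The `Runner` class, the `track` method, and the `_piecewise_runners` attribute name are assumptions for illustration; `clear_piecewise_cuda_graphs` and `clear_cuda_graphs` are the method names given in the table.

```python
import gc
import weakref


class Runner:
    def __init__(self):
        self.graphs = {"bs8": object()}  # pretend one graph is captured

    def clear_cuda_graphs(self):
        self.graphs.clear()


class Backend:
    def __init__(self):
        # Weak references: tracking must not keep discarded runners alive.
        self._piecewise_runners = weakref.WeakSet()

    def track(self, runners):
        self._piecewise_runners.update(runners)

    def clear_piecewise_cuda_graphs(self):
        # Snapshot the live runners before calling into them.
        for runner in list(self._piecewise_runners):
            runner.clear_cuda_graphs()


backend = Backend()
kept, dropped = Runner(), Runner()
backend.track([kept, dropped])
del dropped
gc.collect()
tracked = len(backend._piecewise_runners)
backend.clear_piecewise_cuda_graphs()
```

Holding the runners weakly means the backend can reset every runner that is still in use without extending the lifetime of runners whose compiled graph modules have already been dropped.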

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant RuntimeBackend as MiniMaxM3SparseRuntimeBackend
  participant Decode as minimax_m3_sparse_decode/prefill
  participant Masked as _sparse_gqa_masked

  Caller->>RuntimeBackend: forward(forward_args, output, kwargs)
  RuntimeBackend->>RuntimeBackend: merge forward_args/kwargs, resolve output
  RuntimeBackend->>Decode: forward_sparse(output=resolved_output)
  Decode->>Masked: _sparse_gqa_masked(output=resolved_output)
  Masked-->>Decode: writes result into output tensor

sequenceDiagram
  participant ModelEngine as PyTorchModelEngine
  participant Backend
  participant Runner as PiecewiseRunner

  ModelEngine->>Backend: clear_piecewise_cuda_graphs()
  Backend->>Runner: clear_cuda_graphs() (for each tracked runner)
  Runner-->>Backend: resets captured graphs and cached state

Suggested reviewers: QiJune, fredricz-20070104, Tabrizian, jieli-matrix

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title is concise and accurately summarizes the main change: enabling MiniMax M3 piecewise CUDA graphs. |
| Description check | ✅ Passed | The description explains the motivation, solution, and added test coverage, though it could be more explicitly structured per the template. |


liji-nv · 2026-07-03T12:34:19Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-03T12:39:43Z

PR_Github #57464 [ run ] triggered by Bot. Commit: c333655

tensorrt-cicd · 2026-07-03T20:49:14Z

PR_Github #57464 [ run ] completed with state SUCCESS. Commit: c333655
/LLM/main/L0_MergeRequest_PR pipeline #46201 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again


liji-nv requested review from a team as code owners July 3, 2026 12:28

liji-nv requested review from yechank-nvidia, yizhang-nv and yuxianq July 3, 2026 12:28

github-actions Bot assigned liji-nv Jul 3, 2026

liji-nv force-pushed the fix/minimax-m3-piecewise-cudagraph branch from bfdfb57 to 31496ff Compare July 3, 2026 12:28

liji-nv force-pushed the fix/minimax-m3-piecewise-cudagraph branch from 31496ff to c333655 Compare July 3, 2026 12:29

liji-nv wants to merge 1 commit into NVIDIA:main from liji-nv:fix/minimax-m3-piecewise-cudagraph