[None][fix] Enable cuda_scaled_mm fast path for FP8 linear on SM121 by souvikDevloper · Pull Request #15928 · NVIDIA/TensorRT-LLM

souvikDevloper · 2026-07-03T14:52:25Z

Description

On SM121 devices (GB10 / DGX Spark), the FP8 linear cuda_scaled_mm fast path for small batch sizes (m <= 8) is silently disabled, because the two enable_cuda_core allowlists are hard-coded to SM89 and SM120 only. As a result, decode-time GEMMs on SM121 always fall back to the slower generic cublas_scaled_mm path, with no warning or log message.

The comment at the first site ("enable cuda core for sm89 and sm120") suggests SM121 was left out unintentionally rather than deliberately excluded — SM121 supports the same fast path as SM120.

Changes

1. tensorrt_llm/_torch/modules/linear.py — used by the PyTorch backend Linear module:

# before
# enable cuda core for sm89 and sm120
self.enable_cuda_core = (capability[0] == 8 and capability[1] == 9) \
    or (capability[0] == 12 and capability[1] == 0)

# after
# enable cuda core for sm89, sm120 and sm121
self.enable_cuda_core = (capability[0] == 8 and capability[1] == 9) \
    or (capability[0] == 12 and capability[1] == 0) \
    or (capability[0] == 12 and capability[1] == 1)

2. tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py — same gate in the AutoDeploy FP8 prequant linear op:

# before
enable_cuda_core = capability == (8, 9) or capability == (12, 0)

# after
enable_cuda_core = capability in ((8, 9), (12, 0), (12, 1))

With enable_cuda_core now true on SM121, batches with m <= 8 take the optimized torch.ops.trtllm.cuda_scaled_mm path instead of falling through to cublas_scaled_mm, matching SM120 behavior.
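The effect of the gate can be sketched in plain Python. This is a simplified model, not the code in linear.py or quant.py: the function name `select_fp8_gemm_path` and the two constants are hypothetical, and the real ops call `torch.ops.trtllm.cuda_scaled_mm` or `cublas_scaled_mm` rather than returning a string. Only the allowlist and the m <= 8 threshold are taken from the description above; the real code may apply further conditions.

```python
from typing import Tuple

# Allowlist after this PR: SM89, SM120 and SM121 (from the PR description).
CUDA_CORE_CAPABILITIES = ((8, 9), (12, 0), (12, 1))
# Small-batch threshold quoted in the PR description (m <= 8).
CUDA_CORE_MAX_M = 8


def select_fp8_gemm_path(capability: Tuple[int, int], m: int) -> str:
    """Return the name of the GEMM path this simplified model would pick."""
    enable_cuda_core = tuple(capability) in CUDA_CORE_CAPABILITIES
    if enable_cuda_core and m <= CUDA_CORE_MAX_M:
        return "cuda_scaled_mm"
    return "cublas_scaled_mm"


if __name__ == "__main__":
    # Show the chosen path for allowlisted and non-allowlisted devices.
    for cap in [(8, 9), (12, 0), (12, 1), (9, 0)]:
        for m in (1, 8, 9):
            print(cap, m, select_fp8_gemm_path(cap, m))
```

In this model, an SM121 device with m = 8 selects `cuda_scaled_mm`, while m = 9 or a device outside the allowlist selects `cublas_scaled_mm`.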

Test Coverage

No new tests added: the change only extends an existing device-capability allowlist, and the affected path is already exercised by the existing FP8 linear unit tests on supported hardware. Verifying the SM121 branch end-to-end requires GB10 / DGX Spark hardware.
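Although the SM121 branch cannot be run end-to-end without the hardware, the allowlist decision itself could be checked on any machine by substituting the reported capability. The sketch below is an assumption about how such a check might look, not a test from the repository: `cuda_core_enabled` is a hypothetical helper that takes the capability query as a callable, standing in for `torch.cuda.get_device_capability`.

```python
from typing import Callable, Tuple

Capability = Tuple[int, int]


def cuda_core_enabled(get_capability: Callable[[], Capability]) -> bool:
    """Hypothetical helper mirroring the post-PR allowlist check."""
    return tuple(get_capability()) in ((8, 9), (12, 0), (12, 1))


def check_allowlist() -> None:
    # Capabilities expected to enable the fast path after this PR.
    for cap in [(8, 9), (12, 0), (12, 1)]:
        assert cuda_core_enabled(lambda cap=cap: cap), cap
    # A few capabilities that are not on the allowlist.
    for cap in [(8, 0), (8, 6), (9, 0), (10, 0)]:
        assert not cuda_core_enabled(lambda cap=cap: cap), cap


if __name__ == "__main__":
    check_allowlist()
    print("allowlist checks passed")
```

A check like this only covers the gating logic. It says nothing about whether the kernel itself runs correctly on SM121, which still needs GB10 / DGX Spark hardware.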

PR Checklist

PR title follows the required format: [None][fix] Enable cuda_scaled_mm fast path for FP8 linear on SM121
PR description explains what and why
Commit is DCO signed-off
Test coverage (existing tests; SM121-specific verification requires GB10 hardware)

Summary by CodeRabbit

Bug Fixes
- Enabled the CUDA-core FP8 fast path on SM121 (compute capability 12.1), in addition to SM89 and SM120.
- Both the PyTorch backend Linear module and the AutoDeploy FP8 prequant linear op now select the optimized path automatically on SM121.

The FP8 linear cuda_scaled_mm fast path for small batch sizes (m <= 8) was gated by hard-coded allowlists covering only SM89 and SM120, silently falling back to the slower cublas path on SM121 (GB10 / DGX Spark) even though the hardware supports it. Add SM121 to both allowlists.

Fixes NVIDIA#15673

Signed-off-by: souvikDevloper <gshsouvik01@gmail.com>

coderabbitai · 2026-07-03T14:54:39Z

📝 Walkthrough

The compute-capability allowlist controlling the CUDA core fast path (enable_cuda_core) is extended in both linear.py and quant.py to include SM121 (12, 1), in addition to the existing SM89 (8, 9) and SM120 (12, 0) capabilities.

Changes

SM121 Support in cuda_scaled_mm Fast Path

Layer / File(s)	Summary
Extend CUDA core capability allowlist `tensorrt_llm/_torch/modules/linear.py`, `tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py`	`enable_cuda_core` checks in both `Linear.__init__` and `_trtllm_fp8_prequant_linear_core` now also evaluate true for compute capability `(12, 1)`, alongside `(8, 9)` and `(12, 0)`.

Estimated code review effort: 1 (Trivial) | ~3 minutes

Related issues: #15673

Suggested labels: bug, low-risk

Suggested reviewers: Tracin

🐰 A whisker-twitch, a GPU cheer,
SM121 joins the frontier,
One-two-one, no longer left behind,
Cuda cores now fully aligned!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly names the FP8 cuda_scaled_mm fast-path fix for SM121 and matches the main code change.
Description check	✅ Passed	The description explains the bug, the two code changes, and test coverage, and it follows the required template closely.
Linked Issues check	✅ Passed	The PR updates both allowlists to include SM121, satisfying issue `#15673`'s requirement to enable the low-batch fast path.
Out of Scope Changes check	✅ Passed	The change set stays narrowly focused on the SM121 allowlist fix and introduces no obvious unrelated code changes.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

coderabbitai

🧹 Nitpick comments (1)

tensorrt_llm/_torch/modules/linear.py (1)

3272-3279: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Correctly extends allowlist to SM121.

The added (capability[0] == 12 and capability[1] == 1) clause correctly enables the cuda_scaled_mm fast path for SM121, matching the PR objective and the parallel change in quant.py.

For consistency with quant.py's more concise capability in ((8, 9), (12, 0), (12, 1)) pattern, consider simplifying this chained boolean expression, though this is purely stylistic.

♻️ Optional simplification for consistency with quant.py

         self.enable_cuda_core = False
         if torch.cuda.is_available():
             capability = torch.cuda.get_device_capability(
                 torch.device('cuda:0'))
-            # enable cuda core for sm89, sm120 and sm121
-            self.enable_cuda_core = (capability[0] == 8 and capability[1] == 9) \
-                or (capability[0] == 12 and capability[1] == 0) \
-                or (capability[0] == 12 and capability[1] == 1)
+            # enable cuda core for sm89, sm120 and sm121
+            self.enable_cuda_core = capability in ((8, 9), (12, 0), (12, 1))
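That the two spellings agree can be checked exhaustively over a small grid of (major, minor) pairs. The sketch below assumes `capability` is a plain `(major, minor)` tuple of ints, which is what `torch.cuda.get_device_capability` returns; the function names `chained` and `membership` are hypothetical.

```python
from itertools import product


def chained(capability):
    # Chained form used in linear.py before the suggested simplification.
    return (capability[0] == 8 and capability[1] == 9) \
        or (capability[0] == 12 and capability[1] == 0) \
        or (capability[0] == 12 and capability[1] == 1)


def membership(capability):
    # Membership form used in quant.py and suggested for linear.py.
    return capability in ((8, 9), (12, 0), (12, 1))


# Compare both forms on every (major, minor) pair in a small grid.
mismatches = [cap for cap in product(range(16), range(10))
              if chained(cap) != membership(cap)]
print(mismatches)
```

One difference worth keeping in mind: the membership form relies on tuple equality, so a list such as `[12, 1]` would not match, whereas the indexed form would accept it.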

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/linear.py` around lines 3272 - 3279, The SM121
allowlist in the cuda core gating logic is already correct, but the boolean
chain in the `enable_cuda_core` assignment inside the linear module should be
simplified for consistency with `quant.py`. Update the capability check in
`linear.py`’s CUDA availability block to use the same concise membership-style
pattern as `quant.py`, keeping the allowlist for `(8, 9)`, `(12, 0)`, and `(12,
1)` while preserving the existing `cuda_scaled_mm` fast-path behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7847ad05-ea4f-4f26-bef8-b7f46414632c

📥 Commits

Reviewing files that changed from the base of the PR and between edf63e8 and 581fb47.

📒 Files selected for processing (2)

tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py
tensorrt_llm/_torch/modules/linear.py

Simplify the chained boolean in Linear.enable_cuda_core to the same 'capability in (...)' pattern used in the auto_deploy quant op, per review feedback. Signed-off-by: souvikDevloper <gshsouvik01@gmail.com>

souvikDevloper · 2026-07-03T18:35:09Z

@hartsock, @mojombo hey guys, please take a look at this.

souvikDevloper requested review from a team as code owners July 3, 2026 14:52

souvikDevloper requested review from bmarimuthu-nv and mikeiovine July 3, 2026 14:52

github-actions Bot assigned souvikDevloper Jul 3, 2026

coderabbitai Bot reviewed Jul 3, 2026

Souvikalp and others added 2 commits July 3, 2026 20:30

[None][fix] Use membership check for cuda core capability allowlist

b2beab9

Merge branch 'main' into fix/sm121-cuda-scaled-mm

ba74883

souvikDevloper added 2 commits July 4, 2026 07:41

Merge branch 'main' into fix/sm121-cuda-scaled-mm

2a7cab2

Merge branch 'main' into fix/sm121-cuda-scaled-mm

2a21185

[None][fix] Enable cuda_scaled_mm fast path for FP8 linear on SM121 #15928
souvikDevloper wants to merge 5 commits into NVIDIA:main from souvikDevloper:fix/sm121-cuda-scaled-mm
