[None][fix] Enable cuda_scaled_mm fast path for FP8 linear on SM121 #15928

Open
souvikDevloper wants to merge 5 commits into NVIDIA:main from souvikDevloper:fix/sm121-cuda-scaled-mm


@souvikDevloper souvikDevloper commented Jul 3, 2026

Description

Fixes #15673

On SM121 devices (GB10 / DGX Spark), the FP8 linear cuda_scaled_mm fast path for small batch sizes (m <= 8) is silently disabled, because the two enable_cuda_core allowlists are hard-coded to SM89 and SM120 only. As a result, decode-time GEMMs on SM121 always fall back to the slower generic cublas_scaled_mm path, with no warning or log message.

The comment at the first site ("enable cuda core for sm89 and sm120") suggests SM121 was left out unintentionally rather than deliberately excluded — SM121 supports the same fast path as SM120.

Changes

1. tensorrt_llm/_torch/modules/linear.py — used by the PyTorch backend Linear module:

# before
# enable cuda core for sm89 and sm120
self.enable_cuda_core = (capability[0] == 8 and capability[1] == 9) \
    or (capability[0] == 12 and capability[1] == 0)

# after
# enable cuda core for sm89, sm120 and sm121
self.enable_cuda_core = (capability[0] == 8 and capability[1] == 9) \
    or (capability[0] == 12 and capability[1] == 0) \
    or (capability[0] == 12 and capability[1] == 1)

2. tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py — same gate in the AutoDeploy FP8 prequant linear op:

# before
enable_cuda_core = capability == (8, 9) or capability == (12, 0)

# after
enable_cuda_core = capability in ((8, 9), (12, 0), (12, 1))

With enable_cuda_core now true on SM121, batches with m <= 8 take the optimized torch.ops.trtllm.cuda_scaled_mm path instead of falling through to cublas_scaled_mm, matching SM120 behavior.
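The effect of the gate can be illustrated with a minimal sketch. The helper name select_scaled_mm_path, the constant names, and the returned strings are hypothetical and exist only for illustration; the allowlist and the m <= 8 threshold are taken from the description above, and the real dispatch in TensorRT-LLM may check further conditions that are not modelled here.

```python
from __future__ import annotations

# Capability allowlist as described in this PR (SM89, SM120, SM121).
CUDA_CORE_CAPABILITIES = ((8, 9), (12, 0), (12, 1))

# Small-batch threshold quoted in the PR description (m <= 8).
SMALL_M_THRESHOLD = 8


def select_scaled_mm_path(capability: tuple[int, int], m: int) -> str:
    """Return the name of the GEMM path a device/batch combination would take.

    Hypothetical helper: it only mirrors the two conditions named in the
    PR description and is not the actual TensorRT-LLM dispatch code.
    """
    enable_cuda_core = capability in CUDA_CORE_CAPABILITIES
    if enable_cuda_core and m <= SMALL_M_THRESHOLD:
        return "cuda_scaled_mm"
    return "cublas_scaled_mm"


if __name__ == "__main__":
    # Compare the path chosen on SM120, SM121 and a non-allowlisted device.
    for capability in ((12, 0), (12, 1), (9, 0)):
        for m in (1, 8, 16):
            print(capability, m, select_scaled_mm_path(capability, m))
```

Under these assumptions, SM121 with m <= 8 resolves to the same path as SM120, while larger batches and non-allowlisted devices keep using the generic path.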

Test Coverage

No new tests added: the change only extends an existing device-capability allowlist, and the affected path is already exercised by the existing FP8 linear unit tests on supported hardware. Verifying the SM121 branch end-to-end requires GB10 / DGX Spark hardware.
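The allowlist logic on its own could still be checked without GB10 hardware by injecting the device capability instead of querying the GPU. The following is a minimal sketch of that idea, not a test that exists in this PR: cuda_core_enabled and CudaCoreAllowlistTest are hypothetical names, and the helper stands in for the real gate rather than importing it. It does not replace verification on an actual SM121 device.

```python
import unittest

# Allowlist after this PR: SM89, SM120 and SM121.
ALLOWLIST = ((8, 9), (12, 0), (12, 1))


def cuda_core_enabled(get_capability):
    """Hypothetical stand-in for the gate; takes a capability provider.

    In real code the provider would be torch.cuda.get_device_capability;
    here it is injected so the check runs on any machine.
    """
    return tuple(get_capability()) in ALLOWLIST


class CudaCoreAllowlistTest(unittest.TestCase):
    def test_allowlisted_capabilities(self):
        for capability in ((8, 9), (12, 0), (12, 1)):
            with self.subTest(capability=capability):
                self.assertTrue(cuda_core_enabled(lambda: capability))

    def test_other_capabilities_stay_disabled(self):
        for capability in ((8, 0), (9, 0), (10, 0)):
            with self.subTest(capability=capability):
                self.assertFalse(cuda_core_enabled(lambda: capability))
```

Saved as a module, the sketch can be run with python -m unittest.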

PR Checklist

  • PR title follows the required format: [None][fix] Enable cuda_scaled_mm fast path for FP8 linear on SM121
  • PR description explains what and why
  • Commit is DCO signed-off
  • Test coverage (existing tests; SM121-specific verification requires GB10 hardware)

Summary by CodeRabbit

  • Bug Fixes
    • Enabled the CUDA-core fast path on SM121 (compute capability 12.1), in addition to SM89 and SM120.
    • Linear layers and the AutoDeploy FP8 quantized linear op now select the optimized path automatically on SM121.

The FP8 linear cuda_scaled_mm fast path for small batch sizes (m <= 8)
was gated by hard-coded allowlists covering only SM89 and SM120,
silently falling back to the slower cublas path on SM121 (GB10 /
DGX Spark) even though the hardware supports it. Add SM121 to both
allowlists.

Fixes NVIDIA#15673

Signed-off-by: souvikDevloper <gshsouvik01@gmail.com>

coderabbitai Bot commented Jul 3, 2026

Walkthrough

The compute-capability allowlist controlling the CUDA core fast path (enable_cuda_core) is extended in both linear.py and quant.py to include SM121 (12, 1), in addition to the existing SM89 (8, 9) and SM120 (12, 0) capabilities.

Changes

SM121 Support in cuda_scaled_mm Fast Path

| Layer / File(s) | Summary |
| --- | --- |
| Extend CUDA core capability allowlist: tensorrt_llm/_torch/modules/linear.py, tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py | enable_cuda_core checks in both Linear.__init__ and _trtllm_fp8_prequant_linear_core now also evaluate true for compute capability (12, 1), alongside (8, 9) and (12, 0). |

Estimated code review effort: 1 (Trivial) | ~3 minutes

Related issues: #15673

Suggested labels: bug, low-risk

Suggested reviewers: Tracin

🐰 A whisker-twitch, a GPU cheer,
SM121 joins the frontier,
One-two-one, no longer left behind,
Cuda cores now fully aligned!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly names the FP8 cuda_scaled_mm fast-path fix for SM121 and matches the main code change. |
| Description check | ✅ Passed | The description explains the bug, the two code changes, and test coverage, and it follows the required template closely. |
| Linked Issues check | ✅ Passed | The PR updates both allowlists to include SM121, satisfying issue #15673's requirement to enable the low-batch fast path. |
| Out of Scope Changes check | ✅ Passed | The change set stays narrowly focused on the SM121 allowlist fix and introduces no obvious unrelated code changes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/linear.py (1)

3272-3279: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Correctly extends allowlist to SM121.

The added (capability[0] == 12 and capability[1] == 1) clause correctly enables the cuda_scaled_mm fast path for SM121, matching the PR objective and the parallel change in quant.py.

For consistency with quant.py's more concise capability in ((8, 9), (12, 0), (12, 1)) pattern, consider simplifying this chained boolean expression, though this is purely stylistic.
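That the two forms agree can be checked with a small sketch. The function names chained and membership are hypothetical, and the scanned range of major and minor versions is an arbitrary assumption chosen to cover current devices. One caveat: the membership form relies on the capability being a tuple, which is what torch.cuda.get_device_capability returns; a list such as [12, 1] would not match.

```python
from itertools import product


def chained(capability):
    """Original chained-boolean form of the allowlist check."""
    return (capability[0] == 8 and capability[1] == 9) \
        or (capability[0] == 12 and capability[1] == 0) \
        or (capability[0] == 12 and capability[1] == 1)


def membership(capability):
    """Concise membership form used in quant.py."""
    return capability in ((8, 9), (12, 0), (12, 1))


# Compare both forms over a grid of (major, minor) pairs; the bounds are
# arbitrary and only need to include the allowlisted capabilities.
mismatches = [
    capability
    for capability in product(range(16), range(10))
    if chained(capability) != membership(capability)
]

if __name__ == "__main__":
    print("mismatches:", mismatches)
```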

♻️ Optional simplification for consistency with quant.py
         self.enable_cuda_core = False
         if torch.cuda.is_available():
             capability = torch.cuda.get_device_capability(
                 torch.device('cuda:0'))
-            # enable cuda core for sm89, sm120 and sm121
-            self.enable_cuda_core = (capability[0] == 8 and capability[1] == 9) \
-                or (capability[0] == 12 and capability[1] == 0) \
-                or (capability[0] == 12 and capability[1] == 1)
+            # enable cuda core for sm89, sm120 and sm121
+            self.enable_cuda_core = capability in ((8, 9), (12, 0), (12, 1))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/linear.py` around lines 3272 - 3279, The SM121
allowlist in the cuda core gating logic is already correct, but the boolean
chain in the `enable_cuda_core` assignment inside the linear module should be
simplified for consistency with `quant.py`. Update the capability check in
`linear.py`’s CUDA availability block to use the same concise membership-style
pattern as `quant.py`, keeping the allowlist for `(8, 9)`, `(12, 0)`, and `(12,
1)` while preserving the existing `cuda_scaled_mm` fast-path behavior.
📥 Commits

Reviewing files that changed from the base of the PR and between edf63e8 and 581fb47.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py
  • tensorrt_llm/_torch/modules/linear.py

Souvikalp and others added 2 commits July 3, 2026 20:30
Simplify the chained boolean in Linear.enable_cuda_core to the same 'capability in (...)' pattern used in the auto_deploy quant op, per review feedback.

Signed-off-by: souvikDevloper <gshsouvik01@gmail.com>
@souvikDevloper

@hartsock, @mojombo, could you please take a look at this?

Development

Successfully merging this pull request may close these issues.

[Bug]: FP8 linear cuda_scaled_mm fast path silently disabled on SM121 (DGX Spark GB10)

2 participants