Skip to content

[Bug]: B12x NVFP4 MoE JIT fails on SM120/SM121 with CUDA 12 CuTe DSL payload #15853

Description

@mihai-chiorean

System Info

  • GPU: Blackwell SM121 / DGX Spark observed for full-model repro; SM120 target reproduced with the same FlashInfer B12x micro-kernel target override
  • Model: Qwen3 MoE NVFP4
  • Backend: PyTorch, MoE backend CUTEDSL, FlashInfer B12x NVFP4 MoE decode path
  • Relevant package state: nvidia-cutlass-dsl-libs-cu13==4.5.0 installed, but active cutlass.base_dsl.version_info.CUDA_VERSION reported CUDA 12.9 because the active CuTe DSL native payload came from the base CUDA 12 payload

Who can help?

@Tracin @Fridah-nv

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset

Reproduction

Use an NVFP4 Qwen3 MoE model on Blackwell SM121 and force the PyTorch MoE backend to CUTEDSL so TRT-LLM selects the FlashInfer/CuTe DSL B12x fused MoE path. The selector promotes to CuteDslB12xFusedMoE, then FlashInfer JIT enters the Blackwell SM12x MoE micro-kernel path and instantiates the NVFP4/MXF4 MMA atom.

The failing environment has nvidia-cutlass-dsl-libs-cu13==4.5.0 installed, but the active CuTe DSL runtime reports CUDA 12.9. In that state, the generated PTX contains internal _mma.block_scale... instructions rather than the CUDA 13 public mma.sync.aligned...kind::mxf4nvf4 spelling. ptxas rejects the generated PTX during JIT compilation.

A minimal FlashInfer B12x micro-kernel repro shows the same ptxas failure for both SM121 and an SM120 target override when the active CuTe DSL payload is CUDA 12.x. Reinstalling the CUDA 13 CuTe DSL native payload so CUDA_VERSION reports 13.1 makes the SM121 B12x path compile and run.

Expected behavior

TRT-LLM should not select the B12x CuTe DSL/FlashInfer NVFP4 MoE path when the active CuTe DSL native payload is CUDA 12.x or unavailable, because that combination JIT-compiles invalid PTX for the B12x NVFP4 MMA atom.

When the active CuTe DSL native payload reports CUDA 13 or newer, TRT-LLM should allow B12x on supported SM12x targets.

Actual behavior

With CUDA 12.x active in CuTe DSL, initialization fails during FlashInfer/CuTe DSL JIT with ptxas errors similar to:

ptxas application ptx input, line ...; error : Unexpected instruction types specified for `_mma`
DSLRuntimeError: ICE
RuntimeError: Executor worker died during initialization

This happens after TRT-LLM has already selected CuteDslB12xFusedMoE, so the process fails during model initialization instead of falling back to a safe MoE backend.

Root cause hypothesis

CUDA 12.x CuTe DSL lowers the FlashInfer B12x NVFP4/MXF4 MMA atom to an internal _mma.block_scale... PTX form that the active ptxas rejects. CUDA 13.x CuTe DSL lowers the same operation to the public kind::mxf4nvf4 opcode form, which compiles successfully.

The immediate TRT-LLM fix is a runtime capability gate: for B12x on supported SM12x targets, require the active CuTe DSL native payload to report CUDA 13 or newer; otherwise fall back to CUTLASS with an actionable warning.

Validation

  • Reinstalling the CUDA 13 CuTe DSL native payload made CUDA_VERSION report CUDA 13.1.
  • Full-model Qwen3 NVFP4 on SM121 with CuteDslB12xFusedMoE selected initialized successfully, generated coherent output, and improved throughput versus the CUTLASS MoE path.
  • MTP was separately verified active on the CUDA 13 B12x path via iteration stats and instantiated model modules; no MTP code change is required for the B12x gate itself.

Related PR

Before submitting a new issue...

  • I searched existing issues for B12x, SM121, DGX Spark, CuTe DSL, FlashInfer, _mma, and block_scale and did not find this exact failure mode.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Customized kernels<NV>Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.Pytorch<NV>Pytorch backend related issues

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions