System Info
- GPU: Blackwell SM121 / DGX Spark observed for full-model repro; SM120 target reproduced with the same FlashInfer B12x micro-kernel target override
- Model: Qwen3 MoE NVFP4
- Backend: PyTorch, MoE backend CUTEDSL, FlashInfer B12x NVFP4 MoE decode path
- Relevant package state:
nvidia-cutlass-dsl-libs-cu13==4.5.0 installed, but active cutlass.base_dsl.version_info.CUDA_VERSION reported CUDA 12.9 because the active CuTe DSL native payload came from the base CUDA 12 payload
Who can help?
@Tracin @Fridah-nv
Information
Tasks
Reproduction
Use an NVFP4 Qwen3 MoE model on Blackwell SM121 and force the PyTorch MoE backend to CUTEDSL so TRT-LLM selects the FlashInfer/CuTe DSL B12x fused MoE path. The selector promotes to CuteDslB12xFusedMoE, then FlashInfer JIT enters the Blackwell SM12x MoE micro-kernel path and instantiates the NVFP4/MXF4 MMA atom.
The failing environment has nvidia-cutlass-dsl-libs-cu13==4.5.0 installed, but the active CuTe DSL runtime reports CUDA 12.9. In that state, the generated PTX contains internal _mma.block_scale... instructions rather than the CUDA 13 public mma.sync.aligned...kind::mxf4nvf4 spelling. ptxas rejects the generated PTX during JIT compilation.
A minimal FlashInfer B12x micro-kernel repro shows the same ptxas failure for both SM121 and an SM120 target override when the active CuTe DSL payload is CUDA 12.x. Reinstalling the CUDA 13 CuTe DSL native payload so CUDA_VERSION reports 13.1 makes the SM121 B12x path compile and run.
Expected behavior
TRT-LLM should not select the B12x CuTe DSL/FlashInfer NVFP4 MoE path when the active CuTe DSL native payload is CUDA 12.x or unavailable, because that combination JIT-compiles invalid PTX for the B12x NVFP4 MMA atom.
When the active CuTe DSL native payload reports CUDA 13 or newer, TRT-LLM should allow B12x on supported SM12x targets.
Actual behavior
With CUDA 12.x active in CuTe DSL, initialization fails during FlashInfer/CuTe DSL JIT with ptxas errors similar to:
ptxas application ptx input, line ...; error : Unexpected instruction types specified for `_mma`
DSLRuntimeError: ICE
RuntimeError: Executor worker died during initialization
This happens after TRT-LLM has already selected CuteDslB12xFusedMoE, so the process fails during model initialization instead of falling back to a safe MoE backend.
Root cause hypothesis
CUDA 12.x CuTe DSL lowers the FlashInfer B12x NVFP4/MXF4 MMA atom to an internal _mma.block_scale... PTX form that the active ptxas rejects. CUDA 13.x CuTe DSL lowers the same operation to the public kind::mxf4nvf4 opcode form, which compiles successfully.
The immediate TRT-LLM fix is a runtime capability gate: for B12x on supported SM12x targets, require the active CuTe DSL native payload to report CUDA 13 or newer; otherwise fall back to CUTLASS with an actionable warning.
Validation
- Reinstalling the CUDA 13 CuTe DSL native payload made
CUDA_VERSION report CUDA 13.1.
- Full-model Qwen3 NVFP4 on SM121 with
CuteDslB12xFusedMoE selected initialized successfully, generated coherent output, and improved throughput versus the CUTLASS MoE path.
- MTP was separately verified active on the CUDA 13 B12x path via iteration stats and instantiated model modules; no MTP code change is required for the B12x gate itself.
Related PR
Before submitting a new issue...
System Info
nvidia-cutlass-dsl-libs-cu13==4.5.0installed, but activecutlass.base_dsl.version_info.CUDA_VERSIONreported CUDA 12.9 because the active CuTe DSL native payload came from the base CUDA 12 payloadWho can help?
@Tracin @Fridah-nv
Information
Tasks
Reproduction
Use an NVFP4 Qwen3 MoE model on Blackwell SM121 and force the PyTorch MoE backend to CUTEDSL so TRT-LLM selects the FlashInfer/CuTe DSL B12x fused MoE path. The selector promotes to
CuteDslB12xFusedMoE, then FlashInfer JIT enters the Blackwell SM12x MoE micro-kernel path and instantiates the NVFP4/MXF4 MMA atom.The failing environment has
nvidia-cutlass-dsl-libs-cu13==4.5.0installed, but the active CuTe DSL runtime reports CUDA 12.9. In that state, the generated PTX contains internal_mma.block_scale...instructions rather than the CUDA 13 publicmma.sync.aligned...kind::mxf4nvf4spelling. ptxas rejects the generated PTX during JIT compilation.A minimal FlashInfer B12x micro-kernel repro shows the same ptxas failure for both SM121 and an SM120 target override when the active CuTe DSL payload is CUDA 12.x. Reinstalling the CUDA 13 CuTe DSL native payload so
CUDA_VERSIONreports 13.1 makes the SM121 B12x path compile and run.Expected behavior
TRT-LLM should not select the B12x CuTe DSL/FlashInfer NVFP4 MoE path when the active CuTe DSL native payload is CUDA 12.x or unavailable, because that combination JIT-compiles invalid PTX for the B12x NVFP4 MMA atom.
When the active CuTe DSL native payload reports CUDA 13 or newer, TRT-LLM should allow B12x on supported SM12x targets.
Actual behavior
With CUDA 12.x active in CuTe DSL, initialization fails during FlashInfer/CuTe DSL JIT with ptxas errors similar to:
This happens after TRT-LLM has already selected
CuteDslB12xFusedMoE, so the process fails during model initialization instead of falling back to a safe MoE backend.Root cause hypothesis
CUDA 12.x CuTe DSL lowers the FlashInfer B12x NVFP4/MXF4 MMA atom to an internal
_mma.block_scale...PTX form that the active ptxas rejects. CUDA 13.x CuTe DSL lowers the same operation to the publickind::mxf4nvf4opcode form, which compiles successfully.The immediate TRT-LLM fix is a runtime capability gate: for B12x on supported SM12x targets, require the active CuTe DSL native payload to report CUDA 13 or newer; otherwise fall back to CUTLASS with an actionable warning.
Validation
CUDA_VERSIONreport CUDA 13.1.CuteDslB12xFusedMoEselected initialized successfully, generated coherent output, and improved throughput versus the CUTLASS MoE path.Related PR
Before submitting a new issue...
_mma, andblock_scaleand did not find this exact failure mode.