
Conversation

@b8zhong b8zhong commented Jan 8, 2026

📌 Description

We recently wanted to use the flashinfer_trtllm backend for GLM models in SGLang.

flashinfer_cutlass

python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 1319
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [02:11<00:00,  9.99it/s]
Accuracy: 0.944
Invalid: 0.000
Latency: 132.481 s
Output throughput: 1076.758 token/s

flashinfer_trtllm

python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 1319
Downloading from https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl to /tmp/test.jsonl
/tmp/test.jsonl: 732kB [00:00, 34.9MB/s]                                                                                                                                                                                                                              
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:30<00:00, 14.51it/s]
Accuracy: 0.948
Invalid: 0.000
Latency: 91.291 s
Output throughput: 1535.949 token/s

Fix #2168

Thanks @divchenko for providing this!

pytest tests/moe/test_trtllm_gen_fused_moe.py -k GLM4_MoE -v
======================================================================================================================== test session starts =========================================================================================================================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /sgl-workspace/flashinfer
configfile: pytest.ini
plugins: anyio-4.12.0, typeguard-4.4.4
collected 3710 items / 3170 deselected / 540 selected        
...
tests/moe/test_trtllm_gen_fused_moe.py::test_deepseekv3_routing[GeGlu-Shuffled_BlockMajorK-GLM4_MoE-Bf16xBf16-512-1024-8] SKIPPED (Incompatible: BF16Moe + 1 + 2 + 8)                                                                                          [ 99%]
tests/moe/test_trtllm_gen_fused_moe.py::test_deepseekv3_routing[GeGlu-Shuffled_BlockMajorK-GLM4_MoE-Bf16xBf16-512-1024-768] SKIPPED (Incompatible: BF16Moe + 1 + 2 + 768)                                                                                      [ 99%]
tests/moe/test_trtllm_gen_fused_moe.py::test_deepseekv3_routing[GeGlu-Shuffled_BlockMajorK-GLM4_MoE-Bf16xBf16-512-1024-3072] SKIPPED (Incompatible: BF16Moe + 1 + 2 + 3072)                                                                                    [ 99%]
tests/moe/test_trtllm_gen_fused_moe.py::test_deepseekv3_routing[GeGlu-Shuffled_BlockMajorK-GLM4_MoE-Bf16xBf16-384-1024-8] SKIPPED (Incompatible: BF16Moe + 1 + 2 + 8)                                                                                          [ 99%]
tests/moe/test_trtllm_gen_fused_moe.py::test_deepseekv3_routing[GeGlu-Shuffled_BlockMajorK-GLM4_MoE-Bf16xBf16-384-1024-768] SKIPPED (Incompatible: BF16Moe + 1 + 2 + 768)                                                                                      [ 99%]
tests/moe/test_trtllm_gen_fused_moe.py::test_deepseekv3_routing[GeGlu-Shuffled_BlockMajorK-GLM4_MoE-Bf16xBf16-384-1024-3072] SKIPPED (Incompatible: BF16Moe + 1 + 2 + 3072)                                                                                    [100%]

==================================================================================================== 42 passed, 498 skipped, 3170 deselected in 200.39s (0:03:20) ====================================================================================================

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of invalid routing scores: invalid values now behave as negative infinity, ensuring correct ordering and compatibility with GLM-style routing and biased routing.
  • Tests

    • Added a DeepSeekV3 routing test case covering larger expert configurations (160 experts) with multiple MoE implementation variants and intermediate sizes.



coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

Replaced the DeepSeek routing kernel's invalid-score sentinel with negative infinity semantics and updated comments; added a parameterized GLM4_MoE DeepSeekV3 test case exercising routed_scaling and routing_bias.

Changes

  • Kernel routing logic: csrc/trtllm_fused_moe_routing_deepseek.cu
    Replaced the invalid-score sentinel value with negative infinity (via -float(INFINITY)) and updated comments to reflect negative-infinity semantics for GLM-style routing. No public API changes.
  • Test additions: tests/moe/test_trtllm_gen_fused_moe.py
    Appended a new parameterized test case "GLM4_MoE" using DeepSeekV3 routing (160 experts, top_k=8, padding=8, n_groups=1, top_k_groups=1, routed_scaling=2.5, has_routing_bias=true) with compatible MoE impls and intermediate sizes; no control-flow changes.
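As a rough sketch of that configuration (field names are illustrative only, not the repository's actual fixture API):

```python
# Illustrative reconstruction of the GLM4_MoE test configuration described
# above; the key names are hypothetical, chosen for readability.
glm4_moe_case = {
    "name": "GLM4_MoE",
    "num_experts": 160,
    "top_k": 8,
    "padding": 8,
    "n_groups": 1,
    "top_k_groups": 1,
    "routed_scaling": 2.5,
    "has_routing_bias": True,
}

# Basic sanity checks on the configuration values.
assert glm4_moe_case["top_k"] <= glm4_moe_case["num_experts"]
assert glm4_moe_case["n_groups"] == glm4_moe_case["top_k_groups"] == 1
```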

Sequence Diagram(s)

(omitted — change is a kernel sentinel fix and test addition; no multi-component new control flow to diagram)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested reviewers

  • djmmoss
  • cyx-6
  • jimmyzho
  • jiahanc

Poem

🐰 I hopped through kernels, nimble and spry,
I swapped -1 for -∞ with a gleeful sigh.
Bias went negative, now scores sort right,
Tests hop in place to guard day and night.
Cheers — a rabbit fixes kernels in flight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title 'tiny support glm routing' accurately describes the main change: adding GLM routing support to the TRT LLM MoE implementation.
  • Description check ✅ Passed: The PR description includes motivation with benchmarks, references the related issue #2168, and provides test results demonstrating the changes work correctly.
  • Linked Issues check ✅ Passed: The code changes directly address issue #2168 by replacing the invalid score sentinel with -float(INFINITY) to support negative bias values in GLM routing.
  • Out of Scope Changes check ✅ Passed: All changes are scoped to fixing GLM routing support: modifications to the invalid-score handling in the routing kernel and a new test case for GLM4_MoE routing.



@gemini-code-assist

Summary of Changes

Hello @b8zhong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for GLM (General Language Model) routing within the flashinfer_trtllm framework. The primary motivation is to enable the use of flashinfer_trtllm for GLM in SGLang. This is achieved by a crucial adjustment in the CUDA kernel to correctly manage negative biases inherent in GLM-style routing, ensuring robust score handling. Additionally, a dedicated test configuration for GLM4_MoE has been added to validate these changes.

Highlights

  • GLM Routing Support: Implemented support for GLM-style routing by adjusting the invalidScoreFloat in the DeepSeek MoE routing kernel to correctly handle negative biases, ensuring invalid scores are always considered the smallest.
  • GLM4_MoE Test Configuration: Added a new test parameter set for "GLM4_MoE" in the MoE routing tests, defining specific configurations for experts, top-k, padding, and routing bias to validate GLM4_MoE functionality.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for GLM-style routing within the flashinfer_trtllm framework. The core change involves adjusting the invalidScoreFloat constant in the DeepSeek MoE routing kernel to a very large negative value. This ensures correct handling of potentially negative biases in GLM routing and also improves the robustness of other routing methods that might use negative biases. A new test case for GLM4_MoE has been added to the test suite, which is a great way to validate the new functionality. The changes are well-contained, correct, and improve the flexibility of the MoE routing kernel. The PR looks good to merge.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)

1-1: Critical: Fix clang-format violations before merge.

The pipeline failure indicates formatting issues that must be resolved:

clang-format formatting check failed. The hook modified files (diff shown) and CI exited with code 1.

Run the following to fix:

# Re-run pre-commit locally
pre-commit run --all-files

# Or run clang-format directly
clang-format -i csrc/trtllm_fused_moe_routing_deepseek.cu
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd2b033 and d937511.

📒 Files selected for processing (2)
  • csrc/trtllm_fused_moe_routing_deepseek.cu
  • tests/moe/test_trtllm_gen_fused_moe.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Test implementations should use flashinfer.utils functions (get_compute_capability, is_sm90a_supported, is_sm100a_supported, etc.) to skip tests on unsupported GPU architectures
For testing with mpirun on multi-GPU systems, use the pattern: mpirun -np <num_gpus> pytest tests/path/to/test.py::test_function
Avoid OOM (out-of-memory) errors in tests by using appropriate problem sizes - tests/conftest.py provides auto-skipping for OOM tests as a safety net but should not be relied upon

Files:

  • tests/moe/test_trtllm_gen_fused_moe.py
csrc/**/*.cu

📄 CodeRabbit inference engine (CLAUDE.md)

Framework bindings and PyTorch tensor handling should be implemented in csrc/ via TVM-FFI, not in include/ headers

Files:

  • csrc/trtllm_fused_moe_routing_deepseek.cu
🧬 Code graph analysis (1)
tests/moe/test_trtllm_gen_fused_moe.py (3)
flashinfer/fused_moe/core.py (1)
  • RoutingMethodType (61-75)
include/flashinfer/trtllm/fused_moe/runner.h (1)
  • RoutingMethodType (37-136)
tests/moe/test_dpsk_fused_moe_fp8.py (1)
  • FP8BlockScaleMoe (597-600)
🪛 GitHub Actions: pre-commit
csrc/trtllm_fused_moe_routing_deepseek.cu

[error] 1-1: clang-format formatting check failed. The hook modified files (diff shown) and CI exited with code 1. Re-run pre-commit locally with 'pre-commit run --all-files' or run clang-format to fix formatting.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (3)
csrc/trtllm_fused_moe_routing_deepseek.cu (2)

103-103: LGTM! Comment accurately reflects the updated logic.

The comment correctly documents that invalidScoreFloat ensures invalid values are always smaller than valid scores, which is essential for correct topk selection with negative bias.


59-61: LGTM! Sentinel change correctly handles negative bias in GLM-style routing.

The change from -1.F to -1e10F properly addresses GLM-style routing, where negative bias values can push scores into the negative range. With sigmoid(logits) + bias and a negative bias (e.g., -2.0), valid scores can themselves be negative, so the old sentinel -1.F no longer guaranteed that invalid scores stay smaller than valid ones. The new value -1e10F ensures correct top-k ordering regardless of bias magnitude.

Note: The standard MOE kernel (csrc/fused_moe/noAuxTcKernels.cu) uses -INFINITY because it assumes non-negative scores, but the DeepSeek variant correctly uses -1e10F to handle negative bias.
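A small standalone sketch (plain Python with illustrative values, not the kernel code) of why the -1.F sentinel breaks down under a negative bias while a negative-infinity sentinel does not:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits for 4 experts; expert 2 is masked out (invalid).
logits = [0.5, -0.3, None, 1.2]
bias = -2.0  # GLM-style correction bias can be strongly negative

def biased_scores(sentinel: float) -> list:
    return [sentinel if l is None else sigmoid(l) + bias for l in logits]

def top2(scores: list) -> list:
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:2]

# Old sentinel: every valid score falls below -1.0, so the invalid
# expert wrongly wins a top-k slot.
assert 2 in top2(biased_scores(-1.0))
# -inf sentinel: the invalid expert can never be selected.
assert 2 not in top2(biased_scores(-math.inf))
```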

tests/moe/test_trtllm_gen_fused_moe.py (1)

2694-2709: LGTM! Test case properly validates GLM routing fix.

The new GLM4_MoE test case correctly exercises the routing fix for negative bias handling:

  • Uses GLM's specific configuration: n_groups=1, top_k_groups=1
  • Enables has_routing_bias=True to test bias handling (including negative values generated by torch.randn)
  • Tests with 160 experts, adding coverage for this expert count
  • Compatible implementations and intermediate sizes are appropriate

This test case directly validates the fix for issue #2168.
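For reference, a minimal Python sketch of the sigmoid-plus-bias top-k routing this test exercises (a simplification of the kernel, assuming n_groups=1 so group selection is trivial; the function name and exact normalization are illustrative):

```python
import math

def route(logits, bias, top_k=8, routed_scaling=2.5):
    """Sketch of sigmoid-then-bias top-k routing with a correction bias.

    Selection uses sigmoid(logit) + bias; the returned weights use the
    unbiased sigmoid scores, normalized and scaled by routed_scaling.
    """
    scores = [1.0 / (1.0 + math.exp(-l)) for l in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    selected = sorted(range(len(logits)), key=biased.__getitem__, reverse=True)[:top_k]
    denom = sum(scores[i] for i in selected)
    return {i: scores[i] / denom * routed_scaling for i in selected}

# Toy example: 16 experts, a uniformly negative bias (as observed in the
# GLM checkpoints quoted later in this thread), top-2 for readability.
logits = [0.1 * i for i in range(16)]
w = route(logits, [-2.0] * 16, top_k=2)
assert set(w) == {14, 15}                  # highest-logit experts win
assert abs(sum(w.values()) - 2.5) < 1e-9   # weights sum to routed_scaling
```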


@yzh119 yzh119 left a comment


also cc @jiahanc and @ChristinaZ for another look.

-static constexpr float invalidScoreFloat = -1.F;
+// note that for invalid scores, we use a very negative value:
+// needed for GLM-style routing where bias can be negative
+static constexpr float invalidScoreFloat = -1e10F;
Collaborator

Is it possible to use negative zero? It's often used as an invalid score in float.

Contributor Author

Here, I think it would not work: signed zero is a valid score value here and just compares equal to 0. (Correct me if I'm wrong.)

model.layers.92.mlp.gate.e_score_correction_bias
  shape: torch.Size([160])
  min: -9.6087
  max: -9.2649
  mean: -9.3417
  ⚠️  Has 160 NEGATIVE values!

model.layers.88.mlp.gate.e_score_correction_bias
  shape: torch.Size([160])
  min: -2.0029
  max: -1.8230
  mean: -1.8640
  ⚠️  Has 160 NEGATIVE values!

model.layers.89.mlp.gate.e_score_correction_bias
  shape: torch.Size([160])
  min: -2.1478
  max: -1.9239
  mean: -1.9730
  ⚠️  Has 160 NEGATIVE values!
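A quick illustration (plain Python, same comparison semantics as IEEE-754 floats in CUDA) of why negative zero cannot serve as the invalid sentinel here:

```python
import math

# Negative zero compares equal to positive zero, so it cannot mark an
# entry as invalid through ordinary comparisons.
assert -0.0 == 0.0

# Worse: with GLM's negative correction bias, valid scores are negative
# (e.g. min -9.61 in the dump above), so every valid expert would rank
# BELOW the "invalid" sentinel in a top-k selection.
valid_score = -1.86  # sigmoid(logit) + bias, with bias around -2
assert valid_score < -0.0

# The sign bit is only visible via copysign-style inspection, not ordering.
assert math.copysign(1.0, -0.0) == -1.0
```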

Contributor

Agree, I think we can also use static constexpr float invalidScoreFloat = float{-INFINITY};

Contributor Author

Done 👍


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)

61-64: Correct fix for GLM negative bias support.

Using negative infinity as the invalid score sentinel ensures it's always smaller than any valid score, regardless of bias magnitude. The comment clearly documents the rationale, which is helpful for future maintainers. Based on learnings, this aligns with the guideline to leave comments explaining special algorithmic choices in performance-critical hot paths.

Nitpick: The float() cast is redundant since INFINITY is already a float constant expression. Consider simplifying:

Optional simplification
-  static constexpr float invalidScoreFloat = -float(INFINITY);
+  static constexpr float invalidScoreFloat = -INFINITY;
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db98598 and 49c7011.

📒 Files selected for processing (1)
  • csrc/trtllm_fused_moe_routing_deepseek.cu
🧰 Additional context used
📓 Path-based instructions (1)
csrc/**/*.cu

📄 CodeRabbit inference engine (CLAUDE.md)

Framework bindings and PyTorch tensor handling should be implemented in csrc/ via TVM-FFI, not in include/ headers

Files:

  • csrc/trtllm_fused_moe_routing_deepseek.cu
🧠 Learnings (2)
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : For performance-critical hot paths, leave comments explaining special algorithmic choices and potential alternatives for future reviewers

Applied to files:

  • csrc/trtllm_fused_moe_routing_deepseek.cu
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.

Applied to files:

  • csrc/trtllm_fused_moe_routing_deepseek.cu
🔇 Additional comments (2)
csrc/trtllm_fused_moe_routing_deepseek.cu (2)

17-18: LGTM!

The <cmath> header is correctly included to provide the INFINITY macro used for the new invalid score sentinel.


105-107: LGTM!

The updated comment accurately reflects the new semantics: with -INFINITY as the sentinel, invalid values are guaranteed to be smaller than any valid scoreBias regardless of the bias sign.


yzh119 commented Jan 13, 2026

/bot run

@flashinfer-bot

GitLab MR !242 has been created, and the CI pipeline #41653617 is currently running. I'll report back once the pipeline job completes.


@yzh119 yzh119 left a comment


LGTM

@yzh119 yzh119 merged commit f0277fd into flashinfer-ai:main Jan 14, 2026
9 checks passed
@b8zhong b8zhong deleted the brayden/add-glm-routing branch January 14, 2026 17:48


Development

Successfully merging this pull request may close these issues.

[BUG] Negative bias is not allowed in DeepSeek routing for TRT LLM MoE
