
Conversation

@b8zhong b8zhong commented Jan 8, 2026

📌 Description

We recently wanted to use the flashinfer_trtllm backend for GLM models in SGLang.

flashinfer_cutlass

python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 1319
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [02:11<00:00,  9.99it/s]
Accuracy: 0.944
Invalid: 0.000
Latency: 132.481 s
Output throughput: 1076.758 token/s

flashinfer_trtllm

python3 benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 1319
Downloading from https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl to /tmp/test.jsonl
/tmp/test.jsonl: 732kB [00:00, 34.9MB/s]                                                                                                                                                                                                                              
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:30<00:00, 14.51it/s]
Accuracy: 0.948
Invalid: 0.000
Latency: 91.291 s
Output throughput: 1535.949 token/s

Fix #2168

Thanks @divchenko for providing this!

pytest tests/moe/test_trtllm_gen_fused_moe.py -k GLM4_MoE -v
======================================================================================================================== test session starts =========================================================================================================================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /sgl-workspace/flashinfer
configfile: pytest.ini
plugins: anyio-4.12.0, typeguard-4.4.4
collected 3710 items / 3170 deselected / 540 selected        
...
tests/moe/test_trtllm_gen_fused_moe.py::test_deepseekv3_routing[GeGlu-Shuffled_BlockMajorK-GLM4_MoE-Bf16xBf16-512-1024-8] SKIPPED (Incompatible: BF16Moe + 1 + 2 + 8)                                                                                          [ 99%]
tests/moe/test_trtllm_gen_fused_moe.py::test_deepseekv3_routing[GeGlu-Shuffled_BlockMajorK-GLM4_MoE-Bf16xBf16-512-1024-768] SKIPPED (Incompatible: BF16Moe + 1 + 2 + 768)                                                                                      [ 99%]
tests/moe/test_trtllm_gen_fused_moe.py::test_deepseekv3_routing[GeGlu-Shuffled_BlockMajorK-GLM4_MoE-Bf16xBf16-512-1024-3072] SKIPPED (Incompatible: BF16Moe + 1 + 2 + 3072)                                                                                    [ 99%]
tests/moe/test_trtllm_gen_fused_moe.py::test_deepseekv3_routing[GeGlu-Shuffled_BlockMajorK-GLM4_MoE-Bf16xBf16-384-1024-8] SKIPPED (Incompatible: BF16Moe + 1 + 2 + 8)                                                                                          [ 99%]
tests/moe/test_trtllm_gen_fused_moe.py::test_deepseekv3_routing[GeGlu-Shuffled_BlockMajorK-GLM4_MoE-Bf16xBf16-384-1024-768] SKIPPED (Incompatible: BF16Moe + 1 + 2 + 768)                                                                                      [ 99%]
tests/moe/test_trtllm_gen_fused_moe.py::test_deepseekv3_routing[GeGlu-Shuffled_BlockMajorK-GLM4_MoE-Bf16xBf16-384-1024-3072] SKIPPED (Incompatible: BF16Moe + 1 + 2 + 3072)                                                                                    [100%]

==================================================================================================== 42 passed, 498 skipped, 3170 deselected in 200.39s (0:03:20) ====================================================================================================

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of invalid routing scores: invalid values now behave as negative infinity, ensuring correct ordering and compatibility with GLM-style routing and biased routing.
  • Tests

    • Added a DeepSeekV3 routing test case covering larger expert configurations (160 experts) with multiple MoE implementation variants and intermediate sizes.



coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

Replaced the DeepSeek routing kernel's invalid-score sentinel with negative infinity semantics and updated comments; added a parameterized GLM4_MoE DeepSeekV3 test case exercising routed_scaling and routing_bias.

Changes

  • Kernel routing logic: csrc/trtllm_fused_moe_routing_deepseek.cu
    Replaced the invalid-score sentinel value with negative infinity (via -float(INFINITY)) and updated comments to reflect negative-infinity semantics for GLM-style routing. No public API changes.
  • Test additions: tests/moe/test_trtllm_gen_fused_moe.py
    Appended a new parameterized test case "GLM4_MoE" using DeepSeekV3 routing (160 experts, top_k=8, padding=8, n_groups=1, top_k_groups=1, routed_scaling=2.5, has_routing_bias=true) with compatible MoE impls and intermediate sizes; no control-flow changes.
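As a rough sketch of that configuration (field names are illustrative only, not the repository's actual fixture API):

```python
# Illustrative reconstruction of the GLM4_MoE test configuration described
# above; the key names are hypothetical, chosen for readability.
glm4_moe_case = {
    "name": "GLM4_MoE",
    "num_experts": 160,
    "top_k": 8,
    "padding": 8,
    "n_groups": 1,
    "top_k_groups": 1,
    "routed_scaling": 2.5,
    "has_routing_bias": True,
}

# Basic sanity checks on the configuration values.
assert glm4_moe_case["top_k"] <= glm4_moe_case["num_experts"]
assert glm4_moe_case["n_groups"] == glm4_moe_case["top_k_groups"] == 1
```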

Sequence Diagram(s)

(omitted — change is a kernel sentinel fix and test addition; no multi-component new control flow to diagram)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested reviewers

  • djmmoss
  • cyx-6
  • jimmyzho
  • jiahanc

Poem

🐰 I hopped through kernels, nimble and spry,
I swapped -1 for -∞ with a gleeful sigh.
Bias went negative, now scores sort right,
Tests hop in place to guard day and night.
Cheers — a rabbit fixes kernels in flight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title 'tiny support glm routing' accurately describes the main change: adding GLM routing support to the TRT LLM MoE implementation.
  • Description check ✅ Passed: The PR description includes motivation with benchmarks, references the related issue #2168, and provides test results demonstrating the changes work correctly.
  • Linked Issues check ✅ Passed: The code changes directly address issue #2168 by replacing the invalid score sentinel with -float(INFINITY) to support negative bias values in GLM routing.
  • Out of Scope Changes check ✅ Passed: All changes are scoped to fixing GLM routing support: modifications to the invalid-score handling in the routing kernel and a new test case for GLM4_MoE routing.



@gemini-code-assist

Summary of Changes

Hello @b8zhong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces initial support for GLM (General Language Model) routing within the flashinfer_trtllm framework. The primary motivation is to enable the use of flashinfer_trtllm for GLM in SGLang. This is achieved by a crucial adjustment in the CUDA kernel to correctly manage negative biases inherent in GLM-style routing, ensuring robust score handling. Additionally, a dedicated test configuration for GLM4_MoE has been added to validate these changes.

Highlights

  • GLM Routing Support: Implemented support for GLM-style routing by adjusting the invalidScoreFloat in the DeepSeek MoE routing kernel to correctly handle negative biases, ensuring invalid scores are always considered the smallest.
  • GLM4_MoE Test Configuration: Added a new test parameter set for "GLM4_MoE" in the MoE routing tests, defining specific configurations for experts, top-k, padding, and routing bias to validate GLM4_MoE functionality.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for GLM-style routing within the flashinfer_trtllm framework. The core change involves adjusting the invalidScoreFloat constant in the DeepSeek MoE routing kernel to a very large negative value. This ensures correct handling of potentially negative biases in GLM routing and also improves the robustness of other routing methods that might use negative biases. A new test case for GLM4_MoE has been added to the test suite, which is a great way to validate the new functionality. The changes are well-contained, correct, and improve the flexibility of the MoE routing kernel. The PR looks good to merge.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)

1-1: Critical: Fix clang-format violations before merge.

The pipeline failure indicates formatting issues that must be resolved:

clang-format formatting check failed. The hook modified files (diff shown) and CI exited with code 1.

Run the following to fix:

# Re-run pre-commit locally
pre-commit run --all-files

# Or run clang-format directly
clang-format -i csrc/trtllm_fused_moe_routing_deepseek.cu
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd2b033 and d937511.

📒 Files selected for processing (2)
  • csrc/trtllm_fused_moe_routing_deepseek.cu
  • tests/moe/test_trtllm_gen_fused_moe.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Test implementations should use flashinfer.utils functions (get_compute_capability, is_sm90a_supported, is_sm100a_supported, etc.) to skip tests on unsupported GPU architectures
For testing with mpirun on multi-GPU systems, use the pattern: mpirun -np <num_gpus> pytest tests/path/to/test.py::test_function
Avoid OOM (out-of-memory) errors in tests by using appropriate problem sizes - tests/conftest.py provides auto-skipping for OOM tests as a safety net but should not be relied upon

Files:

  • tests/moe/test_trtllm_gen_fused_moe.py
csrc/**/*.cu

📄 CodeRabbit inference engine (CLAUDE.md)

Framework bindings and PyTorch tensor handling should be implemented in csrc/ via TVM-FFI, not in include/ headers

Files:

  • csrc/trtllm_fused_moe_routing_deepseek.cu
🧬 Code graph analysis (1)
tests/moe/test_trtllm_gen_fused_moe.py (3)
flashinfer/fused_moe/core.py (1)
  • RoutingMethodType (61-75)
include/flashinfer/trtllm/fused_moe/runner.h (1)
  • RoutingMethodType (37-136)
tests/moe/test_dpsk_fused_moe_fp8.py (1)
  • FP8BlockScaleMoe (597-600)
🪛 GitHub Actions: pre-commit
csrc/trtllm_fused_moe_routing_deepseek.cu

[error] 1-1: clang-format formatting check failed. The hook modified files (diff shown) and CI exited with code 1. Re-run pre-commit locally with 'pre-commit run --all-files' or run clang-format to fix formatting.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (3)
csrc/trtllm_fused_moe_routing_deepseek.cu (2)

103-103: LGTM! Comment accurately reflects the updated logic.

The comment correctly documents that invalidScoreFloat ensures invalid values are always smaller than valid scores, which is essential for correct topk selection with negative bias.


59-61: LGTM! Sentinel change correctly handles negative bias in GLM-style routing.

The change from -1.F to -1e10F properly addresses GLM-style routing, where negative bias values can push scores into the negative range. With sigmoid(logits) + bias and a negative bias (e.g., -2.0), valid scores can themselves be negative, so the old sentinel -1.F no longer guaranteed that invalid scores stay smaller than valid ones. The new value -1e10F ensures correct top-k ordering regardless of bias magnitude.

Note: The standard MOE kernel (csrc/fused_moe/noAuxTcKernels.cu) uses -INFINITY because it assumes non-negative scores, but the DeepSeek variant correctly uses -1e10F to handle negative bias.
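A small standalone sketch (plain Python with illustrative values, not the kernel code) of why the -1.F sentinel breaks down under a negative bias while a negative-infinity sentinel does not:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits for 4 experts; expert 2 is masked out (invalid).
logits = [0.5, -0.3, None, 1.2]
bias = -2.0  # GLM-style correction bias can be strongly negative

def biased_scores(sentinel: float) -> list:
    return [sentinel if l is None else sigmoid(l) + bias for l in logits]

def top2(scores: list) -> list:
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:2]

# Old sentinel: every valid score falls below -1.0, so the invalid
# expert wrongly wins a top-k slot.
assert 2 in top2(biased_scores(-1.0))
# -inf sentinel: the invalid expert can never be selected.
assert 2 not in top2(biased_scores(-math.inf))
```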

tests/moe/test_trtllm_gen_fused_moe.py (1)

2694-2709: LGTM! Test case properly validates GLM routing fix.

The new GLM4_MoE test case correctly exercises the routing fix for negative bias handling:

  • Uses GLM's specific configuration: n_groups=1, top_k_groups=1
  • Enables has_routing_bias=True to test bias handling (including negative values generated by torch.randn)
  • Tests with 160 experts, adding coverage for this expert count
  • Compatible implementations and intermediate sizes are appropriate

This test case directly validates the fix for issue #2168.
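For reference, a minimal Python sketch of the sigmoid-plus-bias top-k routing this test exercises (a simplification of the kernel, assuming n_groups=1 so group selection is trivial; the function name and exact normalization are illustrative):

```python
import math

def route(logits, bias, top_k=8, routed_scaling=2.5):
    """Sketch of sigmoid-then-bias top-k routing with a correction bias.

    Selection uses sigmoid(logit) + bias; the returned weights use the
    unbiased sigmoid scores, normalized and scaled by routed_scaling.
    """
    scores = [1.0 / (1.0 + math.exp(-l)) for l in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    selected = sorted(range(len(logits)), key=biased.__getitem__, reverse=True)[:top_k]
    denom = sum(scores[i] for i in selected)
    return {i: scores[i] / denom * routed_scaling for i in selected}

# Toy example: 16 experts, a uniformly negative bias (as observed in the
# GLM checkpoints quoted later in this thread), top-2 for readability.
logits = [0.1 * i for i in range(16)]
w = route(logits, [-2.0] * 16, top_k=2)
assert set(w) == {14, 15}                  # highest-logit experts win
assert abs(sum(w.values()) - 2.5) < 1e-9   # weights sum to routed_scaling
```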


@yzh119 yzh119 left a comment


also cc @jiahanc and @ChristinaZ for another look.

-static constexpr float invalidScoreFloat = -1.F;
+// note that for invalid scores, we use a very negative value:
+// needed for GLM-style routing where bias can be negative
+static constexpr float invalidScoreFloat = -1e10F;
Collaborator

Is it possible to use negative zero? It's often used as an invalid score in float.

Contributor Author

Here, I think it would not work: signed zero is a valid score value here and just compares equal to 0. (Correct me if I'm wrong.)

model.layers.92.mlp.gate.e_score_correction_bias
  shape: torch.Size([160])
  min: -9.6087
  max: -9.2649
  mean: -9.3417
  ⚠️  Has 160 NEGATIVE values!

model.layers.88.mlp.gate.e_score_correction_bias
  shape: torch.Size([160])
  min: -2.0029
  max: -1.8230
  mean: -1.8640
  ⚠️  Has 160 NEGATIVE values!

model.layers.89.mlp.gate.e_score_correction_bias
  shape: torch.Size([160])
  min: -2.1478
  max: -1.9239
  mean: -1.9730
  ⚠️  Has 160 NEGATIVE values!
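A quick illustration (plain Python, same comparison semantics as IEEE-754 floats in CUDA) of why negative zero cannot serve as the invalid sentinel here:

```python
import math

# Negative zero compares equal to positive zero, so it cannot mark an
# entry as invalid through ordinary comparisons.
assert -0.0 == 0.0

# Worse: with GLM's negative correction bias, valid scores are negative
# (e.g. min -9.61 in the dump above), so every valid expert would rank
# BELOW the "invalid" sentinel in a top-k selection.
valid_score = -1.86  # sigmoid(logit) + bias, with bias around -2
assert valid_score < -0.0

# The sign bit is only visible via copysign-style inspection, not ordering.
assert math.copysign(1.0, -0.0) == -1.0
```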

Contributor

Agree, I think we can also use static constexpr float invalidScoreFloat = float{-INFINITY};

Contributor Author

Done 👍


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)

61-64: Correct fix for GLM negative bias support.

Using negative infinity as the invalid score sentinel ensures it's always smaller than any valid score, regardless of bias magnitude. The comment clearly documents the rationale, which is helpful for future maintainers. Based on learnings, this aligns with the guideline to leave comments explaining special algorithmic choices in performance-critical hot paths.

Nitpick: The float() cast is redundant since INFINITY is already a float constant expression. Consider simplifying:

Optional simplification
-  static constexpr float invalidScoreFloat = -float(INFINITY);
+  static constexpr float invalidScoreFloat = -INFINITY;
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db98598 and 49c7011.

📒 Files selected for processing (1)
  • csrc/trtllm_fused_moe_routing_deepseek.cu
🧰 Additional context used
📓 Path-based instructions (1)
csrc/**/*.cu

📄 CodeRabbit inference engine (CLAUDE.md)

Framework bindings and PyTorch tensor handling should be implemented in csrc/ via TVM-FFI, not in include/ headers

Files:

  • csrc/trtllm_fused_moe_routing_deepseek.cu
🧠 Learnings (2)
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : For performance-critical hot paths, leave comments explaining special algorithmic choices and potential alternatives for future reviewers

Applied to files:

  • csrc/trtllm_fused_moe_routing_deepseek.cu
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.

Applied to files:

  • csrc/trtllm_fused_moe_routing_deepseek.cu
🔇 Additional comments (2)
csrc/trtllm_fused_moe_routing_deepseek.cu (2)

17-18: LGTM!

The <cmath> header is correctly included to provide the INFINITY macro used for the new invalid score sentinel.


105-107: LGTM!

The updated comment accurately reflects the new semantics: with -INFINITY as the sentinel, invalid values are guaranteed to be smaller than any valid scoreBias regardless of the bias sign.


yzh119 commented Jan 13, 2026

/bot run

@flashinfer-bot

GitLab MR !242 has been created, and the CI pipeline #41653617 is currently running. I'll report back once the pipeline job completes.


@yzh119 yzh119 left a comment


LGTM

@yzh119 yzh119 merged commit f0277fd into flashinfer-ai:main Jan 14, 2026
9 checks passed
@b8zhong b8zhong deleted the brayden/add-glm-routing branch January 14, 2026 17:48


Development

Successfully merging this pull request may close these issues.

[BUG] Negative bias is not allowed in DeepSeek routing for TRT LLM MoE
