tiny support glm routing #2313
Conversation
📝 Walkthrough
Replaced the DeepSeek routing kernel's invalid-score sentinel with negative-infinity semantics and updated the comments; added a parameterized GLM4_MoE DeepSeekV3 test case exercising routed_scaling and routing_bias.
Sequence Diagram(s): omitted (the change is a kernel sentinel fix plus a test addition; no multi-component control flow to diagram).
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Summary of Changes
Hello @b8zhong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces initial support for GLM (General Language Model) routing within the flashinfer_trtllm framework.
Code Review
This pull request introduces support for GLM-style routing within the flashinfer_trtllm framework. The core change adjusts the `invalidScoreFloat` constant in the DeepSeek MoE routing kernel to a very large negative value. This ensures correct handling of potentially negative biases in GLM routing and also improves the robustness of other routing methods that might use negative biases. A new test case for GLM4_MoE has been added to the test suite, which is a great way to validate the new functionality. The changes are well-contained, correct, and improve the flexibility of the MoE routing kernel. The PR looks good to merge.
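To make the failure mode concrete, here is a minimal host-side sketch (illustrative only, not the kernel code; the logits and bias values are made up) of how a `-1.F` sentinel can outrank valid scores once a negative correction bias pushes them below -1:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // Hypothetical logits for 4 experts; expert 3 is masked out (invalid).
  std::vector<float> scores = {0.5f, -0.3f, 1.2f, 0.0f};
  const float bias = -2.0f;  // GLM-style negative correction bias
  for (auto& s : scores) s = 1.0f / (1.0f + std::exp(-s)) + bias;  // sigmoid + bias

  // Old sentinel: every valid score is now below -1.0f, so the masked
  // expert 3 would win top-1 selection.
  scores[3] = -1.0f;
  auto top = std::max_element(scores.begin(), scores.end()) - scores.begin();
  std::printf("old sentinel picks expert %td (wrong)\n", top);  // prints 3

  // -inf (or a huge negative value like -1e10f) can never beat a finite
  // valid score, so topk ordering is correct regardless of the bias.
  scores[3] = -INFINITY;
  top = std::max_element(scores.begin(), scores.end()) - scores.begin();
  std::printf("fixed sentinel picks expert %td (correct)\n", top);  // prints 2
}
```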
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)
1-1: Critical: Fix clang-format violations before merge.
The pipeline failure indicates formatting issues that must be resolved: the clang-format check failed, the hook modified files (diff shown), and CI exited with code 1. Run the following to fix:

```bash
# Re-run pre-commit locally
pre-commit run --all-files
# Or run clang-format directly
clang-format -i csrc/trtllm_fused_moe_routing_deepseek.cu
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
csrc/trtllm_fused_moe_routing_deepseek.cu
tests/moe/test_trtllm_gen_fused_moe.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Test implementations should use flashinfer.utils functions (get_compute_capability, is_sm90a_supported, is_sm100a_supported, etc.) to skip tests on unsupported GPU architectures
For testing with mpirun on multi-GPU systems, use the pattern: mpirun -np <num_gpus> pytest tests/path/to/test.py::test_function
Avoid OOM (out-of-memory) errors in tests by using appropriate problem sizes; tests/conftest.py provides auto-skipping for OOM tests as a safety net but should not be relied upon
Files:
tests/moe/test_trtllm_gen_fused_moe.py
csrc/**/*.cu
📄 CodeRabbit inference engine (CLAUDE.md)
Framework bindings and PyTorch tensor handling should be implemented in csrc/ via TVM-FFI, not in include/ headers
Files:
csrc/trtllm_fused_moe_routing_deepseek.cu
🧬 Code graph analysis (1)
tests/moe/test_trtllm_gen_fused_moe.py (3)
flashinfer/fused_moe/core.py (1)
- RoutingMethodType (61-75)
include/flashinfer/trtllm/fused_moe/runner.h (1)
- RoutingMethodType (37-136)
tests/moe/test_dpsk_fused_moe_fp8.py (1)
- FP8BlockScaleMoe (597-600)
🪛 GitHub Actions: pre-commit
csrc/trtllm_fused_moe_routing_deepseek.cu
[error] 1-1: clang-format formatting check failed. The hook modified files (diff shown) and CI exited with code 1. Re-run pre-commit locally with 'pre-commit run --all-files' or run clang-format to fix formatting.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (3)
csrc/trtllm_fused_moe_routing_deepseek.cu (2)
103-103: LGTM! Comment accurately reflects the updated logic.
The comment correctly documents that invalidScoreFloat ensures invalid values are always smaller than valid scores, which is essential for correct topk selection with negative bias.
59-61: LGTM! Sentinel change correctly handles negative bias in GLM-style routing.
The change from -1.F to -1e10F properly addresses DeepSeek's GLM routing, where negative bias values can push scores into the negative range. With sigmoid(logits) + bias and a negative bias (e.g., -2.0), valid scores can be negative, so the old sentinel -1.F would not guarantee that invalid scores remain smaller than valid ones. The new value -1e10F ensures proper topk ordering regardless of bias magnitude.
Note: The standard MoE kernel (csrc/fused_moe/noAuxTcKernels.cu) uses -INFINITY because it assumes non-negative scores, but the DeepSeek variant correctly uses -1e10F to handle negative bias.
tests/moe/test_trtllm_gen_fused_moe.py (1)
2694-2709: LGTM! Test case properly validates the GLM routing fix.
The new GLM4_MoE test case correctly exercises the routing fix for negative bias handling:
- Uses GLM's specific configuration: n_groups=1, top_k_groups=1
- Enables has_routing_bias=True to test bias handling (including negative values generated by torch.randn)
- Tests with 160 experts, adding coverage for this expert count
- Compatible implementations and intermediate sizes are appropriate
This test case directly validates the fix for issue #2168.
yzh119
left a comment
also cc @jiahanc and @ChristinaZ for another look.
```diff
- static constexpr float invalidScoreFloat = -1.F;
+ // note that for invalid scores, we use a very negative value:
+ // needed for GLM-style routing where bias can be negative
+ static constexpr float invalidScoreFloat = -1e10F;
```
is it possible to use negative zero? It's usually used as an invalid score in float.
Here, I think it would not work? (Signed zero is a valid representation here, right, and will just compare as 0? Correct me if I'm wrong.)
```text
model.layers.92.mlp.gate.e_score_correction_bias
shape: torch.Size([160])
min: -9.6087
max: -9.2649
mean: -9.3417
⚠️ Has 160 NEGATIVE values!

model.layers.88.mlp.gate.e_score_correction_bias
shape: torch.Size([160])
min: -2.0029
max: -1.8230
mean: -1.8640
⚠️ Has 160 NEGATIVE values!

model.layers.89.mlp.gate.e_score_correction_bias
shape: torch.Size([160])
min: -2.1478
max: -1.9239
mean: -1.9730
⚠️ Has 160 NEGATIVE values!
```
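For reference, here is a quick standalone check (an illustrative sketch, not from the PR) of why negative zero cannot act as a below-everything sentinel: IEEE 754 defines -0.0f as comparing equal to 0.0f, so it would still outrank the genuinely negative valid scores shown above:

```cpp
#include <cstdio>

int main() {
  const float negZero = -0.0f;
  const float validScore = -1.8f;  // e.g. sigmoid(logit) plus a negative bias

  // IEEE 754: -0.0f compares equal to 0.0f, so in an ordering comparison
  // it behaves like an ordinary score of 0, not a "smaller than anything"
  // marker.
  std::printf("-0.0f == 0.0f -> %d\n", negZero == 0.0f);       // prints 1
  std::printf("-0.0f > -1.8f -> %d\n", negZero > validScore);  // prints 1
  return 0;
}
```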
Agree, I think we can also use `static constexpr float invalidScoreFloat = float{-INFINITY};`
Done 👍
Actionable comments posted: 0
🧹 Nitpick comments (1)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)
61-64: Correct fix for GLM negative bias support.
Using negative infinity as the invalid score sentinel ensures it's always smaller than any valid score, regardless of bias magnitude. The comment clearly documents the rationale, which is helpful for future maintainers. Based on learnings, this aligns with the guideline to leave comments explaining special algorithmic choices in performance-critical hot paths.
Nitpick: The float() cast is redundant since INFINITY is already a float constant expression. Consider simplifying:
Optional simplification

```diff
- static constexpr float invalidScoreFloat = -float(INFINITY);
+ static constexpr float invalidScoreFloat = -INFINITY;
```
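A small standalone check (a sketch assuming a standards-conforming <cmath>, not project code) that both spellings denote the same constexpr float:

```cpp
#include <cmath>

// INFINITY from <cmath> expands to a constant expression of type float,
// so the extra float() cast adds nothing; both forms are valid constexpr.
static constexpr float a = -float(INFINITY);
static constexpr float b = -INFINITY;
static_assert(a == b, "both spellings denote the same value");

int main() { return 0; }
```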
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
csrc/trtllm_fused_moe_routing_deepseek.cu
🧰 Additional context used
📓 Path-based instructions (1)
csrc/**/*.cu
📄 CodeRabbit inference engine (CLAUDE.md)
Framework bindings and PyTorch tensor handling should be implemented in csrc/ via TVM-FFI, not in include/ headers
Files:
csrc/trtllm_fused_moe_routing_deepseek.cu
🧠 Learnings (2)
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : For performance-critical hot paths, leave comments explaining special algorithmic choices and potential alternatives for future reviewers
Applied to files:
csrc/trtllm_fused_moe_routing_deepseek.cu
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.
Applied to files:
csrc/trtllm_fused_moe_routing_deepseek.cu
🔇 Additional comments (2)
csrc/trtllm_fused_moe_routing_deepseek.cu (2)
17-18: LGTM!
The <cmath> header is correctly included to provide the INFINITY macro used for the new invalid score sentinel.
105-107: LGTM!
The updated comment accurately reflects the new semantics: with -INFINITY as the sentinel, invalid values are guaranteed to be smaller than any valid scoreBias regardless of the bias sign.
/bot run
yzh119
left a comment
LGTM
📌 Description
Recently, we want to use flashinfer_trtllm for GLM in SGLang.
- flashinfer_cutlass
- flashinfer_trtllm

Fix #2168
Thanks @divchenko for providing this!
Summary by CodeRabbit
Bug Fixes
- The DeepSeek MoE routing kernel now uses negative infinity as its invalid-score sentinel so that negative routing biases are handled correctly.
Tests
- Added a parameterized GLM4_MoE test case exercising routed_scaling and routing_bias.