[Kernel] Porting triton_kernels for FusedMoE #18595
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has merge conflicts that must be resolved before it can be merged.
@zyongye can you implement this using the framework described in
I didn't realize that these kernels were so incompatible with the current modular architecture. I think we should actually leave the triton kernels as a standalone MoE alternative until we can come up with a more general framework to handle them.
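To make the "standalone alternative" idea concrete, here is a minimal sketch of what such a dispatch could look like. Both callables are placeholders rather than actual vLLM symbols; the flag name is taken from the usage note at the bottom of this PR.

```python
# Minimal sketch (not vLLM code): keep the ported triton_kernels path as a
# standalone alternative selected by an environment flag, leaving the existing
# modular fused_moe path untouched. Both callables below are placeholders.
import os

import torch


def legacy_fused_moe(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    """Placeholder for the existing fused_moe implementation."""
    return (x @ w1) @ w2


def triton_kernels_moe(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    """Placeholder for the standalone path built on the ported triton_kernels."""
    return (x @ w1) @ w2


def moe_forward(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    # Opt in to the experimental path only when explicitly requested.
    if os.environ.get("VLLM_USE_EXP_MOE", "0") == "1":
        return triton_kernels_moe(x, w1, w2)
    return legacy_fused_moe(x, w1, w2)
```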
…still has some CUDA illegal memory access bug
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has merge conflicts that must be resolved before it can be merged.
@zyongye I'm porting triton_kernels to SGLang. I managed to replace SGLang's fused_moe with the Triton kernel and added a unit test to verify that the new kernel is invoked as expected, but the end-to-end run fails on Triton 3.4.0 (nightly build). I'm curious how you ported this kernel: did you upgrade Triton to 3.4.0 directly, or back-port the Triton 3.4.0 kernels required by the new matmul_ogs into 3.3.1 on demand? I also tried porting Triton 3.4.0's new functions into 3.3.1 to make E2E work, but eventually gave up on that approach because the scope of the work did not converge. Update: in Triton 3.4.0 the performance is very slow; generation throughput dropped from 200 token/s to 7.99 token/s.
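For what it's worth, the "is the new kernel actually invoked?" check can be done by spying on the kernel entry point with unittest.mock. The sketch below is self-contained; `fused_moe_forward` and `triton_kernels_matmul` are stand-ins for the real wrapper and kernel, not SGLang or triton_kernels symbols.

```python
# Self-contained sketch of the "new kernel is invoked" unit test pattern.
# Replace the two placeholder functions with the real wrapper and the real
# triton_kernels entry point when wiring this into an actual test suite.
from unittest import mock

import torch


def triton_kernels_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Placeholder for the triton_kernels matmul entry point."""
    return a @ b


def fused_moe_forward(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Placeholder for the fused-MoE wrapper that should route to the new kernel."""
    return triton_kernels_matmul(x, w)


def test_new_kernel_is_invoked():
    x, w = torch.randn(4, 8), torch.randn(8, 16)
    # Wrap the kernel so it still runs, but record whether it was called.
    with mock.patch(f"{__name__}.triton_kernels_matmul",
                    wraps=triton_kernels_matmul) as spy:
        fused_moe_forward(x, w)
    assert spy.called, "fused_moe_forward did not route through the new kernel"
```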
@yuan-luo Yes, you need both a nightly-build Triton and PyTorch 2.8 to run these kernels.
Hi @zyongye, I use PyTorch 2.8.0 and a Triton nightly build, but the performance is still slow, even though the results are correct.
These are the benchmark results for the kernels.
After fixing the logic issue, the performance improved and I finally got the results. However, at batch sizes smaller than 256 the performance is still lower than the legacy fused_moe.
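In case anyone wants to reproduce the small-batch comparison, a sweep like the one below (using triton.testing.do_bench) makes the crossover point easy to see; `legacy_fused_moe` and `new_fused_moe` are placeholders for the two code paths being compared.

```python
# Sketch of a batch-size sweep comparing the legacy fused_moe path against the
# triton_kernels-based one. The two callables are placeholders to be wired up
# to the real entry points; timings are reported in milliseconds.
import torch
import triton.testing


def sweep(legacy_fused_moe, new_fused_moe, hidden_size=4096, dtype=torch.bfloat16):
    for batch_size in (1, 16, 64, 128, 256, 512, 1024):
        x = torch.randn(batch_size, hidden_size, device="cuda", dtype=dtype)
        legacy_ms = triton.testing.do_bench(lambda: legacy_fused_moe(x))
        new_ms = triton.testing.do_bench(lambda: new_fused_moe(x))
        print(f"batch={batch_size:5d}  legacy={legacy_ms:.3f} ms  new={new_ms:.3f} ms")
```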
FIX #16294
Can only merge after migrating to PyTorch 2.8 and Triton 3.4.
Adding triton_kernels from the Triton repo.
Performance breakdown:
Setup:
CPU: Intel Xeon Gold 6126 @ 2.60GHz
GPU: 8x RTX A6000, driver version 560.35.05, CUDA 12.6 (model sharded with TP where needed)
Dataset: ShareGPT with 100 requests (inf QPS)
Dependencies:
renormalize=False support for models like Qwen1.5 and Mixtral 8x7B.
(Update 06.04.25: since Triton accepted our PR, we only need nightly builds of PyTorch and Triton; a quick version check is sketched below.)
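A convenience guard for the versions discussed in this thread (PyTorch 2.8 and Triton 3.4 nightlies) might look like this; it is only a sketch, not part of the PR.

```python
# Convenience check that the environment matches the versions mentioned above:
# PyTorch >= 2.8 and Triton >= 3.4 (both currently nightly builds, so the
# thresholds below use .dev0 to accept dev/nightly version strings).
from packaging.version import Version

import torch
import triton


def check_versions() -> None:
    torch_ver = Version(torch.__version__.split("+")[0])
    triton_ver = Version(triton.__version__.split("+")[0])
    if torch_ver < Version("2.8.0.dev0"):
        raise RuntimeError(f"PyTorch {torch.__version__} found; need a 2.8 (nightly) build")
    if triton_ver < Version("3.4.0.dev0"):
        raise RuntimeError(f"Triton {triton.__version__} found; need a 3.4 (nightly) build")


if __name__ == "__main__":
    check_versions()
    print("Environment looks OK for the triton_kernels FusedMoE path.")
```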
To run the new kernel, set the environment variable
VLLM_USE_EXP_MOE=1
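For example, in a small offline-inference script the flag can be set before vLLM is imported; the Mixtral checkpoint below is only an illustrative MoE model.

```python
# Example: enable the experimental triton_kernels-based MoE path for an
# offline generation run. Set the flag before importing vLLM so it is picked
# up when the engine reads its environment configuration.
import os

os.environ["VLLM_USE_EXP_MOE"] = "1"

from vllm import LLM, SamplingParams  # imported after the flag is set

# "mistralai/Mixtral-8x7B-Instruct-v0.1" is only an illustrative MoE checkpoint.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=8)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```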