Add Cutlass integration for MoE FP8 #19843
Conversation
…#17328) Signed-off-by: mgoin <[email protected]>
…ct results (vllm-project#17574) Signed-off-by: Lucas Wilkinson <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Summary of Changes
Hello @JackChuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily integrates a Cutlass-based kernel to accelerate Mixture-of-Experts (MoE) FP8 workloads, offering substantial performance gains for large-scale inference. It includes necessary adaptations for FP8 weight handling and introduces a new environment variable for optional activation. Additionally, it incorporates a memory leak fix in the scheduler and an improvement to FlashAttention's AOT scheduling.
Highlights
- Performance Enhancement: Integrated a Cutlass-based kernel for MoE FP8 workloads, demonstrating significant throughput improvements (up to 27.62%) in large test scenarios, particularly beneficial for high-concurrency environments.
- Modular Integration: Introduced a new environment variable (`VLLM_USE_CUTLASS_MOE_FP8`) to optionally enable the Cutlass kernel, ensuring backward compatibility and allowing users to choose their preferred execution path without affecting the default behavior.
- FP8 Scaling Adaptation: Implemented a conversion mechanism (`fp8_bf16_fp8`) to adapt per-block FP8 scaling to a per-tensor format, making it compatible with the current Cutlass kernel interface, as per-block scaling is not yet supported by Cutlass.
- Kernel Tuning Configurations: Added numerous new JSON configuration files for various NVIDIA GPU devices (H100, H20, H200) and FP8 quantization settings, providing optimized parameters for the Cutlass MoE FP8 kernels.
- Scheduler Memory Fix: Addressed a potential memory leak in the scheduler's `update_from_output` method, preventing stopped requests from being inadvertently re-cached and ensuring proper memory deallocation.
- FlashAttention AOT Improvement: Enhanced FlashAttention's Ahead-of-Time (AOT) scheduling logic to better handle sliding-window configurations by ensuring consistent window sizes across attention layers, improving efficiency for certain models.
A flag named `VLLM_USE_CUTLASS_MOE_FP8` controls whether to activate the Cutlass kernel. By default, this flag is disabled, ensuring the original execution path remains completely untouched.
Usage: `$ VLLM_USE_CUTLASS_MOE_FP8=1 python3 -m vllm.entrypoints.openai.api_server ...`
Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Co-authored-by: Yichen Wang <[email protected]>
Code Review
This PR introduces Cutlass integration for MoE FP8, which shows promising performance improvements. It also includes unrelated but valuable enhancements to FlashAttention AOT scheduling and a scheduler memory leak fix.
The main concern is the robustness of the Cutlass MoE FP8 path when dealing with models that are not already in a per-block FP8 quantized format. The current implementation seems to assume the input weights are per-block FP8, which might lead to errors or incorrect behavior if quantizing on-the-fly or using per-tensor FP8 checkpoints with the Cutlass flag enabled. This needs to be addressed or clearly documented as a prerequisite.
Other changes, including the new benchmark configurations, environment variable, and the scheduler/attention backend fixes, look good.
```python
else:
    w13_weight, w13_weight_scale_inv = \
        self.fp8_bf16_fp8(w13_weight, w13_weight_scale_inv)
    w2_weight, w2_weight_scale_inv = \
        self.fp8_bf16_fp8(w2_weight, w2_weight_scale_inv)

    w13_weight_scale_inv = w13_weight_scale_inv.repeat(w13_weight.size(0))
    w2_weight_scale_inv = w2_weight_scale_inv.repeat(w2_weight.size(0))

    layer.w13_weight.data.copy_(w13_weight)
    layer.w13_weight_scale_inv = Parameter(w13_weight_scale_inv,
                                           requires_grad=False)
    layer.w2_weight.data.copy_(w2_weight)
    layer.w2_weight_scale_inv = Parameter(w2_weight_scale_inv,
                                          requires_grad=False)
```
The current logic for the Cutlass path (when `envs.VLLM_USE_CUTLASS_MOE_FP8` is true) appears to correctly handle models that are already per-block FP8 quantized (i.e., `self.block_quant` is true).

However, it seems this path will fail if `VLLM_USE_CUTLASS_MOE_FP8=1` is used with:

- Models quantized on-the-fly from FP16/BF16 to FP8 (where `self.quant_config.is_checkpoint_fp8_serialized` is `False`).
- Models loaded from per-tensor FP8 checkpoints (where `self.block_quant` is `False` but `is_checkpoint_fp8_serialized` is `True`).

In these scenarios, `layer.w13_weight_scale_inv` (and `w2_weight_scale_inv`) would not have been initialized as expected by this code block, leading to an `AttributeError` when accessed (e.g., `w13_weight_scale_inv` in the call to `self.fp8_bf16_fp8`).

Recommendation:
This path needs to be robust for all scenarios where FP8 MoE is active and Cutlass is enabled.

- For on-the-fly quantization (FP16/BF16 -> FP8 per-tensor): the weights are already per-tensor FP8 after the block starting at line ~687 (in the full file context, handling `not self.quant_config.is_checkpoint_fp8_serialized`). The `fp8_bf16_fp8` conversion is not needed. The existing per-tensor scales (e.g., `layer.w13_weight_scale`) should be used, potentially renamed/assigned to `layer.w13_weight_scale_inv` and then repeated as done on line 653.
- For per-tensor FP8 checkpoints: similar to on-the-fly quantization, the weights are already per-tensor. The `fp8_bf16_fp8` conversion is not needed. Scales from `layer.w13_weight_scale` should be used.

This implies that the call to `fp8_bf16_fp8` should likely be conditional, e.g., only if `self.block_quant` is true. Otherwise, the existing per-tensor quantized weights and scales should be prepared for Cutlass (e.g., repeating scales).

Please clarify whether `VLLM_USE_CUTLASS_MOE_FP8` is only intended for pre-quantized per-block FP8 models. If so, this needs to be very clearly documented, and perhaps an error should be raised if the flag is used with incompatible model/quantization settings.
```python
# NOTE(rob): since we free stopped reqs above, adding stopped reqs
# to _cached_reqs_data will cause a memory leak.
if req_data.req_id not in self.finished_req_ids:
    self._cached_reqs_data[req_data.req_id].append(req_data)
```
This change correctly prevents adding `CachedRequestData` for already finished requests back to the `_cached_reqs_data` pool. This is a good fix that avoids a potential memory leak where `CachedRequestData` objects for finished requests could accumulate or be incorrectly reused if request IDs were recycled. This significantly improves the robustness of the scheduler, especially for long-running services.
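As a toy illustration (not the scheduler's actual data structures), the guard matters because a cache keyed by request ID grows without bound if finished requests keep being re-added:

```python
from collections import defaultdict

# Toy model of the cache: request ID -> reusable per-request data.
_cached_reqs_data: dict[str, list[object]] = defaultdict(list)
finished_req_ids = {"req-2"}

for req_id, req_data in [("req-1", object()), ("req-2", object())]:
    if req_id not in finished_req_ids:      # the fix: skip finished requests
        _cached_reqs_data[req_id].append(req_data)

print(list(_cached_reqs_data))  # ['req-1'] -- finished request not re-cached
```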
```python
def fp8_bf16_fp8(self, fp8_tensor, fp8_scale):
    # View the weight as [B, M//128, 128, N//128, 128] so each 128x128 block
    # lines up with its per-block scale, then dequantize in float32.
    blocked_tensor = fp8_tensor.view(
        fp8_tensor.shape[0],
        fp8_tensor.shape[1] // 128, 128,
        fp8_tensor.shape[2] // 128,
        128).to(torch.float32)
    # Because blocked_tensor is 5D, reshape the scale to [B, M//128, 1, N//128, 1]
    dequant_tensor = (blocked_tensor *
                      fp8_scale.unsqueeze(2).unsqueeze(4)).view(
                          fp8_tensor.shape).to(torch.bfloat16).to(torch.float32)

    # Re-quantize with a single per-tensor scale (448 is the FP8 E4M3 max).
    scale_tensor = torch.abs(dequant_tensor).max() / 448
    quant_tensor = dequant_tensor / scale_tensor

    return quant_tensor, scale_tensor
```
This function converts a per-block quantized FP8 tensor to a format suitable for a per-tensor quantized FP8 kernel. Consider adding a docstring to clarify:

- Inputs: `fp8_tensor` is expected to be `torch.float8_e4m3fn` and `fp8_scale` its corresponding per-block scales.
- Output: `quant_tensor` is returned as `torch.float32` (representing values that would be FP8 if cast using `scale_tensor`), and `scale_tensor` is the new per-tensor scale.
- Assumptions: the `.view(...)` operation on `fp8_tensor` (lines 601-605) assumes that `fp8_tensor.shape[1]` and `fp8_tensor.shape[2]` are divisible by 128. This should be mentioned.
- Naming: the name `fp8_bf16_fp8` could be slightly misleading. Perhaps a name like `convert_fp8_block_to_tensor_scaled` or similar, or a very clear docstring, would improve clarity.
Introduced optional support for using Cutlass kernels in the MoE FP8 execution path by converting the per-block scaling format into a per-tensor equivalent, making it compatible with the existing Cutlass kernel interface. Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]> Co-authored-by: Yichen Wang <[email protected]>
Force-pushed from 41022d2 to 84f6ceb.
Purpose
Motivation
With this integration, users can optionally enable the Cutlass backend to improve the performance of MoE FP8 workloads when needed.
Modifications
This PR integrates a Cutlass-based kernel for the MoE FP8 execution path in vLLM. Since the current Cutlass kernel does not support per-block scaling, we adapted the integration by converting the per-block scaling format into a per-tensor equivalent, making it compatible with the existing Cutlass kernel interface.
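For intuition, the sketch below shows the general idea of that per-block to per-tensor conversion on toy shapes (float32 stand-ins for FP8; the function name is made up, and this mirrors the spirit of the PR's `fp8_bf16_fp8` helper rather than its actual code):

```python
import torch

# Rough, self-contained illustration of per-block -> per-tensor scale
# conversion. Toy shapes and float32 stand-ins for FP8 are used here.
def block_to_tensor_scale(weight: torch.Tensor, block_scale: torch.Tensor):
    E, M, N = weight.shape                          # experts, rows, cols (M, N divisible by 128)
    blocked = weight.view(E, M // 128, 128, N // 128, 128).to(torch.float32)
    dequant = (blocked * block_scale.unsqueeze(2).unsqueeze(4)).view(E, M, N)
    per_tensor_scale = dequant.abs().max() / 448    # 448 = FP8 E4M3 max value
    return dequant / per_tensor_scale, per_tensor_scale

w = torch.randn(2, 256, 384)                        # stand-in for FP8 expert weights
s = torch.rand(2, 256 // 128, 384 // 128)           # one scale per 128x128 block
q, scale = block_to_tensor_scale(w, s)
print(q.shape, scale.shape)                         # torch.Size([2, 256, 384]) torch.Size([])
```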
The implementation is modular and backward-compatible. A flag named `VLLM_USE_CUTLASS_MOE_FP8` controls whether to activate the Cutlass kernel. By default, this flag is disabled, ensuring the original execution path remains completely untouched.

Usage:
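```
$ VLLM_USE_CUTLASS_MOE_FP8=1 python3 -m vllm.entrypoints.openai.api_server ...
```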
Test Result
Device: H20 * 8 (TP=8)
Model: DeepSeek-R1
vLLM version: v0.8.5.post1 (reason explained in "Found Issue")
Env variable: USE_V1=1
Server's configuration
Client's configuration
Summarized Results
Total Token Throughput (tok/s)
Conclusion
Under our setup:
- In the `large` test scenario, Cutlass significantly improved throughput by 27.62%.
- In the `small` test scenario, Cutlass performed similarly to the baseline, with no significant difference.

Found Issue ⚠️❗
This PR is currently blocked by #19923, which tracks an OOM issue on the `main` branch during `cutlass_moe_fp8()` tensor allocation.

Our current PR is based on v0.8.5.post1. This is because we discovered an open issue that prevents our current setup (H20 with TP=8 running DeepSeek-R1) from being rebased onto the latest version. In the older version, `cutlass_moe_fp8()` allocates a relatively small amount of memory, but the new implementation allocates at least 24.5 GB, which appears abnormal. Our current testing environment doesn't have sufficient memory to run it. If verification against origin/master is required, we would need help from the community to fix this issue first.
Future work
After this version is merged, we plan to introduce shape-, architecture-, and model-specific kernel tuning to further optimize the Cutlass MoE FP8 kernel. This includes releasing H20-specific tuning parameters, as this version does not yet include any such tuning. On the model side, we will provide tailored tuning parameters optimized for DeepSeek-R1 to better adapt to its unique workload characteristics. These enhancements will help us fully leverage the performance potential of the Cutlass kernel across various deployment scenarios.
The Cutlass kernel in this version uses per-tensor scaling; once it is merged, we will release a per-block version to support users with strict accuracy requirements.