Add Cutlass integration for MoE FP8 #19843
Conversation
…#17328) Signed-off-by: mgoin <[email protected]>
…ct results (vllm-project#17574) Signed-off-by: Lucas Wilkinson <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Summary of Changes
Hello @JackChuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily integrates a Cutlass-based kernel to accelerate Mixture-of-Experts (MoE) FP8 workloads, offering substantial performance gains for large-scale inference. It includes necessary adaptations for FP8 weight handling and introduces a new environment variable for optional activation. Additionally, it incorporates a memory leak fix in the scheduler and an improvement to FlashAttention's AOT scheduling.
Highlights
- Performance Enhancement: Integrated a Cutlass-based kernel for MoE FP8 workloads, demonstrating significant throughput improvements (up to 27.62%) in large test scenarios, particularly beneficial for high-concurrency environments.
- Modular Integration: Introduced a new environment variable (`VLLM_USE_CUTLASS_MOE_FP8`) to optionally enable the Cutlass kernel, ensuring backward compatibility and allowing users to choose their preferred execution path without affecting the default behavior.
- FP8 Scaling Adaptation: Implemented a conversion mechanism (`fp8_bf16_fp8`) to adapt per-block FP8 scaling to a per-tensor format, making it compatible with the current Cutlass kernel interface, as per-block scaling is not yet supported by Cutlass.
- Kernel Tuning Configurations: Added numerous new JSON configuration files for various NVIDIA GPU devices (H100, H20, H200) and FP8 quantization settings, providing optimized parameters for the Cutlass MoE FP8 kernels.
- Scheduler Memory Fix: Addressed a potential memory leak in the scheduler's `update_from_output` method, preventing stopped requests from being inadvertently re-cached and ensuring proper memory deallocation.
- FlashAttention AOT Improvement: Enhanced FlashAttention's Ahead-of-Time (AOT) scheduling logic to better handle sliding-window configurations by ensuring consistent window sizes across attention layers, improving efficiency for certain models.
A flag named `VLLM_USE_CUTLASS_MOE_FP8` controls whether to activate the Cutlass kernel. By default, this flag is disabled, ensuring the original execution path remains completely untouched.
Usage: `$ VLLM_USE_CUTLASS_MOE_FP8=1 python3 -m vllm.entrypoints.openai.api_server ...`
Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]>
Co-authored-by: Yichen Wang <[email protected]>
Code Review
This PR introduces Cutlass integration for MoE FP8, which shows promising performance improvements. It also includes unrelated but valuable enhancements to FlashAttention AOT scheduling and a scheduler memory leak fix.
The main concern is the robustness of the Cutlass MoE FP8 path when dealing with models that are not already in a per-block FP8 quantized format. The current implementation seems to assume the input weights are per-block FP8, which might lead to errors or incorrect behavior if quantizing on-the-fly or using per-tensor FP8 checkpoints with the Cutlass flag enabled. This needs to be addressed or clearly documented as a prerequisite.
Other changes, including the new benchmark configurations, environment variable, and the scheduler/attention backend fixes, look good.
```python
else:
    w13_weight, w13_weight_scale_inv = \
        self.fp8_bf16_fp8(w13_weight, w13_weight_scale_inv)
    w2_weight, w2_weight_scale_inv = \
        self.fp8_bf16_fp8(w2_weight, w2_weight_scale_inv)

    w13_weight_scale_inv = w13_weight_scale_inv.repeat(w13_weight.size(0))
    w2_weight_scale_inv = w2_weight_scale_inv.repeat(w2_weight.size(0))

    layer.w13_weight.data.copy_(w13_weight)
    layer.w13_weight_scale_inv = Parameter(w13_weight_scale_inv,
                                           requires_grad=False)
    layer.w2_weight.data.copy_(w2_weight)
    layer.w2_weight_scale_inv = Parameter(w2_weight_scale_inv,
                                          requires_grad=False)
```
The current logic for the Cutlass path (when `envs.VLLM_USE_CUTLASS_MOE_FP8` is true) appears to correctly handle models that are already per-block FP8 quantized (i.e., `self.block_quant` is true).

However, it seems this path will fail if `VLLM_USE_CUTLASS_MOE_FP8=1` is used with:

- Models quantized on-the-fly from FP16/BF16 to FP8 (where `self.quant_config.is_checkpoint_fp8_serialized` is `False`).
- Models loaded from per-tensor FP8 checkpoints (where `self.block_quant` is `False` but `is_checkpoint_fp8_serialized` is `True`).

In these scenarios, `layer.w13_weight_scale_inv` (and `w2_weight_scale_inv`) would not have been initialized as expected by this code block, leading to an `AttributeError` when accessed (e.g., `w13_weight_scale_inv` in the call to `self.fp8_bf16_fp8`).

Recommendation:
This path needs to be robust for all scenarios where FP8 MoE is active and Cutlass is enabled.

- For on-the-fly quantization (FP16/BF16 -> FP8 per-tensor): the weights are already per-tensor FP8 after the block starting at line ~687 (in the full file context, handling `not self.quant_config.is_checkpoint_fp8_serialized`). The `fp8_bf16_fp8` conversion is not needed. The existing per-tensor scales (e.g., `layer.w13_weight_scale`) should be used, potentially renamed/assigned to `layer.w13_weight_scale_inv` and then repeated as done on line 653.
- For per-tensor FP8 checkpoints: similar to on-the-fly quantization, the weights are already per-tensor. The `fp8_bf16_fp8` conversion is not needed. Scales from `layer.w13_weight_scale` should be used.

This implies that the call to `fp8_bf16_fp8` should likely be conditional, e.g., only if `self.block_quant` is true. Otherwise, the existing per-tensor quantized weights and scales should be prepared for Cutlass (e.g., repeating scales).

Please clarify whether `VLLM_USE_CUTLASS_MOE_FP8` is only intended for pre-quantized per-block FP8 models. If so, this needs to be very clearly documented, and perhaps an error should be raised if the flag is used with incompatible model/quantization settings.
```python
# NOTE(rob): since we free stopped reqs above, adding stopped reqs
# to _cached_reqs_data will cause a memory leak.
if req_data.req_id not in self.finished_req_ids:
    self._cached_reqs_data[req_data.req_id].append(req_data)
```
This change correctly prevents adding `CachedRequestData` for already finished requests back to the `_cached_reqs_data` pool. This is a good fix that avoids a potential memory leak where `CachedRequestData` objects for finished requests could accumulate or be incorrectly reused if request IDs were recycled. This significantly improves the robustness of the scheduler, especially for long-running services.
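As a toy illustration (not the scheduler's actual data structures), the guard matters because a cache keyed by request ID grows without bound if finished requests keep being re-added:

```python
from collections import defaultdict

# Toy model of the cache: request ID -> reusable per-request data.
_cached_reqs_data: dict[str, list[object]] = defaultdict(list)
finished_req_ids = {"req-2"}

for req_id, req_data in [("req-1", object()), ("req-2", object())]:
    if req_id not in finished_req_ids:      # the fix: skip finished requests
        _cached_reqs_data[req_id].append(req_data)

print(list(_cached_reqs_data))  # ['req-1'] -- finished request not re-cached
```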
```python
def fp8_bf16_fp8(self, fp8_tensor, fp8_scale):
    # View the weight as [B, M//128, 128, N//128, 128] so each 128x128 block
    # lines up with its per-block scale, then dequantize in float32.
    blocked_tensor = fp8_tensor.view(
        fp8_tensor.shape[0],
        fp8_tensor.shape[1] // 128, 128,
        fp8_tensor.shape[2] // 128,
        128).to(torch.float32)
    # Because blocked_tensor is 5D, reshape the scale to [B, M//128, 1, N//128, 1]
    dequant_tensor = (blocked_tensor *
                      fp8_scale.unsqueeze(2).unsqueeze(4)).view(
                          fp8_tensor.shape).to(torch.bfloat16).to(torch.float32)

    # Re-quantize with a single per-tensor scale (448 is the FP8 E4M3 max).
    scale_tensor = torch.abs(dequant_tensor).max() / 448
    quant_tensor = dequant_tensor / scale_tensor

    return quant_tensor, scale_tensor
```
This function converts a per-block quantized FP8 tensor to a format suitable for a per-tensor quantized FP8 kernel. Consider adding a docstring to clarify:

- Inputs: `fp8_tensor` is expected to be `torch.float8_e4m3fn` and `fp8_scale` its corresponding per-block scales.
- Output: `quant_tensor` is returned as `torch.float32` (representing values that would be FP8 if cast using `scale_tensor`), and `scale_tensor` is the new per-tensor scale.
- Assumptions: the `.view(...)` operation on `fp8_tensor` (lines 601-605) assumes that `fp8_tensor.shape[1]` and `fp8_tensor.shape[2]` are divisible by 128. This should be mentioned.
- Naming: the name `fp8_bf16_fp8` could be slightly misleading. Perhaps a name like `convert_fp8_block_to_tensor_scaled` or similar, or a very clear docstring, would improve clarity.
Introduced optional support for using Cutlass kernels in the MoE FP8 execution path by converting the per-block scaling format into a per-tensor equivalent, making it compatible with the existing Cutlass kernel interface. Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]> Co-authored-by: Yichen Wang <[email protected]>
Force-pushed from 41022d2 to 84f6ceb.
Purpose
Motivation
With this integration, users can optionally enable the Cutlass backend to improve the performance of MoE FP8 workloads when needed.
Modifications
This PR integrates a Cutlass-based kernel for the MoE FP8 execution path in vLLM. Since the current Cutlass kernel does not support per-block scaling, we adapted the integration by converting the per-block scaling format into a per-tensor equivalent, making it compatible with the existing Cutlass kernel interface.
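For intuition, the sketch below shows the general idea of that per-block to per-tensor conversion on toy shapes (float32 stand-ins for FP8; the function name is made up, and this mirrors the spirit of the PR's `fp8_bf16_fp8` helper rather than its actual code):

```python
import torch

# Rough, self-contained illustration of per-block -> per-tensor scale
# conversion. Toy shapes and float32 stand-ins for FP8 are used here.
def block_to_tensor_scale(weight: torch.Tensor, block_scale: torch.Tensor):
    E, M, N = weight.shape                          # experts, rows, cols (M, N divisible by 128)
    blocked = weight.view(E, M // 128, 128, N // 128, 128).to(torch.float32)
    dequant = (blocked * block_scale.unsqueeze(2).unsqueeze(4)).view(E, M, N)
    per_tensor_scale = dequant.abs().max() / 448    # 448 = FP8 E4M3 max value
    return dequant / per_tensor_scale, per_tensor_scale

w = torch.randn(2, 256, 384)                        # stand-in for FP8 expert weights
s = torch.rand(2, 256 // 128, 384 // 128)           # one scale per 128x128 block
q, scale = block_to_tensor_scale(w, s)
print(q.shape, scale.shape)                         # torch.Size([2, 256, 384]) torch.Size([])
```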
The implementation is modular and backward-compatible. A flag named `VLLM_USE_CUTLASS_MOE_FP8` controls whether to activate the Cutlass kernel. By default, this flag is disabled, ensuring the original execution path remains completely untouched.

Usage:
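```
$ VLLM_USE_CUTLASS_MOE_FP8=1 python3 -m vllm.entrypoints.openai.api_server ...
```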
Test Result
Device: H20 * 8 (TP=8)
Model: DeepSeek-R1
vLLM version: v0.8.5.post1 (reason explained in "Found Issue")
Env variable: USE_V1=1
Server's configuration
Client's configuration
Summarized Results
Total Token Throughput (tok/s)
Conclusion
Under our setup:
- In the `large` test scenario, Cutlass significantly improved throughput by 27.62%.
- In the `small` test scenario, Cutlass performed similarly to the baseline, with no significant difference.

Found Issue ⚠️❗
This PR is currently blocked by #19923, which tracks an OOM issue on the `main` branch during `cutlass_moe_fp8()` tensor allocation.

Our current PR is based on v0.8.5.post1. This is because we discovered an open issue that prevents our current setup (H20 with TP=8 running DeepSeek-R1) from being rebased onto the latest version. In the older version, `cutlass_moe_fp8()` allocates a relatively small amount of memory, but the new implementation allocates at least 24.5 GB, which appears abnormal. Our current testing environment doesn't have sufficient memory to run it. If verification against origin/master is required, we would need help from the community to fix this issue first.
Future work
After this version is merged, we plan to introduce shape-, architecture-, and model-specific kernel tuning to further optimize the Cutlass MoE FP8 kernel. This includes releasing H20-specific tuning parameters, as this version does not yet include any such tuning. On the model side, we will provide tailored tuning parameters optimized for DeepSeek-R1 to better adapt to its unique workload characteristics. These enhancements will help us fully leverage the performance potential of the Cutlass kernel across various deployment scenarios.
The Cutlass kernel in this version uses per-tensor scaling; once it is merged, we will release a per-block version to support users with strict accuracy requirements.