
Conversation


@zejunchen-zejun zejunchen-zejun commented Nov 20, 2025

Integrate the FP4 BMM and unify it under the env flag VLLM_ROCM_USE_AITER_BMM.
When VLLM_ROCM_USE_AITER_BMM=1 (the default), the logic is:
- When the weight in the attention part is BF16 dtype, the FP8 BMM is called.
- When the weight in the attention part is U8 dtype, the FP4 BMM is called.

When VLLM_ROCM_USE_AITER_BMM=0, the torch BMM is used.
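The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the backend names and the `select_bmm_backend` helper are hypothetical, and the real implementation dispatches on torch dtypes rather than strings.

```python
def select_bmm_backend(weight_dtype: str, use_aiter_bmm: bool = True) -> str:
    """Illustrative dispatch: pick a BMM backend from the attention
    weight's dtype and the VLLM_ROCM_USE_AITER_BMM flag (hypothetical
    backend names)."""
    if not use_aiter_bmm:
        # VLLM_ROCM_USE_AITER_BMM=0: always fall back to torch BMM.
        return "torch_bmm"
    # BF16 weights take the FP8 BMM path; U8 (packed FP4) weights
    # take the FP4 BMM path; anything else falls back to torch.
    return {"bfloat16": "fp8_bmm", "uint8": "fp4_bmm"}.get(
        weight_dtype, "torch_bmm"
    )
```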

For the model DeepSeek-R1-MXFP4-Preview, whose kv_b_proj weight is U8, the FP4 BMM is used. The associated accuracy and performance numbers are below.

The FP4 accuracy is:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9515 | ± 0.0059 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.9507 | ± 0.0060 |

The FP4 BMM performance versus the FP8 BMM baseline:

| Metric | FP4 BMM | FP8 BMM (baseline) |
|--------|---------|--------------------|
| Request throughput (req/s) | 1.22 | 1.21 |
| Mean TTFT (ms) | 6474.98 | 6732.17 |
| Mean TPOT (ms) | 44.95 | 45.16 |

@zejunchen-zejun zejunchen-zejun marked this pull request as draft November 20, 2025 04:15
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/add_fp4_bmm_for_dev_perf branch 6 times, most recently from 9e200a0 to b4152da Compare November 26, 2025 01:50
Signed-off-by: zejunchen-zejun <[email protected]>
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/add_fp4_bmm_for_dev_perf branch from b4152da to 8eb3c34 Compare November 26, 2025 03:41
@zejunchen-zejun zejunchen-zejun marked this pull request as ready for review November 26, 2025 03:59
@zejunchen-zejun
Author

zejunchen-zejun commented Nov 26, 2025

Hi @ZhiweiYan-96,
Here is the PR for the FP4 BMM.
The dequant method in post-processing is not very efficient; you could modify it to use bit shifting and sub-byte storage when post-processing the U8 weight.

Finally, you can upstream this PR to the vLLM community.

```python
    return False


def quant_to_mxfp4(x):
```


There should be some existing utils in vLLM for quant & dequant; consider reusing them instead of adding a new helper.
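For reference, a minimal fake-quantize round trip for the MXFP4 format (blocks of 32 elements sharing one power-of-two E8M0 scale, with each element snapped to an E2M1 value) can be sketched as below. This is an illustration of the format only, assuming pure Python; it is not the PR's `quant_to_mxfp4`, and vLLM's actual quant/dequant utils may differ.

```python
import math

# Magnitudes representable by an E2M1 (FP4) element in MXFP4.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quant_dequant_mxfp4(values, block=32):
    """Fake-quantize a flat list to MXFP4 and dequantize back: each
    block of `block` elements shares one power-of-two scale (E8M0);
    each element is snapped to the nearest E2M1 magnitude, preserving
    its sign."""
    out = []
    for start in range(0, len(values), block):
        blk = values[start:start + block]
        amax = max(abs(v) for v in blk)
        if amax == 0.0:
            out.extend(0.0 for _ in blk)
            continue
        # Smallest power-of-two scale such that amax / scale <= 6
        # (6 is the largest E2M1 magnitude).
        scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
        for v in blk:
            mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append(math.copysign(mag * scale, v))
    return out
```

Values that are exactly representable at the chosen block scale survive the round trip unchanged; everything else is snapped to the nearest representable point, which is where the dequant cost (and the bit-shift/sub-byte optimization mentioned above) comes in.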
