
Conversation


@zejunchen-zejun zejunchen-zejun commented Nov 20, 2025

Integrate the FP4 BMM and unify it under the env flag VLLM_ROCM_USE_AITER_BMM.
When VLLM_ROCM_USE_AITER_BMM=1 (the default), the logic is:
- When the weight in the attention part is BF16 dtype, the FP8 BMM is called.
- When the weight in the attention part is U8 dtype, the FP4 BMM is called.

When VLLM_ROCM_USE_AITER_BMM=0, the torch BMM is used.
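The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the backend names and the `select_bmm_backend` helper are hypothetical, and the real implementation dispatches on torch dtypes rather than strings.

```python
def select_bmm_backend(weight_dtype: str, use_aiter_bmm: bool = True) -> str:
    """Illustrative dispatch: pick a BMM backend from the attention
    weight's dtype and the VLLM_ROCM_USE_AITER_BMM flag (hypothetical
    backend names)."""
    if not use_aiter_bmm:
        # VLLM_ROCM_USE_AITER_BMM=0: always fall back to torch BMM.
        return "torch_bmm"
    # BF16 weights take the FP8 BMM path; U8 (packed FP4) weights
    # take the FP4 BMM path; anything else falls back to torch.
    return {"bfloat16": "fp8_bmm", "uint8": "fp4_bmm"}.get(
        weight_dtype, "torch_bmm"
    )
```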

For the model DeepSeek-R1-MXFP4-Preview, whose kv_b_proj weight is U8, the FP4 BMM is used. The associated accuracy and performance numbers are below.

The FP4 accuracy is:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9515 | ± 0.0059 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.9507 | ± 0.0060 |

The FP4 BMM performance versus the FP8 BMM baseline:

| Metric | FP4 BMM | FP8 BMM (baseline) |
|--------|---------|--------------------|
| Request throughput (req/s) | 1.22 | 1.21 |
| Mean TTFT (ms) | 6474.98 | 6732.17 |
| Mean TPOT (ms) | 44.95 | 45.16 |

@zejunchen-zejun zejunchen-zejun marked this pull request as draft November 20, 2025 04:15
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/add_fp4_bmm_for_dev_perf branch 6 times, most recently from 9e200a0 to b4152da Compare November 26, 2025 01:50
Signed-off-by: zejunchen-zejun <[email protected]>
@zejunchen-zejun zejunchen-zejun force-pushed the zejun/add_fp4_bmm_for_dev_perf branch from b4152da to 8eb3c34 Compare November 26, 2025 03:41
@zejunchen-zejun zejunchen-zejun marked this pull request as ready for review November 26, 2025 03:59
@zejunchen-zejun
Author

zejunchen-zejun commented Nov 26, 2025

Hi @ZhiweiYan-96,
Here is the PR for the FP4 BMM.
The dequant method in post-processing is not very efficient; you could modify it to use bit shifting and sub-byte storage when post-processing the U8 weight.

Finally, you can upstream this PR to the vLLM community.

```python
    return False


def quant_to_mxfp4(x):
```


There should be some existing utils in vLLM for quant & dequant; consider reusing them instead of adding a new helper.
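For reference, a minimal fake-quantize round trip for the MXFP4 format (blocks of 32 elements sharing one power-of-two E8M0 scale, with each element snapped to an E2M1 value) can be sketched as below. This is an illustration of the format only, assuming pure Python; it is not the PR's `quant_to_mxfp4`, and vLLM's actual quant/dequant utils may differ.

```python
import math

# Magnitudes representable by an E2M1 (FP4) element in MXFP4.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quant_dequant_mxfp4(values, block=32):
    """Fake-quantize a flat list to MXFP4 and dequantize back: each
    block of `block` elements shares one power-of-two scale (E8M0);
    each element is snapped to the nearest E2M1 magnitude, preserving
    its sign."""
    out = []
    for start in range(0, len(values), block):
        blk = values[start:start + block]
        amax = max(abs(v) for v in blk)
        if amax == 0.0:
            out.extend(0.0 for _ in blk)
            continue
        # Smallest power-of-two scale such that amax / scale <= 6
        # (6 is the largest E2M1 magnitude).
        scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
        for v in blk:
            mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append(math.copysign(mag * scale, v))
    return out
```

Values that are exactly representable at the chosen block scale survive the round trip unchanged; everything else is snapped to the nearest representable point, which is where the dequant cost (and the bit-shift/sub-byte optimization mentioned above) comes in.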
