[Hardware][AMD] integrate aiter chunked prefill into vllm #18596
Conversation
Signed-off-by: fsx950223 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests is run to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: fsx950223 <[email protected]>
try:
    import aiter as rocm_aiter
@Zzz9990
I don't think we need the try/except statement here, because the registration must work, given that vLLM is going to deprecate V0. If the registration does not work when aiter is present in a ROCm environment, that would indicate a bug.
An example unit test that checks whether the registration works: https://github.com/vllm-project/vllm/blob/main/tests/kernels/moe/test_rocm_aiter_topk.py
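For reference, a registration check in the spirit of that test could look roughly like the following. This is a minimal sketch only; the op name rocm_aiter_flash_attn_varlen_func is an assumed placeholder, not necessarily the name used in this PR.

import pytest
import torch

from vllm.platforms import current_platform


@pytest.mark.skipif(not current_platform.is_rocm(),
                    reason="AITER ops are only registered on ROCm")
def test_rocm_aiter_mha_op_registered():
    # On ROCm with aiter installed, registration is expected to happen at
    # import time, so a missing op here points to a real bug rather than a
    # condition to silently swallow with try/except.
    assert hasattr(torch.ops.vllm, "rocm_aiter_flash_attn_varlen_func")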
Done
Signed-off-by: fsx950223 <[email protected]>
max_seqlen_k = attn_metadata.max_seq_len
block_table = attn_metadata.block_table

if max_seqlen_q > 1:
Did you mean max_query_len here, to check that there are prefills in the batch to process in flash_attn_varlen_func?
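To spell out the intent of that suggestion, here is a standalone sketch with mocked metadata fields (FakeAttnMetadata is illustrative only, not the backend's real metadata class):

from dataclasses import dataclass


@dataclass
class FakeAttnMetadata:
    max_query_len: int  # longest new query in the batch
    max_seq_len: int    # longest total sequence (cached context + query)


def batch_has_prefills(meta: FakeAttnMetadata) -> bool:
    # A decode-only batch has one-token queries, so max_query_len == 1;
    # anything larger means flash_attn_varlen_func has prefill work to do.
    return meta.max_query_len > 1


assert batch_has_prefills(FakeAttnMetadata(max_query_len=128, max_seq_len=2048))
assert not batch_has_prefills(FakeAttnMetadata(max_query_len=1, max_seq_len=2048))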
Thanks for the contribution. The FP8 speedups look good! It looks like you can clean up the backend a bit by removing the cascade attention logic, but otherwise this PR looks reasonable.
    )
    return output
else:
    raise NotImplementedError(
Since cascade attention is not supported, can you just remove all of the setup code for it?
) -> None:
    if blocksparse_params is not None:
        raise ValueError(
            "FlashAttention does not support block-sparse attention.")
Can you update this to "AiterFlashAttention does not support block-sparse attention"? Can you do this throughout the file to make the error messages and warnings more explicit?
support_head_sizes = AiterFlashAttentionBackend.get_supported_head_sizes()
if head_size not in support_head_sizes:
    raise ValueError(
        f"Head size {head_size} is not supported by FlashAttention. "
Should we update this to "AiterFlashAttention"?
"triton_attn.TritonAttentionBackend") | ||
if envs.VLLM_ROCM_USE_AITER and envs.VLLM_ROCM_USE_AITER_MHA \ | ||
and on_mi250_mi300(): | ||
logger.info("Using Flash Attention backend on V1 engine.") |
Should we update this to "AiterFlashAttention"?
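Collecting the three rename suggestions above, the updated messages would read roughly as follows. This is a sketch with mocked values so it runs standalone; the real checks live in the new backend.

import logging

logger = logging.getLogger("vllm.aiter_flash_attn")

# Mocked values for the sketch; in the backend these come from the config.
blocksparse_params = None
head_size = 128
support_head_sizes = [64, 128, 256]

if blocksparse_params is not None:
    raise ValueError(
        "AiterFlashAttention does not support block-sparse attention.")

if head_size not in support_head_sizes:
    raise ValueError(
        f"Head size {head_size} is not supported by AiterFlashAttention. "
        f"Supported head sizes are: {support_head_sizes}.")

logger.info("Using AiterFlashAttention backend on V1 engine.")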
    v=v,
    cu_seqlens_q=cu_seqlens_q,
    max_seqlen_q=max_seqlen_q,
    min_seqlen_q=1,
This min_seqlen_q appears to be a new parameter in aiter.
Which aiter commit is required for it to work?
How is it guarded against the aiter version that is currently in our Docker images?
648764942e552a8bb5fe16026703716a81f05374
@fsx950223 can you also include this fix in vLLM, since AITER commit 648764942e552a8bb5fe16026703716a81f05374 changes the fused-MoE quantization enum?
vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py
Please add this line:
class QuantMethod(IntEnum):
    # This allows interfacing with the AITER QuantType enum without
    # importing QuantType from AITER globally. Note that these
    # quantization methods are supported in the AITER package; however,
    # not all of them are used in this module.
    NO = 0              # a16w16
    PER_TENSOR = 1      # w8a8 (per_Tensor)
    PER_TOKEN = 2       # w8a8/w8a4 (per_Token)
+   BLOCK_1X32 = 3      # block quantized w8a8 (per_1x32)
    BLOCK_1X128 = 4     # block quantized w8a8 (per_1x128)
    BLOCK_128x128 = 5   # block quantized w8a8 (per_128x128)
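For context, the reason the integer values matter is that vLLM hands plain ints to AITER instead of importing AITER's QuantType enum; a quick standalone check (hypothetical snippet, not part of the suggested diff):

from enum import IntEnum


class QuantMethod(IntEnum):
    NO = 0
    PER_TENSOR = 1
    PER_TOKEN = 2
    BLOCK_1X32 = 3
    BLOCK_1X128 = 4
    BLOCK_128x128 = 5


# IntEnum members compare equal to plain ints, so they can be handed
# straight to kernels that expect the raw enum value.
assert QuantMethod.BLOCK_1X32 == 3
assert int(QuantMethod.BLOCK_128x128) == 5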
Then try to run DeepSeek-V3/R1 to validate the code path:
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
vllm serve \
--model deepseek-ai/DeepSeek-V3 \
-tp 8 \
--no-enable-chunked-prefill \
--max-model-len 32768 \
--trust-remote-code \
--block-size 1 \
--max_seq_len_to_capture 32768 \
--swap-space 16 --distributed-executor-backend mp
@fsx950223 can we know why min_seqlen_q=1 here, while the default value is 0 in the aiter package? Would changing this minimum to a higher number affect performance? What happens if min_seqlen_q is set to the same value as max_seqlen_q? Asking because this kernel is also used in the ROCM_AITER_MLA backend for chunked prefill, as shown below for both the V0 and V1 engines.
vllm/v1/attention/backends/mla/rocm_aiter_mla.py (lines 160 to 179 at 4f4a6b8):
        from aiter import flash_attn_varlen_func
        self.flash_attn_varlen_func = flash_attn_varlen_func

    def _flash_attn_varlen_diff_headdims(self,
                                         q,
                                         k,
                                         v,
                                         return_softmax_lse=False,
                                         softmax_scale=None,
                                         **kwargs):
        output = self.flash_attn_varlen_func(
            q=q,
            k=k,
            v=v,
            softmax_scale=softmax_scale,
            return_lse=return_softmax_lse,
            **kwargs,
        )
        return output
vllm/attention/backends/rocm_aiter_mla.py (lines 388 to 402 at 4f4a6b8):
        from aiter import flash_attn_varlen_func
        self.flash_attn_varlen_func = flash_attn_varlen_func

    def _flash_attn_varlen_diff_headdims(
            self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
            softmax_scale: float, return_softmax_lse: bool,
            **kwargs) -> Union[tuple[torch.Tensor, ...], torch.Tensor]:
        output = self.flash_attn_varlen_func(
            q,
            k,
            v,
            **kwargs,
        )
        return output
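On the earlier question of how min_seqlen_q is guarded against the AITER version shipped in the current Docker images: one possible guard is to check the installed function's signature before forwarding the new kwarg. This is a minimal sketch under the assumption that inspect.signature can introspect aiter's flash_attn_varlen_func wrapper; it is not code from this PR.

import inspect

from aiter import flash_attn_varlen_func

# True only when the installed aiter build already knows the new kwarg
# (required from commit 648764942e552a8bb5fe16026703716a81f05374 onwards).
_HAS_MIN_SEQLEN_Q = (
    "min_seqlen_q" in inspect.signature(flash_attn_varlen_func).parameters)


def varlen_attention(q, k, v, cu_seqlens_q, max_seqlen_q, **kwargs):
    # Older aiter builds reject unknown kwargs, so only forward
    # min_seqlen_q when the signature advertises it.
    if _HAS_MIN_SEQLEN_Q:
        kwargs.setdefault("min_seqlen_q", 1)
    return flash_attn_varlen_func(
        q=q,
        k=k,
        v=v,
        cu_seqlens_q=cu_seqlens_q,
        max_seqlen_q=max_seqlen_q,
        **kwargs)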
Signed-off-by: fsx950223 <[email protected]>
@Zzz9990 can we have a more holistic evaluation of whether this AITER MHA benefits most models in general, compared with chunked-prefill attention, before we turn AITER MHA on by default? What you tested with might have been unified attention instead. On my end, chunked-prefill attention is in many cases better than unified attention, e.g. on Qwen3 235B FP8 with V1. Some representative models suggested to be evaluated: Moreover, I noticed that this feature does not support an 8-bit KV cache. Could we have more of your thoughts on this? I would appreciate a discussion on this matter to bring out the best user experience when it comes to using AITER kernels. Does it benefit FP8 models only, or BF16 and FP16 as well?
CMD:
VLLM_TORCH_PROFILER_DIR=/mnt/raid0/sixifang/vllm/vllm_profile \
HIP_VISIBLE_DEVICES=4,5,6,7 \
VLLM_ROCM_USE_AITER=1 \
VLLM_USE_V1=1 \
vllm serve /models/models--amd--Meta-Llama-3.1-8B-Instruct-FP8-KV/snapshots/fa42f9a9105c545755fea25cf69f49ac8c8b40e1/ \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --disable-log-requests \
    --block-size 16 \
    --max-model-len 32768 \
    --dtype float16 \
    --quantization fp8 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens=8192
Performance with aiter:
Performance without aiter: