
[Hardware][AMD] integrate aiter chunked prefill into vllm #18596


Open · wants to merge 13 commits into main
Conversation

@Zzz9990 Zzz9990 commented May 23, 2025

CMD: VLLM_TORCH_PROFILER_DIR=/mnt/raid0/sixifang/vllm/vllm_profile HIP_VISIBLE_DEVICES=4,5,6,7 VLLM_ROCM_USE_AITER=1 VLLM_USE_V1=1 vllm serve /models/models--amd--Meta-Llama-3.1-8B-Instruct-FP8-KV/snapshots/fa42f9a9105c545755fea25cf69f49ac8c8b40e1/ --tensor-parallel-size 4 --gpu-memory-utilization 0.9 --trust-remote-code --disable-log-requests --block-size 16 --max-model-len 32768 --dtype float16 --quantization fp8 --no-enable-prefix-caching --max-num-batched-tokens=8192

Performance with aiter:

| prompts | isl | osl | con | req_throughput (req/s) | median_e2e (ms) | median_ttft (ms) | median_tpot (ms) | median_itl (ms) | output_tps | total_tps |
|---|---|---|---|---|---|---|---|---|---|---|
| 1280 | 2048 | 1024 | 128 | 2.18 | 58777.36 | 1041.87 | 56.22 | 42.28 | 2236.56 | 6707.49 |

Performance without aiter:

| prompts | isl | osl | con | req_throughput (req/s) | median_e2e (ms) | median_ttft (ms) | median_tpot (ms) | median_itl (ms) | output_tps | total_tps |
|---|---|---|---|---|---|---|---|---|---|---|
| 1280 | 2048 | 1024 | 128 | 0.94 | 135871.09 | 2621.44 | 129.89 | 94.26 | 964.17 | 2891.56 |


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label May 23, 2025

mergify bot commented May 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Zzz9990.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 23, 2025
@tjtanaa
Contributor

tjtanaa commented May 23, 2025

@Zzz9990 How does this PR differ from #17710?
Could you also rename the PR so that it is semantically related to the feature you are adding? "integrate aiter into vllm" is too broad and generic.

@Zzz9990 changed the title from "[Hardware][AMD] integrate aiter into vll" to "[Hardware][AMD] integrate aiter chunked prefill into vllm" on May 26, 2025
@Zzz9990 Zzz9990 closed this May 26, 2025
@Zzz9990 Zzz9990 reopened this May 26, 2025

mergify bot commented May 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Zzz9990.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

fsx950223 added 2 commits May 26, 2025 07:09
Signed-off-by: fsx950223 <[email protected]>
Signed-off-by: fsx950223 <[email protected]>
@fsx950223 fsx950223 requested a review from youkaichao as a code owner May 26, 2025 07:12
@mergify bot added the documentation and ci/build labels May 26, 2025
Signed-off-by: fsx950223 <[email protected]>

import aiter as rocm_aiter
try:


@Zzz9990
I don't think we need the try/except here. The registration must work, since vLLM is going to deprecate V0; if the registration fails while aiter is present in a ROCm environment, that would indicate a bug.

An example unit test that checks whether the registration works: https://github.com/vllm-project/vllm/blob/main/tests/kernels/moe/test_rocm_aiter_topk.py
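A minimal sketch of such a registration check, modeled loosely on the linked test. The module path and op name below are illustrative assumptions, not confirmed by this thread:

```python
# Sketch of a registration test, modeled loosely on
# tests/kernels/moe/test_rocm_aiter_topk.py. The module path and op name
# below are illustrative assumptions, not taken from this PR.
import pytest
import torch

from vllm.platforms import current_platform


@pytest.mark.skipif(not current_platform.is_rocm(),
                    reason="aiter is only available on ROCm")
def test_aiter_chunked_prefill_op_is_registered():
    # Importing the backend should register the custom op as a side effect,
    # so a failed registration surfaces as a hard test failure instead of
    # being silently swallowed by a try/except at import time.
    import vllm.v1.attention.backends.rocm_aiter_fa  # noqa: F401

    assert hasattr(torch.ops.vllm, "flash_attn_varlen_func")
```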


Done

Signed-off-by: fsx950223 <[email protected]>
max_seqlen_k = attn_metadata.max_seq_len
block_table = attn_metadata.block_table

if max_seqlen_q > 1:
Contributor


Did you mean max_query_len here, to check that there are prefills in the batch to process with flash_attn_varlen_func?
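For illustration, a decode-only batch has a maximum per-request query length of 1, which is what a prefill check needs to inspect. A small sketch (field names follow the excerpt above; not the backend's actual code):

```python
# Illustrative only: why the prefill check should look at the per-request
# query lengths rather than the maximum KV sequence length.
import torch


def max_query_len(query_start_loc: torch.Tensor) -> int:
    # query_start_loc is the cumulative-length layout used by varlen attention.
    return int((query_start_loc[1:] - query_start_loc[:-1]).max())


decode_only = torch.tensor([0, 1, 2, 3])   # three requests, one new token each
mixed_batch = torch.tensor([0, 128, 129])  # a 128-token prefill plus one decode

assert max_query_len(decode_only) == 1     # no prefill work in this batch
assert max_query_len(mixed_batch) > 1      # prefill present: varlen prefill path applies
```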

Contributor

@SageMoore SageMoore left a comment


Thanks for the contribution. The FP8 speedups look good! It looks like you can clean up the backend a bit by removing the cascade attention logic, but otherwise this PR looks reasonable.

)
    return output
else:
    raise NotImplementedError(
Contributor


Since cascade attention is not supported, can you just remove all of the setup code for it?

) -> None:
    if blocksparse_params is not None:
        raise ValueError(
            "FlashAttention does not support block-sparse attention.")
Contributor


Can you update this to "AiterFlashAttention does not support block-sparse attention"? Can you do this throughout the file so the error messages and warnings are more explicit?

AiterFlashAttentionBackend.get_supported_head_sizes()
if head_size not in support_head_sizes:
    raise ValueError(
        f"Head size {head_size} is not supported by FlashAttention. "
Contributor

@tjtanaa tjtanaa May 28, 2025


Should we update this to "AiterFlashAttention"?
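A self-contained sketch of the suggested wording, using a stand-in class with illustrative head sizes rather than the real backend class:

```python
# Self-contained sketch of the suggested message change; the class below is a
# stand-in with illustrative head sizes, not the real backend class.
class AiterFlashAttentionBackend:

    @staticmethod
    def get_supported_head_sizes() -> list[int]:
        return [64, 128, 256]  # illustrative values only


def check_head_size(head_size: int) -> None:
    support_head_sizes = AiterFlashAttentionBackend.get_supported_head_sizes()
    if head_size not in support_head_sizes:
        raise ValueError(
            f"Head size {head_size} is not supported by AiterFlashAttention. "
            f"Supported head sizes are: {support_head_sizes}.")


check_head_size(128)  # passes; an unsupported size raises the reworded error
```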

"triton_attn.TritonAttentionBackend")
if envs.VLLM_ROCM_USE_AITER and envs.VLLM_ROCM_USE_AITER_MHA \
        and on_mi250_mi300():
    logger.info("Using Flash Attention backend on V1 engine.")
Contributor


Should we update this to "AiterFlashAttention" ?

v=v,
cu_seqlens_q=cu_seqlens_q,
max_seqlen_q=max_seqlen_q,
min_seqlen_q=1,
Collaborator


This min_seqlen_q appears to be a new parameter in aiter.
Which aiter commit is required for it to work?
How is it guarded against the aiter version that is currently in our Docker images?
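One way such a guard could look (a sketch only, not code from this PR; it assumes inspect.signature can see the wrapper's parameters):

```python
# Sketch of a version guard (not code from this PR): only pass min_seqlen_q
# when the installed aiter build exposes it, so older Docker images fall back
# cleanly instead of raising a TypeError for an unexpected keyword argument.
import inspect

from aiter import flash_attn_varlen_func

_SUPPORTS_MIN_SEQLEN_Q = (
    "min_seqlen_q" in inspect.signature(flash_attn_varlen_func).parameters)


def varlen_prefill(q, k, v, cu_seqlens_q, max_seqlen_q, **kwargs):
    if _SUPPORTS_MIN_SEQLEN_Q:
        kwargs.setdefault("min_seqlen_q", 1)
    return flash_attn_varlen_func(
        q=q,
        k=k,
        v=v,
        cu_seqlens_q=cu_seqlens_q,
        max_seqlen_q=max_seqlen_q,
        **kwargs,
    )
```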


@fsx950223 fsx950223 May 29, 2025


648764942e552a8bb5fe16026703716a81f05374

Contributor

@tjtanaa tjtanaa May 30, 2025


@fsx950223 can you add this bug fix to vLLM? AITER commit 648764942e552a8bb5fe16026703716a81f05374 changes the fused MoE quantization enum in:

vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py

Please add this line:

class QuantMethod(IntEnum):
    # This allows interfacing with AITER QuantType Enum
    # without importing the QuantType from AITER globally.

    # Note that these quantization methods are
    # supported in AITER package. However,
    # not all are used in this module.

    NO = 0  # a16w16
    PER_TENSOR = 1  # w8a8 (pre_Tensor)
    PER_TOKEN = 2  # w8a8/w8a4 (per_Token)
+    BLOCK_1X32 = 3  # block quantized w8a8 (per_1x32)
    BLOCK_1X128 = 4  # block quantized w8a8 (per_1x128)
    BLOCK_128x128 = 5  # block quantized w8a8 (per_128x128)

Then try to run DeepSeek-V3/R1 to validate the code path:

VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
vllm serve \
--model deepseek-ai/DeepSeek-V3 \
-tp 8 \
--no-enable-chunked-prefill \
--max-model-len 32768 \
--trust-remote-code \
--block-size 1 \
--max_seq_len_to_capture 32768 \
--trust-remote-code --swap-space 16 --distributed-executor-backend mp

Contributor

@vllmellm vllmellm May 30, 2025


@fsx950223 can we know why min_seqlen_q=1 here, while the default value in the aiter package is 0? Would raising this minimum affect performance? And what happens if min_seqlen_q is set to the same value as max_seqlen_q? I'm asking because this kernel is also used in the ROCM_AITER_MLA backend for chunked prefill, as shown below for both the V0 and V1 engines.

vllm/v1/attention/backends/mla/rocm_aiter_mla.py

from aiter import flash_attn_varlen_func

self.flash_attn_varlen_func = flash_attn_varlen_func

def _flash_attn_varlen_diff_headdims(self,
                                     q,
                                     k,
                                     v,
                                     return_softmax_lse=False,
                                     softmax_scale=None,
                                     **kwargs):
    output = self.flash_attn_varlen_func(
        q=q,
        k=k,
        v=v,
        softmax_scale=softmax_scale,
        return_lse=return_softmax_lse,
        **kwargs,
    )
    return output

vllm/attention/backends/rocm_aiter_mla.py

from aiter import flash_attn_varlen_func

self.flash_attn_varlen_func = flash_attn_varlen_func

def _flash_attn_varlen_diff_headdims(
        self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
        softmax_scale: float, return_softmax_lse: bool,
        **kwargs) -> Union[tuple[torch.Tensor, ...], torch.Tensor]:
    output = self.flash_attn_varlen_func(
        q,
        k,
        v,
        **kwargs,
    )
    return output

Signed-off-by: fsx950223 <[email protected]>
@tjtanaa
Contributor

tjtanaa commented May 30, 2025

@Zzz9990 can we have a more holistic evaluation of whether this AITER MHA benefits most models in general, compared with chunked-prefill attention, before we turn AITER MHA on by default? What you tested against might have been unified attention instead. On my end, chunked-prefill attention is in many cases better than unified attention, e.g. on Qwen3 235B FP8 with V1.

Some representative models suggested for evaluation:

- Llama 3.3 70B
- Mixtral 8x7B Instruct
- Llama-4-Maverick-FP8

Moreover, I noticed that this feature does not support an 8-bit KV cache (--kv-cache-dtype).

Could we have more of your thoughts on this? I would appreciate a discussion on this matter to bring out the best user experience when it comes to using AITER kernels.

Does it benefit FP8 models only, or BF16 and FP16 models as well?

Labels: ci/build, documentation, v1
Projects: None yet
8 participants