[Hardware][AMD] integrate aiter chunked prefill into vllm #18596
Conversation
Signed-off-by: fsx950223 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests is run to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: fsx950223 <[email protected]>
try:
    import aiter as rocm_aiter
@Zzz9990
I don't think we need the try/except statement here, because the registration must work, given that vLLM is going to deprecate V0. If the registration does not work when aiter is present in a ROCm environment, that would indicate a bug.
An example unit test that checks whether the registration works: https://github.com/vllm-project/vllm/blob/main/tests/kernels/moe/test_rocm_aiter_topk.py
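For reference, a registration check in the spirit of that test could look roughly like the following. This is a minimal sketch only; the op name rocm_aiter_flash_attn_varlen_func is an assumed placeholder, not necessarily the name used in this PR.

import pytest
import torch

from vllm.platforms import current_platform


@pytest.mark.skipif(not current_platform.is_rocm(),
                    reason="AITER ops are only registered on ROCm")
def test_rocm_aiter_mha_op_registered():
    # On ROCm with aiter installed, registration is expected to happen at
    # import time, so a missing op here points to a real bug rather than a
    # condition to silently swallow with try/except.
    assert hasattr(torch.ops.vllm, "rocm_aiter_flash_attn_varlen_func")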
Done
Signed-off-by: fsx950223 <[email protected]>
max_seqlen_k = attn_metadata.max_seq_len
block_table = attn_metadata.block_table

if max_seqlen_q > 1:
Did you mean max_query_len here, to check that there are prefills in the batch to process in flash_attn_varlen_func?
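To spell out the intent of that suggestion, here is a standalone sketch with mocked metadata fields (FakeAttnMetadata is illustrative only, not the backend's real metadata class):

from dataclasses import dataclass


@dataclass
class FakeAttnMetadata:
    max_query_len: int  # longest new query in the batch
    max_seq_len: int    # longest total sequence (cached context + query)


def batch_has_prefills(meta: FakeAttnMetadata) -> bool:
    # A decode-only batch has one-token queries, so max_query_len == 1;
    # anything larger means flash_attn_varlen_func has prefill work to do.
    return meta.max_query_len > 1


assert batch_has_prefills(FakeAttnMetadata(max_query_len=128, max_seq_len=2048))
assert not batch_has_prefills(FakeAttnMetadata(max_query_len=1, max_seq_len=2048))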
Thanks for the contribution. The FP8 speedups look good! It looks like you can clean up the backend a bit by removing the cascade attention logic, but otherwise this PR looks reasonable.
    )
    return output
else:
    raise NotImplementedError(
Since cascade attention is not supported, can you just remove all of the setup code for it?
) -> None:
    if blocksparse_params is not None:
        raise ValueError(
            "FlashAttention does not support block-sparse attention.")
Can you update this to "AiterFlashAttention does not support block-sparse attention"? Can you do this throughout the file to make the error messages and warnings more explicit?
support_head_sizes = AiterFlashAttentionBackend.get_supported_head_sizes()
if head_size not in support_head_sizes:
    raise ValueError(
        f"Head size {head_size} is not supported by FlashAttention. "
Should we update this to "AiterFlashAttention"?
"triton_attn.TritonAttentionBackend") | ||
if envs.VLLM_ROCM_USE_AITER and envs.VLLM_ROCM_USE_AITER_MHA \ | ||
and on_mi250_mi300(): | ||
logger.info("Using Flash Attention backend on V1 engine.") |
Should we update this to "AiterFlashAttention"?
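Collecting the three rename suggestions above, the updated messages would read roughly as follows. This is a sketch with mocked values so it runs standalone; the real checks live in the new backend.

import logging

logger = logging.getLogger("vllm.aiter_flash_attn")

# Mocked values for the sketch; in the backend these come from the config.
blocksparse_params = None
head_size = 128
support_head_sizes = [64, 128, 256]

if blocksparse_params is not None:
    raise ValueError(
        "AiterFlashAttention does not support block-sparse attention.")

if head_size not in support_head_sizes:
    raise ValueError(
        f"Head size {head_size} is not supported by AiterFlashAttention. "
        f"Supported head sizes are: {support_head_sizes}.")

logger.info("Using AiterFlashAttention backend on V1 engine.")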
    v=v,
    cu_seqlens_q=cu_seqlens_q,
    max_seqlen_q=max_seqlen_q,
    min_seqlen_q=1,
This min_seqlen_q appears to be a new parameter in aiter.
Which aiter commit is required for it to work?
How is it guarded against the aiter version that is currently in our Docker images?
648764942e552a8bb5fe16026703716a81f05374
@fsx950223 can you also include this fix in vLLM, since AITER commit 648764942e552a8bb5fe16026703716a81f05374 changes the fused-MoE quantization enum?
vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py
Please add this line:
class QuantMethod(IntEnum):
    # This allows interfacing with the AITER QuantType enum without
    # importing QuantType from AITER globally. Note that these
    # quantization methods are supported in the AITER package; however,
    # not all of them are used in this module.
    NO = 0              # a16w16
    PER_TENSOR = 1      # w8a8 (per_Tensor)
    PER_TOKEN = 2       # w8a8/w8a4 (per_Token)
+   BLOCK_1X32 = 3      # block quantized w8a8 (per_1x32)
    BLOCK_1X128 = 4     # block quantized w8a8 (per_1x128)
    BLOCK_128x128 = 5   # block quantized w8a8 (per_128x128)
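For context, the reason the integer values matter is that vLLM hands plain ints to AITER instead of importing AITER's QuantType enum; a quick standalone check (hypothetical snippet, not part of the suggested diff):

from enum import IntEnum


class QuantMethod(IntEnum):
    NO = 0
    PER_TENSOR = 1
    PER_TOKEN = 2
    BLOCK_1X32 = 3
    BLOCK_1X128 = 4
    BLOCK_128x128 = 5


# IntEnum members compare equal to plain ints, so they can be handed
# straight to kernels that expect the raw enum value.
assert QuantMethod.BLOCK_1X32 == 3
assert int(QuantMethod.BLOCK_128x128) == 5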
Then try to run DeepSeek-V3/R1 to validate the code path:
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
vllm serve \
--model deepseek-ai/DeepSeek-V3 \
-tp 8 \
--no-enable-chunked-prefill \
--max-model-len 32768 \
--trust-remote-code \
--block-size 1 \
--max_seq_len_to_capture 32768 \
--swap-space 16 --distributed-executor-backend mp
@fsx950223 can we know why min_seqlen_q=1 here, while the default value is 0 in the aiter package? Would changing this minimum to a higher number affect performance? What happens if min_seqlen_q is set to the same value as max_seqlen_q? Asking because this kernel is also used in the ROCM_AITER_MLA backend for chunked prefill, as shown below for both the V0 and V1 engines.
vllm/v1/attention/backends/mla/rocm_aiter_mla.py (lines 160 to 179 at 4f4a6b8):
        from aiter import flash_attn_varlen_func
        self.flash_attn_varlen_func = flash_attn_varlen_func

    def _flash_attn_varlen_diff_headdims(self,
                                         q,
                                         k,
                                         v,
                                         return_softmax_lse=False,
                                         softmax_scale=None,
                                         **kwargs):
        output = self.flash_attn_varlen_func(
            q=q,
            k=k,
            v=v,
            softmax_scale=softmax_scale,
            return_lse=return_softmax_lse,
            **kwargs,
        )
        return output
vllm/attention/backends/rocm_aiter_mla.py (lines 388 to 402 at 4f4a6b8):
        from aiter import flash_attn_varlen_func
        self.flash_attn_varlen_func = flash_attn_varlen_func

    def _flash_attn_varlen_diff_headdims(
            self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
            softmax_scale: float, return_softmax_lse: bool,
            **kwargs) -> Union[tuple[torch.Tensor, ...], torch.Tensor]:
        output = self.flash_attn_varlen_func(
            q,
            k,
            v,
            **kwargs,
        )
        return output
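On the earlier question of how min_seqlen_q is guarded against the AITER version shipped in the current Docker images: one possible guard is to check the installed function's signature before forwarding the new kwarg. This is a minimal sketch under the assumption that inspect.signature can introspect aiter's flash_attn_varlen_func wrapper; it is not code from this PR.

import inspect

from aiter import flash_attn_varlen_func

# True only when the installed aiter build already knows the new kwarg
# (required from commit 648764942e552a8bb5fe16026703716a81f05374 onwards).
_HAS_MIN_SEQLEN_Q = (
    "min_seqlen_q" in inspect.signature(flash_attn_varlen_func).parameters)


def varlen_attention(q, k, v, cu_seqlens_q, max_seqlen_q, **kwargs):
    # Older aiter builds reject unknown kwargs, so only forward
    # min_seqlen_q when the signature advertises it.
    if _HAS_MIN_SEQLEN_Q:
        kwargs.setdefault("min_seqlen_q", 1)
    return flash_attn_varlen_func(
        q=q,
        k=k,
        v=v,
        cu_seqlens_q=cu_seqlens_q,
        max_seqlen_q=max_seqlen_q,
        **kwargs)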
Signed-off-by: fsx950223 <[email protected]>
@Zzz9990 can we have a more holistic evaluation of whether this AITER MHA benefits most models in general, compared with chunked-prefill attention, before we turn AITER MHA on by default? What you tested with might have been unified attention instead. On my end, chunked-prefill attention is in many cases better than unified attention, e.g. on Qwen3 235B FP8 with V1. Some representative models suggested to be evaluated: Moreover, I noticed that this feature does not support an 8-bit KV cache. Could we have more of your thoughts on this? I would appreciate a discussion on this matter to bring out the best user experience when it comes to using AITER kernels. Does it benefit FP8 models only, or BF16 and FP16 as well?
CMD:
VLLM_TORCH_PROFILER_DIR=/mnt/raid0/sixifang/vllm/vllm_profile \
HIP_VISIBLE_DEVICES=4,5,6,7 \
VLLM_ROCM_USE_AITER=1 \
VLLM_USE_V1=1 \
vllm serve /models/models--amd--Meta-Llama-3.1-8B-Instruct-FP8-KV/snapshots/fa42f9a9105c545755fea25cf69f49ac8c8b40e1/ \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --disable-log-requests \
    --block-size 16 \
    --max-model-len 32768 \
    --dtype float16 \
    --quantization fp8 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens=8192
Performance with aiter:
Performance without aiter: