[Kernel] Remove cumsum in groupedmatmul #987
Conversation
 gate_up_out_list = torch_npu.npu_grouped_matmul(
     x=[expand_x],
     weight=[w1],
     split_item=2,
-    group_list_type=0,
+    group_list_type=1,
Maybe we could add some comments to explain why group_list_type=1
is used here, e.g., to avoid the cumulative calculation of the group list.
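For example, an inline comment along these lines might work. This is only a sketch: the group_list argument is assumed from the surrounding code, and the group_list_type semantics (per-expert counts instead of a cumulative list) are inferred from the PR context.

# group_list_type=1: `group_list` carries per-expert token counts directly,
# so the cumulative-sum (cumsum) over the per-expert counts that the
# group_list_type=0 layout would need is avoided.
gate_up_out_list = torch_npu.npu_grouped_matmul(
    x=[expand_x],
    weight=[w1],
    split_item=2,
    group_list_type=1,
    group_list=group_list,
)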
fixed
vllm_ascend/ops/fused_moe.py
Outdated
if VLLM_ENABLE_FIX_ROUTE:
    uniform_group_list = hidden_states.shape[0] * all_to_all_group_size * top_k // moe_expert_num
    group_list = torch.Tensor([uniform_group_list] * w1.shape[0]).long().npu()
Suggested change:
- group_list = torch.Tensor([uniform_group_list] * w1.shape[0]).long().npu()
+ group_list = torch.Tensor([uniform_group_list] * w1.shape[0]).long().to(hidden_states.device)
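The design point behind this suggestion: deriving the device from an existing tensor keeps the op device-agnostic, while .npu() hard-codes the Ascend backend. A minimal, self-contained sketch (shapes and values are hypothetical):

import torch

hidden_states = torch.randn(8, 16)   # stand-in for the real activations
uniform_group_list = 4               # hypothetical per-expert token count
num_experts = 3                      # hypothetical expert count (w1.shape[0] in the real code)

# Following hidden_states.device places the tensor wherever the inputs live
# (NPU, GPU, or CPU), instead of pinning it to the NPU backend.
group_list = torch.tensor([uniform_group_list] * num_experts,
                          dtype=torch.long,
                          device=hidden_states.device)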
fixed
vllm_ascend/ops/fused_moe.py
Outdated
uniform_topk_list = [
    (i + rank) % moe_expert_num for i in range(rank * step, (rank + 1) * step)
]
topk_ids = torch.Tensor(uniform_topk_list).int().view(hidden_states.shape[0], -1).npu()
Suggested change:
- topk_ids = torch.Tensor(uniform_topk_list).int().view(hidden_states.shape[0], -1).npu()
+ topk_ids = torch.Tensor(uniform_topk_list).int().view(hidden_states.shape[0], -1).to(hidden_states.device)
fixed
overall lgtm now, thanks!
@wangxiyuan @Yikun @ganyi1996ppo hello, please review this PR; @MengqingCao has already replied LGTM.
vllm_ascend/ops/fused_moe.py
Outdated
@@ -50,6 +51,14 @@ def fused_experts_with_mc2(
 ) -> torch.Tensor:
     global_bs = 0
     moe_expert_num = len(expert_map)
+
+    rank = torch.distributed.get_rank()
+    if VLLM_ENABLE_FIX_ROUTE:
Performance testing under the fix-route scenario seems pointless; do we really need this env variable in the repo?
Looks good except the fix-routing part; do we really need to add this? Although performance may get better with this enabled, that performance data means nothing in a real production environment, right? @hahazhky
Yes, we cannot enable this in a real production environment, but we can use it as a performance-debugging option to check the upper limit; as training continues, the time cost will approach this performance.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
vllm_ascend/envs.py
Outdated
"VLLM_ENABLE_FIX_ROUTE": | ||
lambda: bool(int(os.getenv("VLLM_ENABLE_FIX_ROUTE", '0'))), |
"VLLM_ENABLE_FIX_ROUTE": | |
lambda: bool(int(os.getenv("VLLM_ENABLE_FIX_ROUTE", '0'))), | |
"VLLM_ASCEND_ENABLE_FIX_ROUTE": | |
lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_FIX_ROUTE", '0'))), |
deleted
@@ -61,3 +61,22 @@ def test_models_distributed_DeepSeek():
         distributed_executor_backend="mp",
     ) as vllm_model:
         vllm_model.generate_greedy(example_prompts, max_tokens)
+
+
+def test_models_distributed_fix_route_DeepSeek():
We should unset the env var after the test executes, e.g. with @patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_TOPK_OPTIMZE": "1"}) [1]:

Suggested change:
- def test_models_distributed_fix_route_DeepSeek():
+ @patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_TOPK_OPTIMZE": "1"})
+ def test_models_distributed_fix_route_DeepSeek():
[1] https://docs.python.org/3/library/unittest.mock.html#unittest.mock.patch.dict
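A minimal sketch of the behaviour being asked for; the test name and variable come from the discussion above, and the body is a placeholder:

import os
from unittest.mock import patch

@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_TOPK_OPTIMZE": "1"})
def test_models_distributed_fix_route_DeepSeek():
    # The variable is visible only while the decorated test runs.
    assert os.environ["VLLM_ASCEND_ENABLE_TOPK_OPTIMZE"] == "1"

test_models_distributed_fix_route_DeepSeek()
# patch.dict restores os.environ afterwards, so the setting does not leak
# into other tests (assuming it was not already set in the environment).
assert "VLLM_ASCEND_ENABLE_TOPK_OPTIMZE" not in os.environ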
fixed
vllm_ascend/ops/fused_moe.py
Outdated
@@ -50,6 +51,14 @@ def fused_experts_with_mc2(
 ) -> torch.Tensor:
     global_bs = 0
     moe_expert_num = len(expert_map)
+
+    rank = torch.distributed.get_rank()
+    if VLLM_ENABLE_FIX_ROUTE:
Please check whether VLLM_ENABLE_FIX_ROUTE is still needed after afc4c0c (enable_force_load_balance).
Fix route is no longer needed; the relevant code has been deleted.
Force-pushed from a00ba7b to 936189e (compare)
LGTM except inline comments
"deepseek-ai/DeepSeek-V2-Lite", | ||
dtype=dtype, | ||
tensor_parallel_size=8, | ||
enable_expert_parallel=True, |
Could this change be more general, e.g. @pytest.mark.parametrize("enable_expert_parallel", [True, False])?
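A sketch of what the parametrized test might look like; VllmRunner, example_prompts, max_tokens, and dtype are assumed to be provided the same way as in the existing test file, and the other arguments mirror the snippet quoted above:

import pytest

@pytest.mark.parametrize("enable_expert_parallel", [True, False])
def test_models_distributed_DeepSeek(enable_expert_parallel):
    # Only the expert-parallel flag varies between the two runs; everything
    # else follows the existing DeepSeek-V2-Lite test.
    with VllmRunner(
            "deepseek-ai/DeepSeek-V2-Lite",
            dtype=dtype,
            tensor_parallel_size=8,
            enable_expert_parallel=enable_expert_parallel,
            distributed_executor_backend="mp",
    ) as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens)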
with VllmRunner(
    "deepseek-ai/DeepSeek-V2-Lite",
    dtype=dtype,
    tensor_parallel_size=8,
CI fails because only 4 cards are available, but you used tensor_parallel_size=8.
Force-pushed from 1ccff57 to ac92d4b (compare)
Signed-off-by: zhky <[email protected]>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 588cd4b to db1d487 (compare)
Force-pushed from 9a7bcaa to f057135 (compare)
What this PR does / why we need it?
Remove the cumsum operator in MoE to improve performance.

Does this PR introduce any user-facing change?

How was this patch tested?
It should be tested on a case with the MC2 operator and graph mode enabled.
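For illustration, a minimal sketch of the group-list difference behind the change, assuming group_list_type=0 expects a cumulative (prefix-sum) group list while group_list_type=1 takes per-expert token counts directly, as the diff above suggests (the values are hypothetical):

import torch

# Hypothetical number of tokens routed to each of four experts.
tokens_per_expert = torch.tensor([3, 5, 2, 6], dtype=torch.long)

# group_list_type=0 layout: a running total, which costs an extra cumsum.
group_list_cumsum = torch.cumsum(tokens_per_expert, dim=0)   # tensor([3, 8, 10, 16])

# group_list_type=1 layout: the counts are passed as-is, so the cumsum above
# (and the extra kernel it implies) is no longer needed.
group_list_counts = tokens_per_expert

# Either tensor would then be handed to torch_npu.npu_grouped_matmul through
# its group_list argument, together with the matching group_list_type.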