Skip to content

Conversation

@rasmith
Copy link
Contributor

@rasmith rasmith commented Jan 23, 2026

Purpose

This PR updates concat_and_cache_ds_mla_kernel to use the correct scale divisor when running on gfx942 architectures on ROCm. The tile_size divisor was 448.0 which does not work on AMD platforms with arch of gfx942, e.g. MI300, MI325. This PR updates the divisor to be 224.0 on gfx942.

Additionally, I consolidated the scale divisor into a constexpr float.

Test Plan

Use lm_eval to check accuracy on DeepSeek-R1 model.

Command to run:
lm_eval --model vllm --model_args pretrained=/models/DeepSeek-R1,max_length=8192,tensor_parallel_size=8 --batch_size auto --tasks gsm8k --num_fewshot 5

Test Result

I did a comparison using lm_eval for with and without this PR.

I ran:
lm_eval --model vllm --model_args pretrained=/models/DeepSeek-R1,max_length=8192,tensor_parallel_size=8 --batch_size auto --tasks gsm8k --num_fewshot 5
on MI300 and got the following results:

with this PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9575|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.9583|±  |0.0055|

without this PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9522|±  |0.0059|
|     |       |strict-match    |     5|exact_match|↑  |0.9522|±  |0.0059|

Essential Elements of an Effective PR Description Checklist
  • [ X] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [ X] The test plan, such as providing test command.
  • [ X] The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Randall Smith <ransmith@amd.com>
@mergify mergify bot added rocm Related to AMD ROCm bug Something isn't working labels Jan 23, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses a bug in concat_and_cache_ds_mla_kernel for the gfx942 architecture by using the appropriate scaling factor. The change is consistent with other parts of the codebase for the same architecture. My review includes one suggestion to improve maintainability by refactoring the hardcoded scaling factors into a constant, which will make the code easier to read and prevent potential future inconsistencies.

Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Copy link
Collaborator

@tjtanaa tjtanaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants