
[gfx1201] add tuned ck_gemm_a8w8_blockscale configs for various qwen3 models and default case#1

Closed
big-yellow-duck wants to merge 1 commit into main from rdna4-quant-support

Conversation

@big-yellow-duck

Motivation

Add tuned configs for the gfx1201 ck_gemm_a8w8_blockscale kernel, targeting various Qwen3 model variants. gfx1201 supports the FP8 dtype, so these tuned configs speed up gemm_a8w8_blockscale for inference in vLLM.

Technical Details

Tuning Process

The tuning was performed using the CK GEMM tuner:

```shell
cd csrc/ck_gemm_a8w8_blockscale
python gemm_a8w8_blockscale_tune.py \
    -i /app/aiter/aiter/configs/a8w8_blockscale_untuned_gemm.csv \
    -o <output.csv>
```

The tuned configurations are added to the existing GEMM configuration files and are automatically selected based on the input tensor dimensions and the target architecture (gfx1201).
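Conceptually, shape-based selection amounts to a lookup keyed on the problem dimensions with a fallback when no tuned entry matches. The sketch below illustrates that idea only; the keys and kernel instance names are hypothetical, not the actual AITER config entries or dispatch code.

```python
# Hypothetical sketch of shape-based kernel config dispatch.
# Keys and kernel names are illustrative, not real AITER entries.
TUNED_CONFIGS = {
    # (M, N, K) -> tuned kernel instance name
    (1, 24576, 1536): "a8w8_blockscale_inst_small_m",
    (32, 24576, 1536): "a8w8_blockscale_inst_mid_m",
}

DEFAULT_CONFIG = "a8w8_blockscale_default"

def select_config(m: int, n: int, k: int) -> str:
    """Return the tuned kernel for an exact (M, N, K) match, else the default."""
    return TUNED_CONFIGS.get((m, n, k), DEFAULT_CONFIG)
```

An untuned shape such as (7, 7, 7) falls through to the default instance, which is why the PR also tunes the default case.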

Test Plan

The tuned kernels were validated using the GEMM test suite:

```shell
python op_tests/test_gemm_a8w8_blockscale.py --ck_preshuffle False
```

Tests cover various matrix dimensions (M: 1-10240, N: 24576, K: 1536) that are representative of Qwen3 inference workloads.
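The M sweep exercised by the results below can be enumerated as follows. This is a sketch reconstructing the shapes that appear in the table, not the shape list defined in `op_tests/test_gemm_a8w8_blockscale.py` itself (a hypothetical helper name is used).

```python
# Hypothetical helper enumerating the (M, N, K) shapes seen in the results table:
# small power-of-two M values, a 96..512 sweep in steps of 32, and large decode/prefill M.
def qwen3_test_shapes(n: int = 24576, k: int = 1536):
    ms = [1, 2, 4, 8, 16, 32, 64] \
        + list(range(96, 513, 32)) \
        + [1024, 2048, 4096, 6144, 8192, 10240]
    return [(m, n, k) for m in ms]
```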

Test Result

| dtype | M | N | K | ck_preshuffle | ck time (us) | ck TFLOPS | ck TB/s | ck err |
|---|---|---|---|---|---|---|---|---|
| torch.bfloat16 | 1 | 24576 | 1536 | False | 62.2419 | 1.21297 | 0.606509 | 0 |
| torch.bfloat16 | 2 | 24576 | 1536 | False | 63.0116 | 2.39631 | 0.599125 | 0 |
| torch.bfloat16 | 4 | 24576 | 1536 | False | 63.6247 | 4.74642 | 0.593399 | 0 |
| torch.bfloat16 | 8 | 24576 | 1536 | False | 64.1253 | 9.41875 | 0.588863 | 0 |
| torch.bfloat16 | 16 | 24576 | 1536 | False | 65.0905 | 18.5582 | 0.58032 | 0 |
| torch.bfloat16 | 32 | 24576 | 1536 | False | 68.2576 | 35.3942 | 0.553754 | 0 |
| torch.bfloat16 | 64 | 24576 | 1536 | False | 80.5625 | 59.9763 | 0.469785 | 0 |
| torch.bfloat16 | 96 | 24576 | 1536 | False | 122.323 | 59.251 | 0.309805 | 0 |
| torch.bfloat16 | 128 | 24576 | 1536 | False | 149.707 | 64.5505 | 0.253464 | 0 |
| torch.bfloat16 | 160 | 24576 | 1536 | False | 191.106 | 63.209 | 0.198814 | 0 |
| torch.bfloat16 | 192 | 24576 | 1536 | False | 225.946 | 64.1546 | 0.168375 | 0 |
| torch.bfloat16 | 224 | 24576 | 1536 | False | 255.417 | 66.211 | 0.14914 | 0 |
| torch.bfloat16 | 256 | 24576 | 1536 | False | 282.715 | 68.3633 | 0.134913 | 0 |
| torch.bfloat16 | 288 | 24576 | 1536 | False | 320.135 | 67.9191 | 0.119297 | 0 |
| torch.bfloat16 | 320 | 24576 | 1536 | False | 354.852 | 68.0824 | 0.107764 | 0 |
| torch.bfloat16 | 352 | 24576 | 1536 | False | 386.663 | 68.7293 | 0.0990251 | 0 |
| torch.bfloat16 | 384 | 24576 | 1536 | False | 415.049 | 69.8496 | 0.0923711 | 0 |
| torch.bfloat16 | 416 | 24576 | 1536 | False | 451.077 | 69.6265 | 0.0851023 | 0 |
| torch.bfloat16 | 448 | 24576 | 1536 | False | 484.864 | 69.7575 | 0.0792736 | 0 |
| torch.bfloat16 | 480 | 24576 | 1536 | False | 516.881 | 70.1105 | 0.0744581 | 0 |
| torch.bfloat16 | 512 | 24576 | 1536 | False | 544.023 | 71.0534 | 0.0708337 | 0 |
| torch.bfloat16 | 1024 | 24576 | 1536 | False | 1038.07 | 74.4739 | 0.0378794 | 0 |
| torch.bfloat16 | 2048 | 24576 | 1536 | False | 2032.53 | 76.072 | 0.02012 | 0 |
| torch.bfloat16 | 4096 | 24576 | 1536 | False | 3959.11 | 78.1079 | 0.0111238 | 0 |
| torch.bfloat16 | 6144 | 24576 | 1536 | False | 5945.33 | 78.0203 | 0.00793664 | 0 |
| torch.bfloat16 | 8192 | 24576 | 1536 | False | 7990.64 | 77.3999 | 0.00629882 | 0 |
| torch.bfloat16 | 10240 | 24576 | 1536 | False | 10011.5 | 77.2203 | 0.00534157 | 0 |

All tests pass with zero error, confirming the correctness of the tuned configurations.
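The TFLOPS column is consistent with the standard GEMM flop count of 2·M·N·K divided by the measured time; the snippet below cross-checks one row from the table above (the function name is just a local helper, not part of the test suite).

```python
def gemm_tflops(m: int, n: int, k: int, time_us: float) -> float:
    """Effective throughput of an (M x K) @ (K x N) GEMM: 2*M*N*K FLOPs over the runtime."""
    return 2 * m * n * k / (time_us * 1e-6) / 1e12

# Cross-check the M=4096 row: 3959.11 us should give about 78.1079 TFLOPS.
print(round(gemm_tflops(4096, 24576, 1536, 3959.11), 4))
```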

Submission Checklist

@github-actions

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or `gh pr edit 1 --add-label <label>`.

