
[gfx1201] add tuned ck_gemm_a8w8_blockscale configs for various qwen3 models and default case#1

Closed
big-yellow-duck wants to merge 1 commit into main from rdna4-quant-support

Conversation

@big-yellow-duck

Motivation

Add tuned configs for the gfx1201 ck_gemm_a8w8_blockscale kernel, targeting various Qwen3 model variants. gfx1201 supports the FP8 dtype, so these tuned configs speed up gemm_a8w8_blockscale for inference in vLLM.

Technical Details

Tuning Process

The tuning was performed using the CK GEMM tuner:

```shell
cd csrc/ck_gemm_a8w8_blockscale
python gemm_a8w8_blockscale_tune.py \
    -i /app/aiter/aiter/configs/a8w8_blockscale_untuned_gemm.csv \
    -o <output.csv>
```

The tuned configurations are added to the existing GEMM configuration files and are automatically selected based on the input tensor dimensions and the target architecture (gfx1201).
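Conceptually, shape-based selection amounts to a lookup keyed on the problem dimensions with a fallback when no tuned entry matches. The sketch below illustrates that idea only; the keys and kernel instance names are hypothetical, not the actual AITER config entries or dispatch code.

```python
# Hypothetical sketch of shape-based kernel config dispatch.
# Keys and kernel names are illustrative, not real AITER entries.
TUNED_CONFIGS = {
    # (M, N, K) -> tuned kernel instance name
    (1, 24576, 1536): "a8w8_blockscale_inst_small_m",
    (32, 24576, 1536): "a8w8_blockscale_inst_mid_m",
}

DEFAULT_CONFIG = "a8w8_blockscale_default"

def select_config(m: int, n: int, k: int) -> str:
    """Return the tuned kernel for an exact (M, N, K) match, else the default."""
    return TUNED_CONFIGS.get((m, n, k), DEFAULT_CONFIG)
```

An untuned shape such as (7, 7, 7) falls through to the default instance, which is why the PR also tunes the default case.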

Test Plan

The tuned kernels were validated using the GEMM test suite:

```shell
python op_tests/test_gemm_a8w8_blockscale.py --ck_preshuffle False
```

Tests cover various matrix dimensions (M: 1-10240, N: 24576, K: 1536) that are representative of Qwen3 inference workloads.
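The M sweep exercised by the results below can be enumerated as follows. This is a sketch reconstructing the shapes that appear in the table, not the shape list defined in `op_tests/test_gemm_a8w8_blockscale.py` itself (a hypothetical helper name is used).

```python
# Hypothetical helper enumerating the (M, N, K) shapes seen in the results table:
# small power-of-two M values, a 96..512 sweep in steps of 32, and large decode/prefill M.
def qwen3_test_shapes(n: int = 24576, k: int = 1536):
    ms = [1, 2, 4, 8, 16, 32, 64] \
        + list(range(96, 513, 32)) \
        + [1024, 2048, 4096, 6144, 8192, 10240]
    return [(m, n, k) for m in ms]
```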

Test Result

| dtype | M | N | K | ck_preshuffle | ck time (us) | ck TFLOPS | ck TB/s | ck err |
|---|---|---|---|---|---|---|---|---|
| torch.bfloat16 | 1 | 24576 | 1536 | False | 62.2419 | 1.21297 | 0.606509 | 0 |
| torch.bfloat16 | 2 | 24576 | 1536 | False | 63.0116 | 2.39631 | 0.599125 | 0 |
| torch.bfloat16 | 4 | 24576 | 1536 | False | 63.6247 | 4.74642 | 0.593399 | 0 |
| torch.bfloat16 | 8 | 24576 | 1536 | False | 64.1253 | 9.41875 | 0.588863 | 0 |
| torch.bfloat16 | 16 | 24576 | 1536 | False | 65.0905 | 18.5582 | 0.58032 | 0 |
| torch.bfloat16 | 32 | 24576 | 1536 | False | 68.2576 | 35.3942 | 0.553754 | 0 |
| torch.bfloat16 | 64 | 24576 | 1536 | False | 80.5625 | 59.9763 | 0.469785 | 0 |
| torch.bfloat16 | 96 | 24576 | 1536 | False | 122.323 | 59.251 | 0.309805 | 0 |
| torch.bfloat16 | 128 | 24576 | 1536 | False | 149.707 | 64.5505 | 0.253464 | 0 |
| torch.bfloat16 | 160 | 24576 | 1536 | False | 191.106 | 63.209 | 0.198814 | 0 |
| torch.bfloat16 | 192 | 24576 | 1536 | False | 225.946 | 64.1546 | 0.168375 | 0 |
| torch.bfloat16 | 224 | 24576 | 1536 | False | 255.417 | 66.211 | 0.14914 | 0 |
| torch.bfloat16 | 256 | 24576 | 1536 | False | 282.715 | 68.3633 | 0.134913 | 0 |
| torch.bfloat16 | 288 | 24576 | 1536 | False | 320.135 | 67.9191 | 0.119297 | 0 |
| torch.bfloat16 | 320 | 24576 | 1536 | False | 354.852 | 68.0824 | 0.107764 | 0 |
| torch.bfloat16 | 352 | 24576 | 1536 | False | 386.663 | 68.7293 | 0.0990251 | 0 |
| torch.bfloat16 | 384 | 24576 | 1536 | False | 415.049 | 69.8496 | 0.0923711 | 0 |
| torch.bfloat16 | 416 | 24576 | 1536 | False | 451.077 | 69.6265 | 0.0851023 | 0 |
| torch.bfloat16 | 448 | 24576 | 1536 | False | 484.864 | 69.7575 | 0.0792736 | 0 |
| torch.bfloat16 | 480 | 24576 | 1536 | False | 516.881 | 70.1105 | 0.0744581 | 0 |
| torch.bfloat16 | 512 | 24576 | 1536 | False | 544.023 | 71.0534 | 0.0708337 | 0 |
| torch.bfloat16 | 1024 | 24576 | 1536 | False | 1038.07 | 74.4739 | 0.0378794 | 0 |
| torch.bfloat16 | 2048 | 24576 | 1536 | False | 2032.53 | 76.072 | 0.02012 | 0 |
| torch.bfloat16 | 4096 | 24576 | 1536 | False | 3959.11 | 78.1079 | 0.0111238 | 0 |
| torch.bfloat16 | 6144 | 24576 | 1536 | False | 5945.33 | 78.0203 | 0.00793664 | 0 |
| torch.bfloat16 | 8192 | 24576 | 1536 | False | 7990.64 | 77.3999 | 0.00629882 | 0 |
| torch.bfloat16 | 10240 | 24576 | 1536 | False | 10011.5 | 77.2203 | 0.00534157 | 0 |

All tests pass with zero error, confirming the correctness of the tuned configurations.
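The TFLOPS column is consistent with the standard GEMM flop count of 2·M·N·K divided by the measured time; the snippet below cross-checks one row from the table above (the function name is just a local helper, not part of the test suite).

```python
def gemm_tflops(m: int, n: int, k: int, time_us: float) -> float:
    """Effective throughput of an (M x K) @ (K x N) GEMM: 2*M*N*K FLOPs over the runtime."""
    return 2 * m * n * k / (time_us * 1e-6) / 1e12

# Cross-check the M=4096 row: 3959.11 us should give about 78.1079 TFLOPS.
print(round(gemm_tflops(4096, 24576, 1536, 3959.11), 4))
```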

Submission Checklist

@github-actions

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or `gh pr edit 1 --add-label <label>`.

