[ET-VK][Ops] linear_qta8a_qga4w_qta8o impl and shaders #12376
Conversation
Pull Request resolved: #12005

# Context

This test framework establishes the foundation for validating the `linear_qta8a_qga4w` operator implementation as part of enabling dynamic quantization. The motivation is to advance beyond weight-only quantization to fully activation- and weight-quantized linear operations, enabling true integer arithmetic throughout the matrix multiplication for improved performance on GPU hardware. The current weight-only quantized linear implementations in ET-VK dequantize weights to floating point before computation, missing the performance benefits of integer arithmetic.

Operator nomenclature breakdown:
- **qta8a**: Quantized per-token affine 8-bit activation inputs
- **qga4w**: Quantized per-group affine 4-bit weights

# Changes

The reference implementation (`linear_qta8a_qga4w_4bit_dequant_impl`) provides a baseline for validating the GPU shader implementation through a deliberately simplified computation path. The quantized int8 input tensor is dequantized using the standard affine transformation `(quantized_input.to(at::kFloat) - input_zero_point) * input_scale`. After dequantization, the implementation performs a standard floating-point linear operation, `at::linear(x_float, weights_dequantized)`. This two-stage dequantize → compute approach provides a clear reference against which the GPU's integer-arithmetic implementation can be validated.

ghstack-source-id: 295393632
@exported-using-ghexport

Differential Revision: [D77173442](https://our.internmc.facebook.com/intern/diff/D77173442/)
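As a rough illustration of that reference path (a minimal sketch, not the actual test code in the PR; tensor layouts, parameter shapes, and the per-token broadcast are assumptions), the dequantize-then-compute baseline might look like:

```cpp
// Sketch of the dequantize -> float linear reference path described above.
// Names and shapes are assumptions for illustration only.
#include <ATen/ATen.h>

at::Tensor linear_qta8a_qga4w_reference(
    const at::Tensor& quantized_input,      // int8, [num_tokens, in_features]
    const at::Tensor& input_scale,          // float, per-token, [num_tokens]
    const at::Tensor& input_zero_point,     // int,   per-token, [num_tokens]
    const at::Tensor& weights_dequantized)  // float, [out_features, in_features]
{
  // Affine dequantization of the int8 activations:
  // real = (q - zero_point) * scale, broadcast per token.
  at::Tensor x_float =
      (quantized_input.to(at::kFloat) -
       input_zero_point.unsqueeze(-1).to(at::kFloat)) *
      input_scale.unsqueeze(-1);

  // Plain floating-point linear as the baseline against which the
  // integer-arithmetic GPU shader output is compared.
  return at::linear(x_float, weights_dequantized);
}
```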
Pull Request resolved: #12006

# Operator Description

The linear_qta8a_qga4w operator implements a quantized linear transformation that enables efficient neural network inference through dynamic quantization. It performs matrix multiplication between quantized 8-bit activations and 4-bit group-quantized weights, producing quantized 8-bit outputs.

The quantization scheme follows the standard affine mapping `real_value = scale * (quantized_value - zero_point)`. Input activations use 8-bit signed integers with per-token scale and zero-point parameters, while weights use 4-bit quantization with group-wise parameters.

# Implementation Architecture

The operator provides two distinct computational approaches optimized for different matrix multiplication scenarios: the TILED algorithm for general matrix-matrix multiplication (GEMM) and the COOPERATIVE algorithm for matrix-vector multiplication (GEMV).

## TILED Algorithm (GEMM Cases)

The tiled implementation processes the output matrix in rectangular blocks. Each thread computes a tile of output values, typically 3 rows by 2 columns of results per iteration. Each thread loads blocks of quantized weights and activations, performs integer-arithmetic accumulation, and then applies the necessary scaling operations.

Weight data is pre-packed in a specialized format in which two 4-bit values are stored in each byte. Each thread loads multiple weight elements at once and unpacks them during computation. The quantization parameters for weights are organized by groups: each group of consecutive weight elements shares the same scale and zero-point values. A sketch of this unpacking and group-wise dequantization appears after this section.

## COOPERATIVE Algorithm (GEMV Cases)

The cooperative implementation uses shared memory and thread cooperation, with workgroups of 64 threads arranged as 8 groups of 8 workers each. The key insight is that GEMV operations have limited parallelism in the output dimension but substantial parallelism in the reduction dimension, making cooperative reduction strategies more effective than independent per-thread computation. Each group of 8 worker threads collaboratively computes a portion of the output vector, dividing the reduction work along the input feature dimension so that each worker processes every 8th element in a strided pattern.

# Future Performance Improvements

- Make use of dotPacked4x8EXT (this requires upgrading glslc and Vulkan)
- Use fixed-point math for pure integer operations
- It might be more performant to avoid preloading tensors
- It might also be more performant to avoid excessive register overhead by defining the ivec4 within each block operation (allowing more threads to run even though each is more register-intensive)

ghstack-source-id: 295447206

Differential Revision: [D77173441](https://our.internmc.facebook.com/intern/diff/D77173441/)
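To make the packed-weight handling concrete, here is a minimal sketch (CPU-side, not the shader code from the PR; the nibble packing order, parameter names, and layout are assumptions) of unpacking two 4-bit weights from one byte and applying the group-wise affine dequantization:

```cpp
// Sketch of 4-bit weight unpacking with group-wise scale/zero-point,
// mirroring the scheme described above. Packing order is an assumption.
#include <cstdint>

// Dequantize the weight at input-feature index `k` for one output row.
// `packed` holds two 4-bit values per byte; assume the low nibble is the
// even column and the high nibble is the odd column.
inline float dequantize_4bit_weight(
    const uint8_t* packed,     // packed 4-bit weights for this output row
    int k,                     // input-feature (column) index
    const float* group_scale,  // one scale per group
    const int* group_zero,     // one zero-point per group
    int group_size) {          // number of consecutive weights per group
  uint8_t byte = packed[k / 2];
  int q = (k % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
  int g = k / group_size;
  // Standard affine mapping: real = scale * (q - zero_point).
  return group_scale[g] * static_cast<float>(q - group_zero[g]);
}
```

In the actual shaders the unpacking happens on vectorized loads inside the tile loop, but the arithmetic per element is the same affine mapping.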
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12376

Note: Links to docs will display an error until the docs builds have been completed. ⏳ 6 Pending, 4 Unrelated Failures as of commit e899886 with merge base edf25c4. BROKEN TRUNK: the failing jobs were also failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #12006 by @ahmtox
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/ahmtox/25/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/ahmtox/25/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/ahmtox/24/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/ahmtox/25/orig
@diff-train-skip-merge