Using generic implementation for 16-bit activations and 8-bit weights for matmul in backends #16008
base: main
Conversation
… for linear in backends

Summary:

# Context
We continue from D84284794 to add support for 16-bit activations. Note that although the operators already support 16-bit activations, they do so only when the weights are also 16-bit. To support mixed dtypes, we need to change the way we template some functions.

# Current Behavior
Right now, we compose two macros: the `ET_FORALL_JARVIS_QUANTIZED_TYPES_WITH_INT16` macro:

https://www.internalfb.com/code/fbsource/[9e8c6d8466107f58aa3de1b9e4ec71c49d670a8f]/fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/generic/operators.h?lines=22-25

and the function macro (`quantized_linear` chosen as an example):

https://www.internalfb.com/code/fbsource/[9e8c6d8466107f58aa3de1b9e4ec71c49d670a8f]/fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/generic/quantized_linear_out.cpp?lines=30-41

Together they expand into a switch statement that calls the `quantized_linear` function with the correct template parameter. However, this assumes that the input activations and weights have the same dtype, which is not the case here.

# This Diff
We finish by using the generic implementation for all the backends and adding e2e tests as well as unit tests.

Reviewed By: hsharma35

Differential Revision: D87946776
…for Conv2D in Backends (pytorch#16007)

Summary:

# Context
We continue from D84284794 to add support for 16-bit activations. Note that although the operators already support 16-bit activations, they do so only when the weights are also 16-bit. To support mixed dtypes, we need to change the way we template some functions.

# Current Behavior
Right now, we compose two macros: the `ET_FORALL_JARVIS_QUANTIZED_TYPES_WITH_INT16` macro:

https://www.internalfb.com/code/fbsource/[9e8c6d8466107f58aa3de1b9e4ec71c49d670a8f]/fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/generic/operators.h?lines=22-25

and the function macro (`quantized_linear` chosen as an example):

https://www.internalfb.com/code/fbsource/[9e8c6d8466107f58aa3de1b9e4ec71c49d670a8f]/fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/generic/quantized_linear_out.cpp?lines=30-41

Together they expand into a switch statement that calls the `quantized_linear` function with the correct template parameter. However, this assumes that the input activations and weights have the same dtype, which is not the case here.

# This Diff
We finish by using the generic implementation for all the backends and adding e2e tests as well as unit tests.

Differential Revision: D87993325
… for matmul in backends

Summary:

# Context
We continue from D84284794 to add support for 16-bit activations. Note that although the operators already support 16-bit activations, they do so only when the weights are also 16-bit. To support mixed dtypes, we need to change the way we template some functions.

# Current Behavior
Right now, we compose two macros: the `ET_FORALL_JARVIS_QUANTIZED_TYPES_WITH_INT16` macro:

https://www.internalfb.com/code/fbsource/[9e8c6d8466107f58aa3de1b9e4ec71c49d670a8f]/fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/generic/operators.h?lines=22-25

and the function macro (`quantized_linear` chosen as an example):

https://www.internalfb.com/code/fbsource/[9e8c6d8466107f58aa3de1b9e4ec71c49d670a8f]/fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/generic/quantized_linear_out.cpp?lines=30-41

Together they expand into a switch statement that calls the `quantized_linear` function with the correct template parameter. However, this assumes that the input activations and weights have the same dtype, which is not the case here.

# This Diff
We now use the generic implementation in the backends and create a unit test as well as e2e tests.

Differential Revision: D87997149
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16008
Pull request overview
This PR adds support for 16-bit activations with 8-bit weights (W8A16) for quantized operations in the Cadence HiFi backend by delegating to generic implementations when the heterogeneous type combination is detected.
Key Changes:
- Modified backend operators (matmul, linear, conv2d) to detect and handle int16 activation + int8 weight combinations
- Added build dependencies on generic operator implementations for W8A16 support
- Created comprehensive unit tests for the new W8A16 functionality
- Added new quantizer classes for 16-bit activation support in conv and matmul operations
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| backends/cadence/hifi/operators/op_quantized_matmul_out.cpp | Added W8A16 type check and generic implementation call for int16 activations |
| backends/cadence/hifi/operators/op_quantized_linear_out.cpp | Added W8A16 type checks for both per-channel and per-tensor variants |
| backends/cadence/hifi/operators/op_quantized_conv2d_nchw_out.cpp | Added W8A16 handling for NCHW format convolutions |
| backends/cadence/hifi/operators/op_quantized_conv2d_nhwc_out.cpp | Added W8A16 handling for NHWC format convolutions |
| backends/cadence/hifi/operators/targets.bzl | Updated build configuration with generic operator dependencies and removed redundant operator entries |
| backends/cadence/hifi/operators/op_quantized_matmul_out.h | Added public header for quantized_matmul_out function declaration |
| backends/cadence/hifi/operators/tests/test_op_quantized_matmul_out.cpp | Added unit tests for W8A16 matmul with transposed and non-transposed cases |
| backends/cadence/hifi/operators/tests/test_op_quantized_linear_out.cpp | Added unit test for W8A16 linear operation |
| backends/cadence/hifi/operators/tests/test_op_quantized_conv2d_out.cpp | Added unit tests for W8A16 conv2d in both NCHW and NHWC formats |
| backends/cadence/aot/quantizer/quantizer.py | Added quantizer classes for 16-bit activation support in conv and matmul patterns |
Copilot AI · Nov 28, 2025

Code under review:

```cpp
}
else if (out.scalar_type() == executorch::aten::ScalarType::Byte) {
```

Missing space after closing brace on line 238. The `else` should be on the same line as the closing brace of the preceding `if` block, or there should be consistent formatting with other similar conditionals in the file.

Suggested change:

```cpp
} else if (out.scalar_type() == executorch::aten::ScalarType::Byte) {
```
Copilot AI · Nov 28, 2025

Code under review:

```cpp
}
else if (out.scalar_type() == executorch::aten::ScalarType::Byte) {
```

Missing space after closing brace on line 296. The `else` should be on the same line as the closing brace of the preceding `if` block, or there should be consistent formatting with other similar conditionals in the file.

Suggested change:

```cpp
} else if (out.scalar_type() == executorch::aten::ScalarType::Byte) {
```
Copilot AI · Nov 28, 2025

Code under review:

```cpp
// With all ones input and weights, and kernel size 3x5=15 * 8 channels = 120
// After applying out_multiplier (0.5 * 2^31), the value is scaled by 0.5
// Expected value: 120 * 0.5 = 60
Tensor expected = tf_int16.full({1, 16, 18, 24}, 120);
```

The expected output value (120) doesn't match the comment's calculation (120 * 0.5 = 60). Based on the comment logic and the out_multiplier value, the expected tensor should contain 60, not 120.

Suggested change:

```cpp
Tensor expected = tf_int16.full({1, 16, 18, 24}, 60);
```
Copilot AI · Nov 28, 2025

Code under review:

```cpp
// With all ones input and weights, and kernel size 3x5=15 * 8 channels = 120
// After applying out_multiplier (0.5 * 2^31), the value is scaled by 0.5
// Expected value: 120 * 0.5 = 60
Tensor expected = tf_int16.full({1, 18, 24, 16}, 120);
```

The expected output value (120) doesn't match the comment's calculation (120 * 0.5 = 60). Based on the comment logic and the out_multiplier value, the expected tensor should contain 60, not 120.

Suggested change:

```cpp
Tensor expected = tf_int16.full({1, 18, 24, 16}, 60);
```
Copilot AI · Nov 28, 2025

Code under review:

```cpp
  return;
}

bool optimized = 0;
```

Variable declaration for `optimized` appears after the early return, but it was previously at the beginning of the function. This creates inconsistent code structure, and the variable is now unreachable when the W8A16 path is taken. Consider moving this declaration back before the W8A16 conditional check for consistency with the original code structure.
Copilot AI · Nov 28, 2025

Code under review:

```cpp
// Using simple values for testing
Tensor X = tf_int16.ones({64, 33});
Tensor Y = tf_int8.ones({33, 128});
// Bias not used
```

Comment states 'Bias not used' but a bias tensor is created and passed to the function. Either remove the misleading comment or clarify that while bias is passed, it may not affect the expected output in this test case.

Suggested change:

```cpp
// Bias is passed but does not affect the expected output in this test case
```
Copilot AI · Nov 28, 2025

Code under review:

```cpp
// Bias not used
Tensor bias = tf_int32.full({128}, -30);
```

Comment states 'Bias not used' but a bias tensor is created and passed to the function. Either remove the misleading comment or clarify that while bias is passed, it may not affect the expected output in this test case.
Summary:

Context

We continue from D84284794 to add support for 16-bit activations. Note that although the operators already support 16-bit activations, they do so only when the weights are also 16-bit. To support mixed dtypes, we need to change the way we template some functions.

Current Behavior

Right now, we compose two macros: the `ET_FORALL_JARVIS_QUANTIZED_TYPES_WITH_INT16` macro:

https://www.internalfb.com/code/fbsource/[9e8c6d8466107f58aa3de1b9e4ec71c49d670a8f]/fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/generic/operators.h?lines=22-25

and the function macro (`quantized_linear` chosen as an example):

https://www.internalfb.com/code/fbsource/[9e8c6d8466107f58aa3de1b9e4ec71c49d670a8f]/fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/generic/quantized_linear_out.cpp?lines=30-41

Together they expand into a switch statement that calls the `quantized_linear` function with the correct template parameter. However, this assumes that the input activations and weights have the same dtype, which is not the case here.

This Diff

We now use the generic implementation in the backends and create a unit test as well as e2e tests.

Differential Revision: D87997149