[CPU] Enable DA8W4 on CPU #2128

Open
wants to merge 23 commits into main

Conversation


@Xia-Weiwen Xia-Weiwen commented Apr 25, 2025

Summary
This PR enables DA8W4 (dynamically quantized int8 activations with int4 weights) on CPU.

  • It adds a new layout, Int8DynamicActInt4WeightCPULayout, and its implementation.
  • It adds two custom ops:
    • da8w4_linear_prepack_cpu for weight packing
    • da8w4_linear_cpu for the A8W4 GEMM
  • It adds C++ kernels for the two new custom ops.

The feature supports both symmetric and asymmetric quantization of activations.

The ops and kernels are only available when all of the following hold:

  • torchao is built from source with USE_CPP_KERNELS=1 on Linux with an x86 CPU that supports AVX512.
  • torchao is run on Linux with an x86 CPU that supports AVX512.
  • The PyTorch version is >= 2.7.

To get the best performance, one needs a CPU with AMX support.

Implementation details

  • The weight-packing kernel is implemented with AVX512 intrinsics when available; otherwise, a reference path is used.
  • The GEMM kernel uses the at::native::cpublas brgemm utilities from PyTorch core when available.
  • In the GEMM kernel, if M is large (> 4; see the dispatch sketch after this list):
    • if brgemm is available, brgemm is used;
    • otherwise, it falls back to the reference implementation.
  • In the GEMM kernel, if M is small (<= 4):
    • if AVX512_VNNI is available, the kernel uses AVX512_VNNI intrinsics;
    • otherwise, it takes the same path as for large M.
  • All utility functions used in the kernels are implemented with AVX512 when available; otherwise, they fall back to reference implementations.
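
Below is a minimal C++ sketch of the dispatch logic described above. Every name in it (GemmArgs, da8w4_gemm_dispatch, the *_available() helpers, and the gemm_* paths) is a hypothetical placeholder for illustration, not an actual symbol added by this PR.

#include <cstdint>

// Illustrative sketch only; all names are placeholders, not the PR's code.
struct GemmArgs { int64_t M, N, K; /* plus pointers to A, B, scales, zero points, ... */ };

bool brgemm_available();                 // e.g. whether the at::native::cpublas brgemm path can be used
bool avx512_vnni_available();            // runtime CPU-feature check
void gemm_brgemm(const GemmArgs&);       // brgemm-based path (PyTorch core)
void gemm_avx512_vnni(const GemmArgs&);  // AVX512_VNNI intrinsics path
void gemm_ref(const GemmArgs&);          // reference fallback

void da8w4_gemm_dispatch(const GemmArgs& args) {
  constexpr int64_t kSmallM = 4;
  if (args.M <= kSmallM && avx512_vnni_available()) {
    gemm_avx512_vnni(args);   // small M: use AVX512_VNNI intrinsics
  } else if (brgemm_available()) {
    gemm_brgemm(args);        // large M, or small M without VNNI: use brgemm
  } else {
    gemm_ref(args);           // otherwise: reference implementation
  }
}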

Usage

# Imports shown for completeness; the import path of the new CPU layout is assumed and may differ.
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight
from torchao.quantization.quant_primitives import MappingType
from torchao.dtypes import Int8DynamicActInt4WeightCPULayout

quantize_(
    model,
    int8_dynamic_activation_int4_weight(
        group_size=32,  # or 64, 128
        layout=Int8DynamicActInt4WeightCPULayout(),
        act_mapping_type=MappingType.SYMMETRIC,  # or MappingType.ASYMMETRIC
    ),
)

Test plan

pytest test/quantization/test_quant_api.py -k test_8da4w_cpu

pytorch-bot bot commented Apr 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2128

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2c5a799 with merge base 35ffb26:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label Apr 25, 2025
@Xia-Weiwen Xia-Weiwen added the cpu, quantize, and topic: new feature labels Apr 25, 2025
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review April 29, 2025 02:01
@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 April 29, 2025 03:16
@Xia-Weiwen Xia-Weiwen marked this pull request as draft May 7, 2025 01:17
@Xia-Weiwen
Collaborator Author

@leslie-fang-intel This PR has been updated to use a new layout. Please review it again. Thanks.

@Xia-Weiwen Xia-Weiwen changed the title [CPU] enable int8_dynamic_activation_int4_weight with Int4CPULayout [CPU] enable int8_dynamic_activation_int4_weight on CPU May 16, 2025
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review May 16, 2025 05:59
@Xia-Weiwen Xia-Weiwen changed the title [CPU] enable int8_dynamic_activation_int4_weight on CPU [CPU] Add a new layout for int8_dynamic_activation_int4_weight on CPU May 16, 2025
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 Could you please review this PR? Thanks.

2 similar comments
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 Could you please review this PR? Thanks.

@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 Could you please review this PR? Thanks.

@Xia-Weiwen Xia-Weiwen marked this pull request as draft May 21, 2025 02:57
@Xia-Weiwen Xia-Weiwen removed the request for review from jerryzh168 May 21, 2025 02:57
@Xia-Weiwen Xia-Weiwen changed the title [CPU] Add a new layout for int8_dynamic_activation_int4_weight on CPU [CPU] Enable DA8W4 on CPU May 26, 2025
@Xia-Weiwen
Collaborator Author

Hi @leslie-fang-intel Please review this PR again. I have also added the kernel code in this PR. It showed reasonable performance in internal benchmarks. Thanks.

@leslie-fang-intel leslie-fang-intel left a comment

Please also describe how we choose different implementations based on the CPU Info.

if (use_cpublas_checked) {
  return use_cpublas;
}
use_cpublas = at::native::cpublas::could_pack(at::kByte);
Collaborator

This check only returns true when AMX is available, but setup.py only requires VNNI. I think we should align these.

Collaborator Author

Thanks for the comments. We don't need the AMX flag here because the code in torchao does not depend on AMX; brgemm lives in torch core, not here. We do need AVX512_VNNI because we use AVX512_VNNI intrinsics explicitly in torchao.
Another example is the INT8 SDPA implementation, where only the AVX512 flag is used, because it only has AVX512 code in torchao and it also relies on brgemm from torch core.
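
For illustration, gating torchao's own intrinsics on the AVX512_VNNI build flag typically follows the pattern sketched below. Only the CPU_CAPABILITY_AVX512_VNNI macro comes from the code under review; the function, its signature, and its body are hypothetical, and the VNNI path assumes the reduction length K is a multiple of 64 for simplicity.

#include <cstdint>
#include <immintrin.h>

// Hypothetical sketch, not the PR's actual kernel: a u8 x s8 dot product whose
// VNNI path is compiled only when torchao itself is built with AVX512_VNNI.
// brgemm lives in PyTorch core and needs no such flag in torchao.
int32_t dot_u8s8(const uint8_t* a, const int8_t* b, int64_t K) {
#if defined(CPU_CAPABILITY_AVX512_VNNI)
  __m512i acc = _mm512_setzero_si512();
  for (int64_t k = 0; k < K; k += 64) {
    __m512i va = _mm512_loadu_si512(a + k);
    __m512i vb = _mm512_loadu_si512(b + k);
    acc = _mm512_dpbusd_epi32(acc, va, vb);  // VNNI u8*s8 multiply-accumulate into s32
  }
  return _mm512_reduce_add_epi32(acc);
#else
  // Scalar reference path used when the VNNI flag is not set at build time.
  int32_t acc = 0;
  for (int64_t k = 0; k < K; ++k) {
    acc += static_cast<int32_t>(a[k]) * static_cast<int32_t>(b[k]);
  }
  return acc;
#endif
}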

Collaborator

Maybe we can change the name of this function to da8w4_use_cpublas, if it means that packing the weight into the format required by cpublas is needed?

Collaborator Author

Thanks for the comment. I have changed it to cpublas_can_pack.
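
For reference, a cached check along the lines of the snippet quoted in this thread might be structured as below. The could_pack call and the variable names come from the quoted snippet; the function body, the header paths, and the caching boilerplate are an illustrative reconstruction, not necessarily the exact code in the PR.

#include <ATen/ATen.h>
#include <ATen/native/CPUBlas.h>  // header path assumed

// Returns whether the weight can be packed for the cpublas brgemm path;
// the result is computed once and cached.
static bool cpublas_can_pack() {
  static bool use_cpublas_checked = false;
  static bool use_cpublas = false;
  if (use_cpublas_checked) {
    return use_cpublas;
  }
  use_cpublas = at::native::cpublas::could_pack(at::kByte);
  use_cpublas_checked = true;
  return use_cpublas;
}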

@Xia-Weiwen
Collaborator Author

Please also describe how we choose different implementations based on the CPU Info.

I have added more details in the description. Thanks.

}
#endif

#if defined(CPU_CAPABILITY_AVX512_VNNI)
Collaborator

Will it be an issue if a user builds the package on a platform with VNNI support but runs it on a legacy platform?

Collaborator Author

Thanks for the comment. I have added a runtime check.
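
For illustration, such a runtime check could look like the sketch below, using the cpuinfo library as an example; this shows the idea only and is not necessarily the check added in this PR.

#include <cpuinfo.h>

// Illustrative only: gate the VNNI code path at runtime so a binary built with
// AVX512_VNNI support still falls back gracefully on older CPUs.
static bool cpu_has_avx512_vnni() {
  return cpuinfo_initialize() && cpuinfo_has_x86_avx512vnni();
}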

@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 June 6, 2025 01:49
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review June 6, 2025 01:49
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 Could you please review this PR? Thanks. It's changed a lot since your last review.
