[CPU] Enable DA8W4 on CPU #2128

Open
wants to merge 23 commits into main

Conversation


@Xia-Weiwen Xia-Weiwen commented Apr 25, 2025

Summary
This PR enables DA8W4 (dynamically quantized int8 activations with int4 weights) on CPU.

  • It adds a new layout, Int8DynamicActInt4WeightCPULayout, and its implementation.
  • It adds two custom ops:
    • da8w4_linear_prepack_cpu for weight packing
    • da8w4_linear_cpu for the A8W4 GEMM
  • It adds C++ kernels for the two new custom ops.

The feature supports both symmetric and asymmetric quantization of activations.

The ops and kernels are only available when all of the following hold:

  • torchao is built from source with USE_CPP_KERNELS=1 on Linux with an x86 CPU that supports AVX512.
  • torchao is run on Linux with an x86 CPU that supports AVX512.
  • The PyTorch version is >= 2.7.

To get the best performance, one needs a CPU with AMX support.

Implementation details

  • The weight-packing kernel is implemented with AVX512 intrinsics when available; otherwise, a reference path is used.
  • The GEMM kernel uses the at::native::cpublas brgemm utilities from PyTorch core when available.
  • In the GEMM kernel, if M is large (> 4; see the dispatch sketch after this list):
    • if brgemm is available, brgemm is used;
    • otherwise, it falls back to the reference implementation.
  • In the GEMM kernel, if M is small (<= 4):
    • if AVX512_VNNI is available, the kernel uses AVX512_VNNI intrinsics;
    • otherwise, it takes the same path as for large M.
  • All utility functions used in the kernels are implemented with AVX512 when available; otherwise, they fall back to reference implementations.
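
Below is a minimal C++ sketch of the dispatch logic described above. Every name in it (GemmArgs, da8w4_gemm_dispatch, the *_available() helpers, and the gemm_* paths) is a hypothetical placeholder for illustration, not an actual symbol added by this PR.

#include <cstdint>

// Illustrative sketch only; all names are placeholders, not the PR's code.
struct GemmArgs { int64_t M, N, K; /* plus pointers to A, B, scales, zero points, ... */ };

bool brgemm_available();                 // e.g. whether the at::native::cpublas brgemm path can be used
bool avx512_vnni_available();            // runtime CPU-feature check
void gemm_brgemm(const GemmArgs&);       // brgemm-based path (PyTorch core)
void gemm_avx512_vnni(const GemmArgs&);  // AVX512_VNNI intrinsics path
void gemm_ref(const GemmArgs&);          // reference fallback

void da8w4_gemm_dispatch(const GemmArgs& args) {
  constexpr int64_t kSmallM = 4;
  if (args.M <= kSmallM && avx512_vnni_available()) {
    gemm_avx512_vnni(args);   // small M: use AVX512_VNNI intrinsics
  } else if (brgemm_available()) {
    gemm_brgemm(args);        // large M, or small M without VNNI: use brgemm
  } else {
    gemm_ref(args);           // otherwise: reference implementation
  }
}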

Usage

# Imports shown for completeness; the import path of the new CPU layout is assumed and may differ.
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight
from torchao.quantization.quant_primitives import MappingType
from torchao.dtypes import Int8DynamicActInt4WeightCPULayout

quantize_(
    model,
    int8_dynamic_activation_int4_weight(
        group_size=32,  # or 64, 128
        layout=Int8DynamicActInt4WeightCPULayout(),
        act_mapping_type=MappingType.SYMMETRIC,  # or MappingType.ASYMMETRIC
    ),
)

Test plan

pytest test/quantization/test_quant_api.py -k test_8da4w_cpu

pytorch-bot bot commented Apr 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2128

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2c5a799 with merge base 35ffb26:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label Apr 25, 2025
@Xia-Weiwen Xia-Weiwen added the cpu, quantize, and topic: new feature labels Apr 25, 2025
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review April 29, 2025 02:01
@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 April 29, 2025 03:16
@Xia-Weiwen Xia-Weiwen marked this pull request as draft May 7, 2025 01:17
@Xia-Weiwen
Collaborator Author

@leslie-fang-intel This PR has been updated to use a new layout. Please review it again. Thanks.

@Xia-Weiwen Xia-Weiwen changed the title [CPU] enable int8_dynamic_activation_int4_weight with Int4CPULayout [CPU] enable int8_dynamic_activation_int4_weight on CPU May 16, 2025
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review May 16, 2025 05:59
@Xia-Weiwen Xia-Weiwen changed the title [CPU] enable int8_dynamic_activation_int4_weight on CPU [CPU] Add a new layout for int8_dynamic_activation_int4_weight on CPU May 16, 2025
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 Could you please review this PR? Thanks.

2 similar comments
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 Could you please review this PR? Thanks.

@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 Could you please review this PR? Thanks.

@Xia-Weiwen Xia-Weiwen marked this pull request as draft May 21, 2025 02:57
@Xia-Weiwen Xia-Weiwen removed the request for review from jerryzh168 May 21, 2025 02:57
@Xia-Weiwen Xia-Weiwen changed the title [CPU] Add a new layout for int8_dynamic_activation_int4_weight on CPU [CPU] Enable DA8W4 on CPU May 26, 2025
@Xia-Weiwen
Collaborator Author

Hi @leslie-fang-intel Please review this PR again. I have also added the kernel code in this PR. It showed reasonable performance in internal benchmarks. Thanks.

@leslie-fang-intel leslie-fang-intel left a comment

Please also describe how we choose different implementations based on the CPU Info.

if (use_cpublas_checked) {
  return use_cpublas;
}
use_cpublas = at::native::cpublas::could_pack(at::kByte);
Collaborator

This check only returns true when AMX is available, but setup.py only requires VNNI. I think we should align these.

Collaborator Author

Thanks for the comments. We don't need the AMX flag here because the code in torchao does not depend on AMX; brgemm lives in torch core, not here. We do need AVX512_VNNI because we use AVX512_VNNI intrinsics explicitly in torchao.
Another example is the INT8 SDPA implementation, where only the AVX512 flag is used, because it only has AVX512 code in torchao and it also relies on brgemm from torch core.
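
For illustration, gating torchao's own intrinsics on the AVX512_VNNI build flag typically follows the pattern sketched below. Only the CPU_CAPABILITY_AVX512_VNNI macro comes from the code under review; the function, its signature, and its body are hypothetical, and the VNNI path assumes the reduction length K is a multiple of 64 for simplicity.

#include <cstdint>
#include <immintrin.h>

// Hypothetical sketch, not the PR's actual kernel: a u8 x s8 dot product whose
// VNNI path is compiled only when torchao itself is built with AVX512_VNNI.
// brgemm lives in PyTorch core and needs no such flag in torchao.
int32_t dot_u8s8(const uint8_t* a, const int8_t* b, int64_t K) {
#if defined(CPU_CAPABILITY_AVX512_VNNI)
  __m512i acc = _mm512_setzero_si512();
  for (int64_t k = 0; k < K; k += 64) {
    __m512i va = _mm512_loadu_si512(a + k);
    __m512i vb = _mm512_loadu_si512(b + k);
    acc = _mm512_dpbusd_epi32(acc, va, vb);  // VNNI u8*s8 multiply-accumulate into s32
  }
  return _mm512_reduce_add_epi32(acc);
#else
  // Scalar reference path used when the VNNI flag is not set at build time.
  int32_t acc = 0;
  for (int64_t k = 0; k < K; ++k) {
    acc += static_cast<int32_t>(a[k]) * static_cast<int32_t>(b[k]);
  }
  return acc;
#endif
}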

Collaborator

Maybe we can change the name of this function to da8w4_use_cpublas, if it means that packing the weight into the format required by cpublas is needed?

Collaborator Author

Thanks for the comment. I have changed it to cpublas_can_pack.
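
For reference, a cached check along the lines of the snippet quoted in this thread might be structured as below. The could_pack call and the variable names come from the quoted snippet; the function body, the header paths, and the caching boilerplate are an illustrative reconstruction, not necessarily the exact code in the PR.

#include <ATen/ATen.h>
#include <ATen/native/CPUBlas.h>  // header path assumed

// Returns whether the weight can be packed for the cpublas brgemm path;
// the result is computed once and cached.
static bool cpublas_can_pack() {
  static bool use_cpublas_checked = false;
  static bool use_cpublas = false;
  if (use_cpublas_checked) {
    return use_cpublas;
  }
  use_cpublas = at::native::cpublas::could_pack(at::kByte);
  use_cpublas_checked = true;
  return use_cpublas;
}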

@Xia-Weiwen
Collaborator Author

Please also describe how we choose different implementations based on the CPU Info.

I have added more details in the description. Thanks.

}
#endif

#if defined(CPU_CAPABILITY_AVX512_VNNI)
Collaborator

Will it be an issue if a user builds the package on a platform with VNNI support but runs it on a legacy platform?

Collaborator Author

Thanks for the comment. I have added a runtime check.
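
For illustration, such a runtime check could look like the sketch below, using the cpuinfo library as an example; this shows the idea only and is not necessarily the check added in this PR.

#include <cpuinfo.h>

// Illustrative only: gate the VNNI code path at runtime so a binary built with
// AVX512_VNNI support still falls back gracefully on older CPUs.
static bool cpu_has_avx512_vnni() {
  return cpuinfo_initialize() && cpuinfo_has_x86_avx512vnni();
}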

@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 June 6, 2025 01:49
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review June 6, 2025 01:49
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168 Could you please review this PR? Thanks. It's changed a lot since your last review.
