Parallelize bf16->f32 conversion for gemm(bf16:bf16->bf16) #147864


Closed
wants to merge 2 commits

Conversation

aditew01
Collaborator

@aditew01 aditew01 commented Feb 25, 2025

Improves performance of at::addmm / linear kernels when executed with dtype=bfloat16 and SBGEMM is available.
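
For context, a minimal sketch of the affected call (hypothetical shapes, not part of this PR): on SBGEMM-enabled builds the bf16 GEMM accumulates in fp32, and the result is then cast back to bf16, which is the loop this PR parallelizes.

    // Sketch only: bf16 addmm on CPU, i.e. gemm(bf16:bf16->bf16).
    #include <ATen/ATen.h>

    int main() {
      auto a    = at::randn({1024, 1024}, at::kBFloat16);
      auto b    = at::randn({1024, 1024}, at::kBFloat16);
      auto bias = at::randn({1024}, at::kBFloat16);
      // On SBGEMM builds, the fp32 result buffer is converted back to bf16 here.
      auto out  = at::addmm(bias, a, b);
      return 0;
    }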

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01

@aditew01 aditew01 added module: cpu (CPU specific problem, e.g. perf, algorithm), module: arm (Related to ARM architectures builds of PyTorch; includes Apple M1), and topic: not user facing labels Feb 25, 2025

pytorch-bot bot commented Feb 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147864

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 257b0df with merge base d0f08dc:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@aditew01 aditew01 requested review from jgong5, malfet and peterbell10 and removed request for jgong5 and malfet February 25, 2025 17:04
    }
    at::parallel_for(0, c_size, 1, [&](int64_t begin, int64_t end) {
      for (const auto i : c10::irange(begin, end)) {
        *(c++) = c10::convert<at::BFloat16>(float_v[i]);
Collaborator

Would it even be faster if we do a vectorized type cast here?

Collaborator Author

I'd say it'd be faster. I'm looking at how to plug aten::vec into this.
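
For reference, a rough sketch of what plugging in at::vec could look like, assuming at::vec::convert_float_bfloat16 is usable here and reusing the float_v / c / c_size names from the snippet above (illustrative only, not this PR's implementation):

    // Sketch only: vectorized f32 -> bf16 cast with a scalar tail loop.
    using fVec = at::vec::Vectorized<float>;
    at::parallel_for(0, c_size, 1, [&](int64_t begin, int64_t end) {
      int64_t i = begin;
      for (; i + 2 * fVec::size() <= end; i += 2 * fVec::size()) {
        fVec f0 = fVec::loadu(&float_v[i]);
        fVec f1 = fVec::loadu(&float_v[i + fVec::size()]);
        // Pack two float vectors into one bf16 vector and store it.
        auto bf = at::vec::convert_float_bfloat16(f0, f1);
        bf.store(&c[i]);
      }
      for (; i < end; ++i) {  // scalar tail for the remainder
        c[i] = c10::convert<at::BFloat16>(float_v[i]);
      }
    });

Indexing with c[i] instead of advancing a shared *(c++) pointer would also keep each parallel chunk independent.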

Collaborator Author

@jgong5 please take a look.

@aditew01 aditew01 requested a review from jgong5 February 28, 2025 17:10
    *(c++) = c10::convert<at::BFloat16>(float_v[i]);
    int64_t i = begin;
    // Vectorized loop
    for (; i + c_size <= end; i += c_size) {
Collaborator

This doesn't make sense; it will only ever take at most one trip, since c_size is the upper bound of the loop.

Suggested change:
-   for (; i + c_size <= end; i += c_size) {
+   for (; i + c_size <= end; i += Vectorized<float>::size()) {

    for (auto cv : float_v) {
      *(c++) = c10::convert<at::BFloat16>(cv);
    }
    at::parallel_for(0, c_size, 1, [&](int64_t begin, int64_t end) {
Collaborator

Usually the grain size would be at::internal::GRAIN_SIZE, which avoids introducing threading overhead for very small tensors.
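
For illustration, a sketch of the same call with the default grain size (names as in the snippet above):

    // Sketch: at::internal::GRAIN_SIZE keeps small conversions single-threaded.
    at::parallel_for(0, c_size, at::internal::GRAIN_SIZE, [&](int64_t begin, int64_t end) {
      for (const auto i : c10::irange(begin, end)) {
        c[i] = c10::convert<at::BFloat16>(float_v[i]);
      }
    });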

    int64_t i = begin;
    // Vectorized loop
    for (; i + c_size <= end; i += c_size) {
      auto a_vec = at::vec::Vectorized<float>::loadu(&float_v[i]); // load vec_size floats
Collaborator

Using Vectorized outside of the ATen/native/cpu/ directory will only use SSE. You would need to have a CPU kernel behind a DispatchStub to get AVX2 or AVX512 support.
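
A rough sketch of that DispatchStub pattern, with hypothetical names (the actual header locations and signature would need to follow how other stubs are wired up in ATen/native/DispatchStub.h):

    // In a shared header, inside namespace at::native (hypothetical names):
    using bf16_cast_fn = void (*)(const float* src, at::BFloat16* dst, int64_t n);
    DECLARE_DISPATCH(bf16_cast_fn, bf16_cast_stub);

    // At the call site, outside ATen/native/cpu/:
    DEFINE_DISPATCH(bf16_cast_stub);
    // bf16_cast_stub(kCPU, float_v, c, c_size);

    // In ATen/native/cpu/*.cpp, compiled once per ISA, so AVX2/AVX512 variants exist:
    static void bf16_cast_kernel(const float* src, at::BFloat16* dst, int64_t n) {
      // vectorized conversion using at::vec::Vectorized<float>, as sketched above
    }
    REGISTER_DISPATCH(bf16_cast_stub, &bf16_cast_kernel);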

@janeyx99 janeyx99 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Mar 3, 2025
@aditew01
Collaborator Author

Closing in favour of this: OpenMathLib/OpenBLAS#5155

@aditew01 aditew01 closed this Mar 17, 2025