Parallelize bf16->f32 conversion for gemm(bf16:bf16->bf16) #147864
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/147864
✅ No failures as of commit 257b0df with merge base d0f08dc. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
aten/src/ATen/native/CPUBlas.cpp (Outdated)
}
at::parallel_for(0, c_size, 1, [&](int64_t begin, int64_t end) {
  for (const auto i : c10::irange(begin, end)) {
    *(c++) = c10::convert<at::BFloat16>(float_v[i]);
Would it even be faster if we do a vectorized type cast here?
I'd say it would be faster. I'm looking at how to plug at::vec into this.
@jgong5 please take a look.
*(c++) = c10::convert<at::BFloat16>(float_v[i]);
int64_t i = begin;
// Vectorized loop
for (; i + c_size <= end; i += c_size) {
This doesn't make sense: it will only ever take at most one trip, since c_size is the upper bound for the loop.
Suggested change:
- for (; i + c_size <= end; i += c_size) {
+ for (; i + c_size <= end; i += Vectorized<float>::size()) {
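The suggestion above strides the main loop by the vector width instead of the range bound, which also implies a scalar tail for the leftover elements. A minimal self-contained sketch of that shape, where the hypothetical constant `kLanes` stands in for `Vectorized<float>::size()` and `f32_to_bf16` stands in for `c10::convert<at::BFloat16>` (round-to-nearest-even truncation, as bfloat16 conversion does):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for Vectorized<float>::size().
constexpr int64_t kLanes = 8;

// Stand-in for c10::convert<at::BFloat16>: round float32 to nearest-even
// bfloat16 and keep the high 16 bits.
static uint16_t f32_to_bf16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  uint32_t rounding = 0x7FFF + ((bits >> 16) & 1);
  return static_cast<uint16_t>((bits + rounding) >> 16);
}

// Convert src[begin, end): main loop strides by the lane count, scalar tail
// handles the remainder -- the loop shape the review suggests.
void convert_range(const float* src, uint16_t* dst, int64_t begin, int64_t end) {
  int64_t i = begin;
  for (; i + kLanes <= end; i += kLanes) {    // "vectorized" body
    for (int64_t j = 0; j < kLanes; ++j)      // written lane-by-lane here;
      dst[i + j] = f32_to_bf16(src[i + j]);   // a real kernel would loadu/store
  }
  for (; i < end; ++i)                        // scalar tail
    dst[i] = f32_to_bf16(src[i]);
}
```

With a range length that is not a multiple of the lane count (say 11 elements), the first loop runs once and the tail loop converts the remaining three.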
  for (auto cv : float_v) {
    *(c++) = c10::convert<at::BFloat16>(cv);
  }
at::parallel_for(0, c_size, 1, [&](int64_t begin, int64_t end) {
Usually the grain size would be at::internal::GRAIN_SIZE, which avoids introducing threading overhead for very small tensors.
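The grain size is the minimum amount of work worth giving to one thread; with a grain of 1, even tiny conversions get fanned out across threads. A hedged sketch of the chunk-count arithmetic (the constant mirrors ATen's GRAIN_SIZE from ATen/Parallel.h, but the function is illustrative, not ATen's actual scheduler):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative constant; ATen defines at::internal::GRAIN_SIZE = 32768.
constexpr int64_t kGrainSize = 32768;

// How many chunks a range of n elements would be split into: never more
// chunks than threads, and never chunks smaller than the grain.
int64_t num_chunks(int64_t n, int64_t grain, int64_t max_threads) {
  if (n <= 0) return 0;
  if (grain <= 0) grain = 1;
  return std::min(max_threads, (n + grain - 1) / grain);
}
```

With grain 1, a 100-element conversion is split across all threads; with kGrainSize it stays on a single thread, which is the overhead the comment warns about.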
int64_t i = begin;
// Vectorized loop
for (; i + c_size <= end; i += c_size) {
  auto a_vec = at::vec::Vectorized<float>::loadu(&float_v[i]); // Load vec_size floats
Using Vectorized outside of the ATen/native/cpu/ directory will only use SSE. You would need to have a CPU kernel behind a DispatchStub to get AVX2 or AVX512 support.
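The DispatchStub mechanism works by compiling files under native/cpu/ once per instruction set and registering each build's kernel pointer, with the stub calling the best one the CPU supports. A simplified stand-alone sketch of that idea (all names here are illustrative, not the real ATen macros such as DECLARE_DISPATCH/REGISTER_DISPATCH):

```cpp
#include <cassert>

// Signature of the conversion kernel the stub dispatches to.
using ConvertFn = void (*)(const float*, unsigned short*, long long);

// Each ISA-specific compilation unit would fill in its slot; the stub then
// selects the widest implementation the running CPU supports.
struct ConvertStub {
  ConvertFn default_impl = nullptr; // scalar / SSE build
  ConvertFn avx2_impl = nullptr;    // registered by the AVX2 build
  ConvertFn avx512_impl = nullptr;  // registered by the AVX-512 build

  ConvertFn choose(bool has_avx2, bool has_avx512) const {
    if (has_avx512 && avx512_impl) return avx512_impl;
    if (has_avx2 && avx2_impl) return avx2_impl;
    return default_impl;
  }
};
```

The point of the comment above is that code outside native/cpu/ is only built once, with the baseline flags, so a Vectorized loop placed in CPUBlas.cpp never gets the wider builds to dispatch to.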
Closed in favour of this: OpenMathLib/OpenBLAS#5155
Improves performance for at::addmm / linear kernels when executed in dtype=bfloat16 and when SBGEMM is available.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01