-
Notifications
You must be signed in to change notification settings - Fork 62
Fix issue in recompiling kernel with double GRF mode. #4287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a shadowing issue with the debug flag and ensures that the rebuilt kernel in large GRF mode is correctly returned, with proper cleanup and final logging.
- Extracts the
debugEnabled
optional once at function scope to avoid shadowing in the try block - Renames the inner compile results to
*_dgrf
variants and swaps them with the outer values when beneficial - Adds cleanup of the unused Level Zero module/kernel and a final debug log after recompilation
Comments suppressed due to low confidence (3)
third_party/intel/backend/driver.c:240
- Checking the optional
debugEnabled
withif (debugEnabled)
only tests for presence of a value, not its truth. Useif (debugEnabled.value_or(false))
to correctly guard debug logs.
if (debugEnabled)
third_party/intel/backend/driver.c:230
- [nitpick] Variable
debugEnabled
uses camelCase, which is inconsistent with the snake_case naming (e.g.,n_spills
) used elsewhere in this file. Consider renaming todebug_enabled
for consistency.
const std::optional<bool> debugEnabled =
third_party/intel/backend/driver.c:239
- The new recompilation branch for large GRF mode isn’t covered by existing tests. Add a unit or integration test that simulates
n_spills > max_reg_spill
to verify the fallback and cleanup behavior.
if (!is_GRF_mode_specified && n_spills > max_reg_spill) {
I am confused - looking through the code right now it looks like the automatic failover for the case when number of spills exceeds a certain threshold is not working because new I also think we need to find a way to unit test this code - especially if it is true that it has been broken all this time. |
third_party/intel/backend/driver.c
Outdated
<< std::endl; | ||
if (n_spills_dgrf < n_spills) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what circumstances would this happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think in most cases number of spills should be less with double GRF, e.g., GEMM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For most case the condition is true always. But it is just for some conner case we don't know yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think we should be adding this logic for corner cases we don’t know about. For one thing, we can’t test it. For another, it introduces a bunch of operations that involve containers we don’t control and can’t introspect. This opens us up to gnarly bugs in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. We'd better not to over protect the in-possible case.
It looks to me that it didn't work, yup.
I think that would be the case even without the new behavior. I would only question the check. |
Any performance impact on Triton Benchmarks ? |
b4c0fca
to
ed3605b
Compare
I kick off a benchmark action. https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15248204645 |
Ok. Please confirm whether there are performance degradations or not. |
I still don’t understand if the current automatic large GRF mode is currently broken, or not. If it’s broken then we should see some impact in the benchmarks, no? Regardless I don’t think we should land this until we have a test included so we don’t go through this again. |
There is no performance regression with the PR. Observe Performance improvement on two benchmarks: The main reason is that the double GRF is not enabled in those two benchmarks original. |
The current automatic large GRF mode is broken. We observed performance increased of two benchmark which is not enable the double GRF by default. |
ed3605b
to
ccda9d2
Compare
Signed-off-by: Lu,Chengjun <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the test and clarifying the problem - this all makes sense now, the code block handling more spills under double GRF was confusing me.
It is interesting that we didn't see much performance impact from breaking this change in the benchmarks. I suppose that is because we set the GRF mode explicitly as part of the auto-tuner config in most of the benchmarks. On the NVIDIA side the number of registers used is returned as a tuning parameter for upstream libraries (e.g. Inductor) to use. I wonder if we could re-work this feature to do something similar. Do other libraries (e.g. vLLM) use n_regs returned by NVIDIA as a tuning parameter?
<< std::endl; | ||
|
||
std::swap(l0_module, l0_module_dgrf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this swap is a good idea. It would be better to encapsulate everything in a struct and use RAII to destroy, vs the swap and manual destroy. If we want to make an issue and do this as a follow-up I would be ok with that, though.
|
||
// clean up the unused module and kernel. | ||
auto error_no = zeKernelDestroy(l0_kernel_dgrf); | ||
if (error_no != ZE_RESULT_SUCCESS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's return the error code with the error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use this issue to track this #4334
We can discus about the idea in tech meeting. From my aspect, I think the |
Right, we can't just return |
The name used in try-catch block shadows the names in outside block. Failed to return the new kernel on exit of the try-catch block. Use the new name inside the try-catch block can fix the issue.