Skip to content

Fix issue in recompiling kernel with double GRF mode. #4287

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 28, 2025

Conversation

chengjunlu
Copy link
Contributor

@chengjunlu chengjunlu commented May 23, 2025

The name used in try-catch block shadows the names in outside block. Failed to return the new kernel on exit of the try-catch block. Use the new name inside the try-catch block can fix the issue.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a shadowing issue with the debug flag and ensures that the rebuilt kernel in large GRF mode is correctly returned, with proper cleanup and final logging.

  • Extracts the debugEnabled optional once at function scope to avoid shadowing in the try block
  • Renames the inner compile results to *_dgrf variants and swaps them with the outer values when beneficial
  • Adds cleanup of the unused Level Zero module/kernel and a final debug log after recompilation
Comments suppressed due to low confidence (3)

third_party/intel/backend/driver.c:240

  • Checking the optional debugEnabled with if (debugEnabled) only tests for presence of a value, not its truth. Use if (debugEnabled.value_or(false)) to correctly guard debug logs.
if (debugEnabled)

third_party/intel/backend/driver.c:230

  • [nitpick] Variable debugEnabled uses camelCase, which is inconsistent with the snake_case naming (e.g., n_spills) used elsewhere in this file. Consider renaming to debug_enabled for consistency.
const std::optional<bool> debugEnabled =

third_party/intel/backend/driver.c:239

  • The new recompilation branch for large GRF mode isn’t covered by existing tests. Add a unit or integration test that simulates n_spills > max_reg_spill to verify the fallback and cleanup behavior.
if (!is_GRF_mode_specified && n_spills > max_reg_spill) {

@alexbaden
Copy link
Contributor

I am confused - looking through the code right now it looks like the automatic failover for the case when number of spills exceeds a certain threshold is not working because new l0_module, l0_kernel objects are created inside the try/catch block but are not used. Is that correct? Is that what this PR attempts to fix? It also looks like you also change the behavior to only use the large GRF mode if it results in fewer spills. Seems reasonable, but I wonder if it's worth the extra complexity of having to maintain both objects, and then manually delete the older ones.

I also think we need to find a way to unit test this code - especially if it is true that it has been broken all this time.

<< std::endl;
if (n_spills_dgrf < n_spills) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what circumstances would this happen?

Copy link
Contributor

@whitneywhtsang whitneywhtsang May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think in most cases number of spills should be less with double GRF, e.g., GEMM.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For most case the condition is true always. But it is just for some conner case we don't know yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think we should be adding this logic for corner cases we don’t know about. For one thing, we can’t test it. For another, it introduces a bunch of operations that involve containers we don’t control and can’t introspect. This opens us up to gnarly bugs in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. We'd better not to over protect the in-possible case.

@kurapov-peter
Copy link
Contributor

It looks to me that it didn't work, yup.

Seems reasonable, but I wonder if it's worth the extra complexity of having to maintain both objects, and then manually delete the older ones.

I think that would be the case even without the new behavior. I would only question the check.

@etiotto
Copy link
Contributor

etiotto commented May 23, 2025

Any performance impact on Triton Benchmarks ?

@chengjunlu chengjunlu force-pushed the chengjun/fix_driver_issue branch 3 times, most recently from b4c0fca to ed3605b Compare May 26, 2025 06:51
@chengjunlu
Copy link
Contributor Author

Any performance impact on Triton Benchmarks ?

I kick off a benchmark action. https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15248204645

@etiotto
Copy link
Contributor

etiotto commented May 26, 2025

Any performance impact on Triton Benchmarks ?

I kick off a benchmark action. https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15248204645

Ok. Please confirm whether there are performance degradations or not.

@alexbaden
Copy link
Contributor

I still don’t understand if the current automatic large GRF mode is currently broken, or not. If it’s broken then we should see some impact in the benchmarks, no? Regardless I don’t think we should land this until we have a test included so we don’t go through this again.

@chengjunlu
Copy link
Contributor Author

chengjunlu commented May 27, 2025

Any performance impact on Triton Benchmarks ?

I kick off a benchmark action. https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15248204645

Ok. Please confirm whether there are performance degradations or not.

There is no performance regression with the PR.

Observe Performance improvement on two benchmarks:

  1. flash-attn-backward raise to 5 from 3.
    image
  2. flex-attn-masks raise to 20 from 10.
    image

The main reason is that the double GRF is not enabled in those two benchmarks original.

@chengjunlu
Copy link
Contributor Author

I still don’t understand if the current automatic large GRF mode is currently broken, or not. If it’s broken then we should see some impact in the benchmarks, no? Regardless I don’t think we should land this until we have a test included so we don’t go through this again.

The current automatic large GRF mode is broken. We observed performance increased of two benchmark which is not enable the double GRF by default.

@chengjunlu chengjunlu force-pushed the chengjun/fix_driver_issue branch from ed3605b to ccda9d2 Compare May 27, 2025 03:11
Copy link
Contributor

@alexbaden alexbaden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the test and clarifying the problem - this all makes sense now, the code block handling more spills under double GRF was confusing me.

It is interesting that we didn't see much performance impact from breaking this change in the benchmarks. I suppose that is because we set the GRF mode explicitly as part of the auto-tuner config in most of the benchmarks. On the NVIDIA side the number of registers used is returned as a tuning parameter for upstream libraries (e.g. Inductor) to use. I wonder if we could re-work this feature to do something similar. Do other libraries (e.g. vLLM) use n_regs returned by NVIDIA as a tuning parameter?

<< std::endl;

std::swap(l0_module, l0_module_dgrf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this swap is a good idea. It would be better to encapsulate everything in a struct and use RAII to destroy, vs the swap and manual destroy. If we want to make an issue and do this as a follow-up I would be ok with that, though.


// clean up the unused module and kernel.
auto error_no = zeKernelDestroy(l0_kernel_dgrf);
if (error_no != ZE_RESULT_SUCCESS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's return the error code with the error message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use this issue to track this #4334

@chengjunlu
Copy link
Contributor Author

Thanks for adding the test and clarifying the problem - this all makes sense now, the code block handling more spills under double GRF was confusing me.

It is interesting that we didn't see much performance impact from breaking this change in the benchmarks. I suppose that is because we set the GRF mode explicitly as part of the auto-tuner config in most of the benchmarks. On the NVIDIA side the number of registers used is returned as a tuning parameter for upstream libraries (e.g. Inductor) to use. I wonder if we could re-work this feature to do something similar. Do other libraries (e.g. vLLM) use n_regs returned by NVIDIA as a tuning parameter?

We can discus about the idea in tech meeting. From my aspect, I think the n_regs is not aligned to Xe, Xe2 or Xe3. If we use the n_regs, it might give wrong information to the user.

@alexbaden
Copy link
Contributor

Right, we can't just return n_regs because for one thing we don't currently get the number of registers used. But the goal is the same - provide the user with feedback about register pressure so they can make changes in the tuning configuration for the kernel. Yes, NVIDIA does not have an explicit parameter for setting the register file size, but the tradeoff is the same - bigger register file == less parallelism.

@chengjunlu chengjunlu merged commit 664df8a into main May 28, 2025
16 checks passed
@chengjunlu chengjunlu deleted the chengjun/fix_driver_issue branch May 28, 2025 02:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DRIVER] The auto GRF mode in Triton driver doesn't work as expected.
5 participants