Fix issue in recompiling kernel with double GRF mode. #4287

chengjunlu · 2025-05-23T04:25:33Z

The names declared inside the try-catch block shadow the names in the enclosing block, so the newly compiled kernel was never returned on exit from the try-catch block. Using distinct names inside the try-catch block fixes the issue.
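The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual driver code: `Kernel`, `compile`, `build_buggy`, and `build_fixed` are hypothetical names introduced only for illustration, and the spill counts are made up.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a compiled kernel and its spill count.
struct Kernel {
  std::string mode;
  int n_spills;
};

// Hypothetical compile step: pretend the large GRF build removes all spills.
Kernel compile(const std::string &mode) {
  return mode == "large" ? Kernel{"large", 0} : Kernel{"default", 100};
}

// Buggy variant: the declaration inside the try block shadows the outer
// `kernel`, so the recompiled result is dropped when the block exits.
Kernel build_buggy() {
  Kernel kernel = compile("default");
  if (kernel.n_spills > 0) {
    try {
      Kernel kernel = compile("large"); // shadows the outer variable
      (void)kernel;
    } catch (...) {
      // keep the default build on failure
    }
  }
  return kernel; // still the default build
}

// Fixed variant: a distinct inner name, then an explicit hand-over.
Kernel build_fixed() {
  Kernel kernel = compile("default");
  if (kernel.n_spills > 0) {
    try {
      Kernel kernel_dgrf = compile("large");
      kernel = kernel_dgrf;
    } catch (...) {
      // keep the default build on failure
    }
  }
  return kernel;
}
```

In this sketch `build_buggy()` returns the default build even though the recompilation succeeded, while `build_fixed()` returns the large GRF build. Compiling with `-Wshadow` would flag the inner declaration in the buggy variant.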

Copilot

Pull Request Overview

This PR fixes a shadowing issue with the debug flag and ensures that the rebuilt kernel in large GRF mode is correctly returned, with proper cleanup and final logging.

Extracts the debugEnabled optional once at function scope to avoid shadowing in the try block
Renames the inner compile results to *_dgrf variants and swaps them with the outer values when beneficial
Adds cleanup of the unused Level Zero module/kernel and a final debug log after recompilation

Comments suppressed due to low confidence (3)

third_party/intel/backend/driver.c:240

Checking the optional debugEnabled with if (debugEnabled) only tests for presence of a value, not its truth. Use if (debugEnabled.value_or(false)) to correctly guard debug logs.

if (debugEnabled)
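The difference between the two guards can be checked on its own. This is a minimal sketch that assumes only standard C++17 `std::optional`; the helper names `guard_by_presence` and `guard_by_value` are hypothetical.

```cpp
#include <cassert>
#include <optional>

// Mirrors `if (debugEnabled)`: true whenever a value is present,
// even if that value is false.
bool guard_by_presence(const std::optional<bool> &flag) {
  return static_cast<bool>(flag);
}

// Mirrors `if (debugEnabled.value_or(false))`: true only when a value
// is present and that value is true.
bool guard_by_value(const std::optional<bool> &flag) {
  return flag.value_or(false);
}
```

The two guards disagree only for an optional that holds `false`: the presence check passes, while the `value_or(false)` check does not.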

third_party/intel/backend/driver.c:230

[nitpick] Variable debugEnabled uses camelCase, which is inconsistent with the snake_case naming (e.g., n_spills) used elsewhere in this file. Consider renaming to debug_enabled for consistency.

const std::optional<bool> debugEnabled =

third_party/intel/backend/driver.c:239

The new recompilation branch for large GRF mode isn’t covered by existing tests. Add a unit or integration test that simulates n_spills > max_reg_spill to verify the fallback and cleanup behavior.

if (!is_GRF_mode_specified && n_spills > max_reg_spill) {
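One possible way to make this branch testable, sketched under the assumption that the condition can be factored out of the driver into a pure function. The helper name `should_retry_with_large_grf` is hypothetical and does not exist in driver.c.

```cpp
#include <cassert>

// Hypothetical helper mirroring the condition quoted above: retry with
// large GRF only when the user did not pick a GRF mode explicitly and the
// spill count exceeds the threshold.
bool should_retry_with_large_grf(bool is_grf_mode_specified, int n_spills,
                                 int max_reg_spill) {
  return !is_grf_mode_specified && n_spills > max_reg_spill;
}
```

A unit test can then cover the boundary (`n_spills == max_reg_spill` does not trigger a retry) without needing a device. Verifying the actual recompilation and cleanup would still need an integration test.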

alexbaden · 2025-05-23T12:40:30Z

I am confused - looking through the code right now, it looks like the automatic failover for the case when the number of spills exceeds a certain threshold is not working, because new l0_module and l0_kernel objects are created inside the try/catch block but are not used. Is that correct? Is that what this PR attempts to fix? It also looks like you change the behavior to only use the large GRF mode if it results in fewer spills. Seems reasonable, but I wonder if it's worth the extra complexity of having to maintain both objects, and then manually delete the older ones.

I also think we need to find a way to unit test this code - especially if it is true that it has been broken all this time.

alexbaden · 2025-05-23T12:48:32Z

third_party/intel/backend/driver.c

                      << std::endl;
+          if (n_spills_dgrf < n_spills) {


Under what circumstances would this happen?

whitneywhtsang · 2025-05-23

I would think in most cases the number of spills should be lower with double GRF, e.g., GEMM.

chengjunlu · 2025-05-26

In most cases the condition is always true. The check is only there for some corner case we don't know about yet.

alexbaden · 2025-05-26

I don’t think we should be adding this logic for corner cases we don’t know about. For one thing, we can’t test it. For another, it introduces a bunch of operations that involve containers we don’t control and can’t introspect. This opens us up to gnarly bugs in the future.

chengjunlu · 2025-05-27

Makes sense. We'd better not overprotect against an impossible case.

kurapov-peter · 2025-05-23T12:48:39Z

It looks to me that it didn't work, yup.

> Seems reasonable, but I wonder if it's worth the extra complexity of having to maintain both objects, and then manually delete the older ones.

I think that would be the case even without the new behavior. I would only question the check.

etiotto · 2025-05-23T16:56:02Z

Any performance impact on the Triton benchmarks?

chengjunlu · 2025-05-26T06:55:15Z

> Any performance impact on the Triton benchmarks?

I kicked off a benchmark action: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15248204645

etiotto · 2025-05-26T22:46:08Z

> > Any performance impact on the Triton benchmarks?
>
> I kicked off a benchmark action: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15248204645

Ok. Please confirm whether there are performance degradations or not.

alexbaden · 2025-05-26T23:01:47Z

I still don’t understand if the current automatic large GRF mode is currently broken, or not. If it’s broken then we should see some impact in the benchmarks, no? Regardless I don’t think we should land this until we have a test included so we don’t go through this again.

chengjunlu · 2025-05-27T01:32:11Z

> > > Any performance impact on the Triton benchmarks?
> >
> > I kicked off a benchmark action: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/15248204645
>
> Ok. Please confirm whether there are performance degradations or not.

There is no performance regression with this PR.

We observed performance improvements on two benchmarks:

flash-attn-backward rose from 3 to 5.
flex-attn-masks rose from 10 to 20.

The main reason is that double GRF was not originally enabled in those two benchmarks.

chengjunlu · 2025-05-27T01:38:10Z

> I still don’t understand if the current automatic large GRF mode is currently broken, or not. If it’s broken then we should see some impact in the benchmarks, no? Regardless I don’t think we should land this until we have a test included so we don’t go through this again.

The current automatic large GRF mode is broken. We observed a performance increase on two benchmarks that do not enable double GRF by default.

Signed-off-by: Lu,Chengjun <[email protected]>

alexbaden

Thanks for adding the test and clarifying the problem - this all makes sense now, the code block handling more spills under double GRF was confusing me.

It is interesting that we didn't see much performance impact in the benchmarks from this feature being broken. I suppose that is because we set the GRF mode explicitly as part of the auto-tuner config in most of the benchmarks. On the NVIDIA side the number of registers used is returned as a tuning parameter for upstream libraries (e.g. Inductor) to use. I wonder if we could re-work this feature to do something similar. Do other libraries (e.g. vLLM) use n_regs returned by NVIDIA as a tuning parameter?

alexbaden · 2025-05-27T13:02:44Z

third_party/intel/backend/driver.c

                      << std::endl;
+
+          std::swap(l0_module, l0_module_dgrf);


I don't think this swap is a good idea. It would be better to encapsulate everything in a struct and use RAII to destroy, vs the swap and manual destroy. If we want to make an issue and do this as a follow-up I would be ok with that, though.
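A rough sketch of the RAII approach suggested here, under stated assumptions: `FakeHandle`, `fake_kernel_destroy`, `fake_module_destroy`, `CompiledKernel`, and `demo_recompile` are all hypothetical stand-ins so the example compiles without the Level Zero headers. In the driver the destructor would presumably call `zeKernelDestroy` and `zeModuleDestroy` on the real handles instead.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-ins for Level Zero handles and destroy calls.
using FakeHandle = int;
int g_destroyed_kernels = 0;
int g_destroyed_modules = 0;
void fake_kernel_destroy(FakeHandle) { ++g_destroyed_kernels; }
void fake_module_destroy(FakeHandle) { ++g_destroyed_modules; }

// Owns one module/kernel pair and destroys the kernel before its module.
class CompiledKernel {
public:
  CompiledKernel(FakeHandle module, FakeHandle kernel, int n_spills)
      : module_(module), kernel_(kernel), n_spills_(n_spills), owns_(true) {}
  CompiledKernel(const CompiledKernel &) = delete;
  CompiledKernel &operator=(const CompiledKernel &) = delete;
  CompiledKernel(CompiledKernel &&other) noexcept
      : module_(other.module_), kernel_(other.kernel_),
        n_spills_(other.n_spills_),
        owns_(std::exchange(other.owns_, false)) {}
  CompiledKernel &operator=(CompiledKernel &&other) noexcept {
    if (this != &other) {
      release(); // destroy the pair being replaced
      module_ = other.module_;
      kernel_ = other.kernel_;
      n_spills_ = other.n_spills_;
      owns_ = std::exchange(other.owns_, false);
    }
    return *this;
  }
  ~CompiledKernel() { release(); }

  FakeHandle kernel() const { return kernel_; }
  int n_spills() const { return n_spills_; }

private:
  void release() {
    if (owns_) {
      fake_kernel_destroy(kernel_);
      fake_module_destroy(module_);
      owns_ = false;
    }
  }
  FakeHandle module_;
  FakeHandle kernel_;
  int n_spills_;
  bool owns_;
};

// Replacing the default build with the large GRF build destroys the old
// pair automatically; no swap or manual cleanup is needed.
FakeHandle demo_recompile() {
  CompiledKernel current(/*module=*/1, /*kernel=*/10, /*n_spills=*/100);
  CompiledKernel dgrf(/*module=*/2, /*kernel=*/20, /*n_spills=*/0);
  current = std::move(dgrf);
  return current.kernel();
}
```

With this shape, move assignment releases the superseded pair and the destructor releases whichever pair is still owned, so each handle is destroyed exactly once on every exit path, including exceptions.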

alexbaden · 2025-05-27T13:02:53Z

third_party/intel/backend/driver.c

+
+          // clean up the unused module and kernel.
+          auto error_no = zeKernelDestroy(l0_kernel_dgrf);
+          if (error_no != ZE_RESULT_SUCCESS) {


Let's return the error code with the error message.

chengjunlu · 2025-05-28

Let's use this issue to track this: #4334
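For the follow-up, one way to include the error code in the message could look like the sketch below. The helper name `format_ze_error` is hypothetical, and the code is passed as a plain integer so that the example does not depend on the Level Zero headers.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: builds an error message that carries the numeric
// result code in hexadecimal alongside the name of the failing call.
std::string format_ze_error(const std::string &call, std::uint32_t code) {
  std::ostringstream os;
  os << call << " failed with error code 0x" << std::hex << code;
  return os.str();
}
```

In the driver, the value passed as `code` would be the `error_no` returned by a call such as `zeKernelDestroy`, cast to an unsigned integer.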

chengjunlu · 2025-05-27T13:54:24Z

> Thanks for adding the test and clarifying the problem - this all makes sense now, the code block handling more spills under double GRF was confusing me.
>
> It is interesting that we didn't see much performance impact in the benchmarks from this feature being broken. I suppose that is because we set the GRF mode explicitly as part of the auto-tuner config in most of the benchmarks. On the NVIDIA side the number of registers used is returned as a tuning parameter for upstream libraries (e.g. Inductor) to use. I wonder if we could re-work this feature to do something similar. Do other libraries (e.g. vLLM) use n_regs returned by NVIDIA as a tuning parameter?

We can discuss the idea in the tech meeting. From my perspective, n_regs is not aligned with Xe, Xe2, or Xe3. If we use n_regs, it might give the user wrong information.

alexbaden · 2025-05-27T14:00:40Z

Right, we can't just return n_regs because for one thing we don't currently get the number of registers used. But the goal is the same - provide the user with feedback about register pressure so they can make changes in the tuning configuration for the kernel. Yes, NVIDIA does not have an explicit parameter for setting the register file size, but the tradeoff is the same - bigger register file == less parallelism.

chengjunlu requested review from Copilot, anmyachev and alexbaden May 23, 2025 04:25

Copilot AI reviewed May 23, 2025

kurapov-peter approved these changes May 23, 2025

alexbaden reviewed May 23, 2025

chengjunlu force-pushed the chengjun/fix_driver_issue branch 3 times, most recently from b4c0fca to ed3605b Compare May 26, 2025 06:51

chengjunlu force-pushed the chengjun/fix_driver_issue branch from ed3605b to ccda9d2 Compare May 27, 2025 03:11

Fix issue in recompiling kernel with double GRF mode.

ccda9d2

Signed-off-by: Lu,Chengjun <[email protected]>

etiotto approved these changes May 27, 2025

alexbaden approved these changes May 27, 2025

whitneywhtsang approved these changes May 27, 2025

chengjunlu mentioned this pull request May 28, 2025

[Driver] Enhance the error handling in driver.c #4334

Open

chengjunlu merged commit 664df8a into main May 28, 2025
16 checks passed

chengjunlu deleted the chengjun/fix_driver_issue branch May 28, 2025 02:17

chengjunlu linked an issue May 28, 2025 that may be closed by this pull request

[DRIVER] The auto GRF mode in Triton driver doesn't work as expected. #4286

Closed
