
[SYCL] Reduce urKernelRetain, Release calls when not using kernel bundle or RTC #18324


Conversation

uditagarwal97
Contributor

@uditagarwal97 uditagarwal97 commented May 5, 2025

Problem
Every time we fetch a kernel/program from the in-memory cache, we make a call to urKernelRetain/urKernelRelease/urProgramRetain/urProgramRelease. These calls are expensive.

Solution (Proposed in this PR)

Case 1 (When in-memory cache is enabled):

Instead of shared ownership, transfer ownership of kernel/program handles to the cache. The cache manages the lifetime of program/kernel handles, so there are no more calls to urKernelRetain/Release when fetching items from the cache. During program shutdown, the cache is destroyed and the kernel and program objects it holds are released along with it.

  • (a) Caveat (when eviction is used or the cache has to be cleared due to OOM):
    Consider the case where one thread is using a kernel/program handle fetched from the cache while another thread evicts that entry, perhaps due to OOM. We could then run into a use-after-free bug. To prevent that, this PR guards cache access with a reader-writer lock: threads fetching from the cache hold a reader lock, released once the kernel/program handles are no longer needed, while a thread evicting from the cache holds a writer lock.

Case 2 (When in-memory cache is explicitly disabled):

When the cache is disabled, there is no transfer of ownership to the cache (obviously) and the code behaves exactly as it does today.

@uditagarwal97 uditagarwal97 force-pushed the udit/remove_ur_kernel_retain_call branch from 5248a81 to 189534d Compare May 28, 2025 03:14
@uditagarwal97
Contributor Author

@Alexandr-Konovalov FYI. PR is ready for review.

@uditagarwal97 uditagarwal97 marked this pull request as ready for review May 28, 2025 05:15
@uditagarwal97 uditagarwal97 requested a review from a team as a code owner May 28, 2025 05:15
@uditagarwal97
Contributor Author

uditagarwal97 commented May 28, 2025

Performance: https://intel.github.io/llvm/benchmarks/?runs=Baseline_PVC_L0%2CPR18324_PVC_L0 (max 3-4% improvement)

@uditagarwal97 uditagarwal97 requested a review from vinser52 May 28, 2025 05:39
Contributor

@vinser52 vinser52 left a comment


Did I understand correctly that the RW lock you introduced covers the whole cache, not a single cache entry? How does this approach compare to #18565?

// (6) The writer is only allowed to delete an entry; we do not support
// modifying an entry in the cache.

void acquireReaderLock() {
Contributor

Why do you need to invent your own RW lock?

Contributor Author

IIUC, the closest thing to a reader/writer lock in C++ is std::shared_lock (https://en.cppreference.com/w/cpp/thread/shared_lock.html) together with std::unique_lock. Both operate over a mutex. In my implementation, I've used an atomic variable instead, which I expect to be faster than a mutex here because contention between threads is low (we'll evict from the cache rarely). In my understanding, for simple atomic-counter-like uses, std::atomic performs better than a mutex, since the former can leverage HW support for atomic ops while a mutex may also require OS support (like the futex syscall on Linux).
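A minimal sketch of the idea described above: a spin reader-writer lock built on a single `std::atomic` counter, with a high bit reserved for the writer. This illustrates the technique, not the PR's actual implementation; all names are hypothetical.

```cpp
#include <atomic>
#include <thread>

// Readers bump a counter; the (rare) writer claims a "writer pending" bit and
// spins until in-flight readers drain. New readers back off while the bit is
// set, so the writer cannot be starved indefinitely by a reader stream.
class SpinRWLock {
  static constexpr unsigned WriterBit = 1u << 31;
  std::atomic<unsigned> State{0}; // low bits = number of active readers

public:
  void acquireReaderLock() {
    for (;;) {
      unsigned S = State.load(std::memory_order_relaxed);
      if (S & WriterBit) {             // a writer is pending/active: back off
        std::this_thread::yield();
        continue;
      }
      if (State.compare_exchange_weak(S, S + 1, std::memory_order_acquire))
        return;                        // registered as a reader
    }
  }
  void releaseReaderLock() { State.fetch_sub(1, std::memory_order_release); }

  void acquireWriterLock() {
    // Claim the writer bit (excludes other writers and new readers) ...
    for (;;) {
      unsigned S = State.load(std::memory_order_relaxed);
      if (S & WriterBit) {
        std::this_thread::yield();
        continue;
      }
      if (State.compare_exchange_weak(S, S | WriterBit,
                                      std::memory_order_acquire))
        break;
    }
    // ... then wait for the reader count to reach zero.
    while (State.load(std::memory_order_acquire) != WriterBit)
      std::this_thread::yield();
  }
  void releaseWriterLock() {
    State.fetch_sub(WriterBit, std::memory_order_release);
  }
};
```

In the uncontended reader path this costs one compare-exchange to acquire and one atomic subtract to release, which is the low-overhead case the comment above is optimizing for.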

@@ -2703,6 +2703,8 @@ void enqueueImpKernel(

std::shared_ptr<kernel_impl> SyclKernelImpl;
std::shared_ptr<device_image_impl> DeviceImageImpl;
// Transfer ownership only if cache is enabled.
const bool TransferownerShipToCache = SYCLConfig<SYCL_CACHE_IN_MEM>::get();
Contributor

Should it be TransferOwnershipToCache instead of TransferownerShipToCache?

@@ -2703,6 +2703,8 @@ void enqueueImpKernel(

std::shared_ptr<kernel_impl> SyclKernelImpl;
std::shared_ptr<device_image_impl> DeviceImageImpl;
// Transfer ownership only if cache is enabled.
const bool TransferownerShipToCache = SYCLConfig<SYCL_CACHE_IN_MEM>::get();
Contributor

Also, I think this breaks encapsulation: the cache config should be read inside the cache implementation, not by the caller of the cache.

@vinser52
Contributor

Performance: https://intel.github.io/llvm/benchmarks/?runs=Baseline_PVC_L0%2CPR18324_PVC_L0 (max 3-4% improvement)

We need to test with the v2 adapter.

@vinser52
Contributor

As we agreed at the meeting, we will proceed with the approach in #18565.

@uditagarwal97 uditagarwal97 deleted the udit/remove_ur_kernel_retain_call branch May 28, 2025 16:04