[SYCL] Reduce urKernelRetain, Release calls when not using kernel bundle or RTC #18324
Conversation
Compare changes: 5248a81 to 189534d
@Alexandr-Konovalov FYI. PR is ready for review.
Performance: https://intel.github.io/llvm/benchmarks/?runs=Baseline_PVC_L0%2CPR18324_PVC_L0 (max 3-4% improvement)
Did I understand correctly that the RW lock you introduced is per the whole cache, not per cache entry? How does this approach compare to #18565?
// (6) The writer is only allowed to delete an entry; we do not support
// modifying an entry in the cache.

void acquireReaderLock() {
Why do you need to invent your own RW lock?
IIUC, the closest thing to a reader/writer lock in C++ is std::shared_lock (https://en.cppreference.com/w/cpp/thread/shared_lock.html) together with std::unique_lock. Both of these operate over a mutex. In my implementation, I've used an atomic variable instead, which I suppose will be faster than a mutex here because contention between threads is low (we'll evict from the cache rarely). In my understanding, for simple atomic counter-like applications, std::atomic performs better than a mutex, as the former can leverage HW support for atomic ops, while a mutex would also require OS support (like the futex syscall on Linux).
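For illustration, a counter-based reader/writer lock of the kind described here could be sketched as follows. This is only a sketch under assumed semantics, not the PR's actual implementation; the class name `SpinRWLock` and its members are hypothetical. A single atomic integer encodes the lock state: a non-negative value is the number of active readers, and -1 means a writer holds the lock.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical sketch of a reader/writer lock over one std::atomic<int>.
// State >= 0: number of active readers; State == -1: writer is active.
class SpinRWLock {
  std::atomic<int> State{0};

public:
  void acquireReaderLock() {
    for (;;) {
      int Cur = State.load(std::memory_order_relaxed);
      // A reader may join only while no writer is active.
      if (Cur >= 0 &&
          State.compare_exchange_weak(Cur, Cur + 1, std::memory_order_acquire))
        return;
      std::this_thread::yield();
    }
  }

  void releaseReaderLock() { State.fetch_sub(1, std::memory_order_release); }

  void acquireWriterLock() {
    for (;;) {
      // The writer may enter only when there are no active readers.
      int Expected = 0;
      if (State.compare_exchange_weak(Expected, -1, std::memory_order_acquire))
        return;
      std::this_thread::yield();
    }
  }

  void releaseWriterLock() { State.store(0, std::memory_order_release); }
};
```

The fast path for readers is a single CAS on an atomic counter, which is the property the comment above relies on: with rare evictions (writers), readers almost never spin.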
@@ -2703,6 +2703,8 @@ void enqueueImpKernel(

  std::shared_ptr<kernel_impl> SyclKernelImpl;
  std::shared_ptr<device_image_impl> DeviceImageImpl;
  // Transfer ownership only of cache is enabled.
  const bool TransferownerShipToCache = SYCLConfig<SYCL_CACHE_IN_MEM>::get();
Should it be TransferOwnershipToCache instead of TransferownerShipToCache?
Also, I think it breaks encapsulation: the cache config should be read inside the cache implementation, not by the caller of the cache.
We need to test with the v2 adapter.
As we agreed at the meeting, we will proceed with the approach in #18565.
Problem
Every time we fetch a kernel/program from the in-memory cache, we make a call to urKernelRetain/urKernelRelease/urProgramRetain/urProgramRelease. These calls are expensive.
Solution (Proposed in this PR)
Case 1 (When in-memory cache is enabled):
Instead of shared ownership, transfer ownership of kernel/program handles to the cache. The cache manages the lifetime of program/kernel handles, so there are no more calls to urKernelRetain/Release when fetching items from the cache. During program shutdown, the cache is destroyed, and the kernel and program objects it holds are released as well.
Consider the case where one thread is using a kernel/program handle fetched from the cache and another thread has just evicted that item from the cache, perhaps due to OOM. In this case, we might run into a use-after-free bug. To prevent that, this PR guards cache access with a reader-writer lock. Threads fetching from the cache hold a reader lock, which is released when the kernel/program handles are no longer needed. A thread trying to evict from the cache holds a writer lock.
Case 2 (When in-memory cache is explicitly disabled):
When the cache is disabled, there is (obviously) no transfer of ownership to the cache, and the code works just as it does today.