-
Notifications
You must be signed in to change notification settings - Fork 768
[SYCL] Use shared_ptr instead of manual changing UR counters #18565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Use shared_ptr instead of manual changing UR counters #18565
Conversation
Keep kernel and program handle in a structure with lifetime controlled by shared_ptr. This is faster wrt current implementation, because only one atomic operation is required for copying-destroying shared_ptr, while pair of kernel/program retain/release requires at least 2 atomic operations. Signed-off-by: Alexandr Konovalov <[email protected]>
Signed-off-by: Alexandr Konovalov <[email protected]>
Signed-off-by: Alexandr Konovalov <[email protected]>
Signed-off-by: Alexandr Konovalov <[email protected]>
@intel/dpcpp-nativecpu-reviewers please review. |
I'm not sure why NativeCPU needs to review this. The diff shows |
Sorry, my mistake. Removed. |
|
||
struct FastKernelCacheVal { | ||
ur_kernel_handle_t MKernelHandle; /* UR kernel handle pointer. */ | ||
std::mutex *MMutex; /* Mutex guarding this kernel. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be a reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, no, as we pass nullptr
, when caching is disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest making such a change in a separate PR if we agree it is needed. This PR just changes the ref counting schema for the cache, but not the data structure fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, no, as we pass
nullptr
, when caching is disabled.
This is important, can you mention that in the in-code comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, no, as we pass
nullptr
, when caching is disabled.This is important, can you mention that in the in-code comment?
Done.
const KernelArgMask *MKernelArgMask; /* Eliminated kernel argument mask. */ | ||
ur_program_handle_t MProgramHandle; /* UR program handle corresponding to | ||
this kernel. */ | ||
const Adapter *MAdapterPtr; /* We can keep raw pointer to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
MKernelArgMask(KernelArgMask), MProgramHandle(ProgramHandle), | ||
MAdapterPtr(AdapterPtr) {} | ||
|
||
~FastKernelCacheVal() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we set to nullptr
after "Release"? It should be relatively cheap but can save lots of time debugging obscure errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
KernelCacheVal = detail::ProgramManager::getInstance().getOrCreateKernel( | ||
ContextImpl, DeviceImpl, KernelName, KernelNameBasedCachePtr, NDRDesc); | ||
Kernel = KernelCacheVal->MKernelHandle; | ||
KernelMutex = KernelCacheVal->MMutex; | ||
Program = KernelCacheVal->MProgramHandle; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How often do we need to extract data like that? Would enabling structured bindings for this class make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even in that particular code snippet, we cannot use structured bindings because the Kernel
, KernelMutex
, Program
, and EliminatedArgMask
variables are declared somewhere above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And additionally, reference counting is now supported by shared_ptr
, so we have to keep shared_ptr
, to not get race condition (use handler after decrementing counters).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't std::tie
allow to use them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure, but I do not think so. As I understand the KernelCacheVal
should be convertible to std::tuple
.
Anyway, let's keep it out of scope for this PR.
Co-authored-by: aelovikov-intel <[email protected]>
I haven't looked through this in details, so won't be LGTM'ing, but I'm not having any outstanding requests either. Feel free to merge once all is green. |
Colleagues @fabiomestre , @sergey-semenov, @againull , could it be merged or some further improvements are required? |
Keep kernel and program handle in a structure that lifetime is controlled by shared_ptr. This is faster wrt current implementation, because only one atomic operation is required for copying-destroying shared_ptr, while pair of kernel/program retain/release calls requires 2 atomic operations in the best case.