Simplify UTF-16 validation Vector128 codepath #121981
Conversation
Combine the SSE2 codepath with a more generic Vector128 algorithm. AdvSimd is handled slightly differently to avoid using Vector128.ExtractMostSignificantBits, because Arm has no equivalent instruction and emulating it would be very slow.
cc @dotnet/arm64-contrib @a74nh @SwapnilGaikwad @tannergooding @EgorBo
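For context, the usual Arm-side workaround looks roughly like the sketch below (hypothetical, simplified code, not necessarily what this PR does): instead of emulating `ExtractMostSignificantBits`, a narrowing instruction (`xtn`) compresses each 16-bit comparison lane into one byte of a 64-bit scalar mask.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static ulong GetMatchMask(Vector128<ushort> cmp)
{
    // cmp is assumed to be a comparison result: every lane is 0 or 0xFFFF.
    if (AdvSimd.IsSupported)
    {
        // xtn: narrow each 16-bit lane to 8 bits, so each matching element
        // contributes one 0xFF byte to the resulting 64-bit mask.
        return AdvSimd.ExtractNarrowingLower(cmp).AsUInt64().ToScalar();
    }

    // On x86, movemask-style extraction yields one bit per element.
    return cmp.ExtractMostSignificantBits();
}
```

Note the two formats differ: 8 mask bits per element on Arm versus 1 on x86, so consumers of the mask have to adjust (e.g. divide trailing-zero counts by 8 on Arm).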
// Otherwise, encode the mask with 8-bits (one byte), where each bit represents one element.
return cmp.ExtractMostSignificantBits();
It seems that we call GetSurrogateMask with a LessThan input (hence, we have a guarantee that all elements are either 0 or AllBitsSet) -- we have a long-standing issue to optimize this in the JIT instead (to avoid special-casing all ExtractMostSignificantBits calls in managed code), cc @tannergooding
Would the current way of wrapping ExtractMostSignificantBits make it difficult for the JIT to detect the optimization? And is this an x86-specific optimization, or is there something similar we can do for Arm?
The managed implementations of these use ExtractMostSignificantBits, but they are marked [Intrinsic], so the JIT IR doesn't have to use it.
The general point of these APIs is that the JIT IR can track that we're doing an IndexOf operation, so it can reason directly about the inputs, allowing us to generate better code than the naive ExtractMostSignificantBits implementation would.
We should add that support for cases like these, if it's not optimized already, and/or for the common patterns involving ExtractMostSignificantBits (like PopCount(ExtractMostSignificantBits())) so they are doing the "good" thing.
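The match-counting pattern mentioned above, and an Arm-friendly equivalent a JIT could substitute for it, might look like this (a sketch, not runtime code):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static int CountMatches(Vector128<ushort> cmp)
{
    // The naive, x86-friendly pattern: one mask bit per matching element.
    return BitOperations.PopCount(cmp.ExtractMostSignificantBits());
}

static int CountMatchesArmFriendly(Vector128<ushort> cmp)
{
    // Each matching lane of a comparison result is all-ones, i.e. -1 as a
    // signed integer, so the negated horizontal sum is the match count.
    return -Vector128.Sum(cmp.AsInt16());
}
```

Both assume `cmp` is a comparison result whose lanes are all-zero or all-ones; the second avoids any movemask-style operation entirely.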
Looking at this, we're not planning to add intrinsics for ExtractMostSignificantBits and its patterns just yet, but we'll add them to the backlog.
In the meantime, is this PR still useful? It's a nice refactor: it reduces two distinct versions to a single algorithm, cutting code size and complexity, and it gives a small perf boost, ~2% overall. On the downside, it adds code churn for little gain. If you don't want this PR now, we can close it and return to it once the intrinsics have been fixed.
Re-attempt at #121383.
Refactor the vectorized code path by combining the SSE2 "intrinsified" path with the original Vector128 algorithm. There is still some platform-specific code (for AdvSimd), as it is difficult to fully rely on Vector128 APIs without sacrificing too much performance. The main issue is the lack of an instruction for `Vector128.ExtractMostSignificantBits` on Arm, so the code is significantly slower when forced to use the same mask format as the SSE2 algorithm. I have looked into using `IndexOf` and `Count` etc., but they also use `ExtractMostSignificantBits`, so they pose the same problem. This PR tries to encapsulate this difference in a few helper methods so both platforms can share the same code path for the main algorithm.
Performance-wise there is not much improvement, but hopefully the code will be easier to maintain.
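The helper-method encapsulation described above could be imagined along these lines (hypothetical names and a sketch only; the PR's actual helpers may differ), assuming a mask where each match occupies 8 bits on Arm and 1 bit on x86:

```csharp
using System.Numerics;
using System.Runtime.Intrinsics.Arm;

static class MaskHelpers // hypothetical name
{
    // Assumed mask formats: on Arm each matching element contributes 8 set
    // bits (one 0xFF byte) to a 64-bit mask; on x86 it contributes 1 bit.
    private static int BitsPerElement => AdvSimd.IsSupported ? 8 : 1;

    public static bool AnyMatch(ulong mask) => mask != 0;

    // Callers are expected to check AnyMatch first; for a zero mask the
    // trailing-zero count is 64, which does not map to a valid element.
    public static int IndexOfFirstMatch(ulong mask) =>
        BitOperations.TrailingZeroCount(mask) / BitsPerElement;

    public static int CountMatches(ulong mask) =>
        BitOperations.PopCount(mask) / BitsPerElement;
}
```

With helpers like these, the main validation loop can stay platform-agnostic while only the mask production and interpretation differ per platform.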
Arm Neoverse-V2:
Intel Sapphire Rapids: