Conversation

@ylpoonlg (Contributor):

Re-attempt at #121383.

Refactor the vectorized code path by combining the SSE2 "intrinsified" path with the original Vector128 algorithm. There is still some platform-specific code (for AdvSimd), as it is difficult to rely fully on the Vector128 APIs without sacrificing too much performance. The main issue is that Arm has no instruction corresponding to Vector128.ExtractMostSignificantBits, so forcing it to produce the same mask format as the SSE2 algorithm is significantly slower. I have looked into using IndexOf, Count, etc., but they also rely on ExtractMostSignificantBits, so they pose the same problem.
This PR tries to encapsulate this difference in a few helper methods so they can share the same code path for the main algorithm.
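The platform split described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the helper name, lane width, and mask encoding are assumptions. It uses the well-known Arm trick of narrowing comparison lanes with `shrn` instead of emulating `ExtractMostSignificantBits`:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

internal static class MaskHelpers
{
    // Hypothetical helper: given a comparison result whose lanes are all
    // Zero or all AllBitsSet, return a scalar mask that is nonzero iff any
    // lane matched. The two branches deliberately produce *different* mask
    // formats, which is why the algorithm must only depend on properties
    // common to both (e.g. "is zero" / "is nonzero").
    internal static ulong GetMatchMask(Vector128<ushort> cmp)
    {
        if (AdvSimd.IsSupported)
        {
            // shrn #4: shift each 16-bit lane right by 4 and narrow to a
            // byte. Since each lane is 0x0000 or 0xFFFF, each resulting
            // byte is 0x00 or 0xFF, so the 64-bit scalar is nonzero iff
            // some lane matched. This is one cheap instruction, unlike an
            // emulated ExtractMostSignificantBits on Arm.
            return AdvSimd.ShiftRightLogicalNarrowingLower(cmp, 4)
                          .AsUInt64()
                          .ToScalar();
        }

        // Elsewhere, use the portable API; on x86 this lowers to a single
        // pmovmskb-style instruction producing one bit per lane.
        return cmp.ExtractMostSignificantBits();
    }
}
```

Callers that only ask "did anything match?" (`mask != 0`) can then share one code path across platforms, while mask-format-dependent steps (such as converting a trailing-zero count to an element index) stay behind per-platform helpers.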

Performance-wise, there is not much improvement, but the code should be easier to maintain.

Arm Neoverse-V2:

| Method | Input | Version | Mean | Error | Ratio |
|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | Before | 4.437 us | 0.0437 us | 1.000 |
| GetByteCount | EnglishAllAscii | After | 4.475 us | 0.1618 us | 1.009 |
| GetByteCount | EnglishMostlyAscii | Before | 20.387 us | 0.1744 us | 1.000 |
| GetByteCount | EnglishMostlyAscii | After | 19.941 us | 0.1079 us | 0.978 |
| GetByteCount | Chinese | Before | 9.145 us | 0.0072 us | 1.000 |
| GetByteCount | Chinese | After | 8.992 us | 0.0069 us | 0.983 |
| GetByteCount | Cyrillic | Before | 7.936 us | 0.0095 us | 1.000 |
| GetByteCount | Cyrillic | After | 7.812 us | 0.0056 us | 0.984 |
| GetByteCount | Greek | Before | 10.077 us | 0.0106 us | 1.000 |
| GetByteCount | Greek | After | 9.952 us | 0.0120 us | 0.988 |

Intel Sapphire Rapids:

| Method | Input | Version | Mean | Error | Ratio |
|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | Before | 8.144 us | 0.3398 us | 1.000 |
| GetByteCount | EnglishAllAscii | After | 8.126 us | 0.2759 us | 0.998 |
| GetByteCount | EnglishMostlyAscii | Before | 22.971 us | 0.4046 us | 1.000 |
| GetByteCount | EnglishMostlyAscii | After | 22.155 us | 0.9902 us | 0.964 |
| GetByteCount | Chinese | Before | 10.582 us | 0.3425 us | 1.000 |
| GetByteCount | Chinese | After | 10.048 us | 0.2135 us | 0.950 |
| GetByteCount | Cyrillic | Before | 9.222 us | 0.1874 us | 1.000 |
| GetByteCount | Cyrillic | After | 9.100 us | 0.2704 us | 0.987 |
| GetByteCount | Greek | Before | 11.802 us | 0.3551 us | 1.000 |
| GetByteCount | Greek | After | 11.224 us | 0.3505 us | 0.951 |

Combine the SSE2 codepath with a more generic Vector128 algorithm.

AdvSimd is handled slightly differently to avoid using
Vector128.ExtractMostSignificantBits, because Arm has no equivalent
instruction and performance would otherwise be very slow.
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Nov 26, 2025
@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Nov 26, 2025
@ylpoonlg (Contributor, Author):

cc @dotnet/arm64-contrib @a74nh @SwapnilGaikwad @tannergooding @EgorBo

@ylpoonlg ylpoonlg marked this pull request as ready for review November 26, 2025 10:15
```csharp
}

// Otherwise, encode the mask with 8 bits (one byte), where each bit represents one element.
return cmp.ExtractMostSignificantBits();
```
Member:

It seems that we call GetSurrogateMask on a LessThan input (hence, we have a guarantee that all elements are either Zero or AllBitsSet) -- we have a long-standing issue to optimize this in the JIT instead (to avoid special-casing all ExtractMostSignificantBits calls in managed code), cc @tannergooding

@ylpoonlg (Contributor, Author):

Would the current way of wrapping ExtractMostSignificantBits make it difficult for the JIT to detect the optimization? And is it an x86-specific optimization, or is there something similar we can do for Arm?

Member:

The managed implementations of these use ExtractMostSignificantBits, but they are marked [Intrinsic], so the JIT IR doesn't have to.

The general point of these APIs is the JIT IR can track we're doing an IndexOf operation and so it can directly rely on the inputs and similar, allowing us to generate better code than the naive ExtractMostSignificantBits impl would.

We should add that support for cases like these, if it's not optimized already, and/or for the common patterns involving ExtractMostSignificantBits (like PopCount(ExtractMostSignificantBits())) so they are doing the "good" thing.
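The pattern mentioned above typically appears in managed code like this (a minimal illustrative sketch, not code from this PR; the function name is hypothetical):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class MaskPatterns
{
    // "Count matching lanes" via the portable APIs. Because Equals yields
    // lanes that are all Zero or all AllBitsSet, a pattern-matching JIT
    // could in principle lower PopCount(ExtractMostSignificantBits(...))
    // without materializing the per-lane bit mask in a general register --
    // which is exactly the kind of special-casing discussed here.
    internal static int CountMatches(Vector128<byte> left, Vector128<byte> right)
    {
        Vector128<byte> cmp = Vector128.Equals(left, right);
        return BitOperations.PopCount(cmp.ExtractMostSignificantBits());
    }
}
```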

Contributor:

Looking at this, we're not planning to add intrinsics for ExtractMostSignificantBits and its patterns just yet, but we'll add them to the backlog.

In the meantime, is this PR still useful? It's a nice refactor as it reduces two distinct versions into a single algorithm, reducing code size and complexity. Plus it gives a small perf boost, ~2% overall. On the downside it's adding code churn for little gain. If you don't want this PR now, then we can close it and return to it once the intrinsics have been fixed.
