Conversation

@ylpoonlg (Contributor):

Re-attempt at #121383.

Refactor the vectorized code path by combining the SSE2 "intrinsified" path with the original Vector128 algorithm. There is still some platform-specific code (for AdvSimd), as it is difficult to rely fully on the Vector128 APIs without sacrificing too much performance. The main issue is that Arm has no instruction corresponding to Vector128.ExtractMostSignificantBits, so forcing it to produce the same mask format as the SSE2 algorithm is significantly slower. I have looked into using IndexOf, Count, etc., but they also rely on ExtractMostSignificantBits, so they pose the same problem.
This PR tries to encapsulate this difference in a few helper methods so they can share the same code path for the main algorithm.
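The platform split described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the helper name, lane width, and mask encoding are assumptions. It uses the well-known Arm trick of narrowing comparison lanes with `shrn` instead of emulating `ExtractMostSignificantBits`:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

internal static class MaskHelpers
{
    // Hypothetical helper: given a comparison result whose lanes are all
    // Zero or all AllBitsSet, return a scalar mask that is nonzero iff any
    // lane matched. The two branches deliberately produce *different* mask
    // formats, which is why the algorithm must only depend on properties
    // common to both (e.g. "is zero" / "is nonzero").
    internal static ulong GetMatchMask(Vector128<ushort> cmp)
    {
        if (AdvSimd.IsSupported)
        {
            // shrn #4: shift each 16-bit lane right by 4 and narrow to a
            // byte. Since each lane is 0x0000 or 0xFFFF, each resulting
            // byte is 0x00 or 0xFF, so the 64-bit scalar is nonzero iff
            // some lane matched. This is one cheap instruction, unlike an
            // emulated ExtractMostSignificantBits on Arm.
            return AdvSimd.ShiftRightLogicalNarrowingLower(cmp, 4)
                          .AsUInt64()
                          .ToScalar();
        }

        // Elsewhere, use the portable API; on x86 this lowers to a single
        // pmovmskb-style instruction producing one bit per lane.
        return cmp.ExtractMostSignificantBits();
    }
}
```

Callers that only ask "did anything match?" (`mask != 0`) can then share one code path across platforms, while mask-format-dependent steps (such as converting a trailing-zero count to an element index) stay behind per-platform helpers.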

Performance-wise, there is not much improvement, but the code should be easier to maintain.

Arm Neoverse-V2:

| Method | Input | Version | Mean | Error | Ratio |
|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | Before | 4.437 us | 0.0437 us | 1.000 |
| GetByteCount | EnglishAllAscii | After | 4.475 us | 0.1618 us | 1.009 |
| GetByteCount | EnglishMostlyAscii | Before | 20.387 us | 0.1744 us | 1.000 |
| GetByteCount | EnglishMostlyAscii | After | 19.941 us | 0.1079 us | 0.978 |
| GetByteCount | Chinese | Before | 9.145 us | 0.0072 us | 1.000 |
| GetByteCount | Chinese | After | 8.992 us | 0.0069 us | 0.983 |
| GetByteCount | Cyrillic | Before | 7.936 us | 0.0095 us | 1.000 |
| GetByteCount | Cyrillic | After | 7.812 us | 0.0056 us | 0.984 |
| GetByteCount | Greek | Before | 10.077 us | 0.0106 us | 1.000 |
| GetByteCount | Greek | After | 9.952 us | 0.0120 us | 0.988 |

Intel Sapphire Rapids:

| Method | Input | Version | Mean | Error | Ratio |
|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | Before | 8.144 us | 0.3398 us | 1.000 |
| GetByteCount | EnglishAllAscii | After | 8.126 us | 0.2759 us | 0.998 |
| GetByteCount | EnglishMostlyAscii | Before | 22.971 us | 0.4046 us | 1.000 |
| GetByteCount | EnglishMostlyAscii | After | 22.155 us | 0.9902 us | 0.964 |
| GetByteCount | Chinese | Before | 10.582 us | 0.3425 us | 1.000 |
| GetByteCount | Chinese | After | 10.048 us | 0.2135 us | 0.950 |
| GetByteCount | Cyrillic | Before | 9.222 us | 0.1874 us | 1.000 |
| GetByteCount | Cyrillic | After | 9.100 us | 0.2704 us | 0.987 |
| GetByteCount | Greek | Before | 11.802 us | 0.3551 us | 1.000 |
| GetByteCount | Greek | After | 11.224 us | 0.3505 us | 0.951 |

Combine the SSE2 codepath with a more generic Vector128 algorithm.

AdvSimd is handled slightly differently to avoid using
Vector128.ExtractMostSignificantBits, because Arm has no equivalent
instruction and performance would otherwise be very slow.
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Nov 26, 2025
@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Nov 26, 2025
@ylpoonlg (Contributor, Author):

cc @dotnet/arm64-contrib @a74nh @SwapnilGaikwad @tannergooding @EgorBo

@ylpoonlg ylpoonlg marked this pull request as ready for review November 26, 2025 10:15
```csharp
}

// Otherwise, encode the mask with 8 bits (one byte), where each bit represents one element.
return cmp.ExtractMostSignificantBits();
```
Member:

It seems that we call GetSurrogateMask on a LessThan input (hence, we have a guarantee that all elements are either Zero or AllBitsSet) -- we have a long-standing issue to optimize this in the JIT instead (to avoid special-casing all ExtractMostSignificantBits calls in managed code), cc @tannergooding

@ylpoonlg (Contributor, Author):

Would the current way of wrapping ExtractMostSignificantBits make it difficult for the JIT to detect the optimization? And is it an x86-specific optimization, or is there something similar we can do for Arm?

Member:

The managed implementations of these use ExtractMostSignificantBits, but they are marked [Intrinsic], so the JIT IR doesn't have to.

The general point of these APIs is the JIT IR can track we're doing an IndexOf operation and so it can directly rely on the inputs and similar, allowing us to generate better code than the naive ExtractMostSignificantBits impl would.

We should add that support for cases like these, if it's not optimized already, and/or for the common patterns involving ExtractMostSignificantBits (like PopCount(ExtractMostSignificantBits())) so they are doing the "good" thing.
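The pattern mentioned above typically appears in managed code like this (a minimal illustrative sketch, not code from this PR; the function name is hypothetical):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class MaskPatterns
{
    // "Count matching lanes" via the portable APIs. Because Equals yields
    // lanes that are all Zero or all AllBitsSet, a pattern-matching JIT
    // could in principle lower PopCount(ExtractMostSignificantBits(...))
    // without materializing the per-lane bit mask in a general register --
    // which is exactly the kind of special-casing discussed here.
    internal static int CountMatches(Vector128<byte> left, Vector128<byte> right)
    {
        Vector128<byte> cmp = Vector128.Equals(left, right);
        return BitOperations.PopCount(cmp.ExtractMostSignificantBits());
    }
}
```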

Contributor:

Looking at this, we're not planning to add intrinsics for ExtractMostSignificantBits and its patterns just yet, but we'll add them to the backlog.

In the meantime, is this PR still useful? It's a nice refactor as it reduces two distinct versions into a single algorithm, reducing code size and complexity. Plus it gives a small perf boost, ~2% overall. On the downside it's adding code churn for little gain. If you don't want this PR now, then we can close it and return to it once the intrinsics have been fixed.
