Skip to content

JIT: Graph-based loop inversion #116017

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

amanasifkhalid
Copy link
Member

@amanasifkhalid amanasifkhalid commented May 27, 2025

Succeeds #113709 (GitHub won't let me reopen it). Part of #107749 and #108901. Fixes #50204.

@Copilot Copilot AI review requested due to automatic review settings May 27, 2025 15:51
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 27, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@amanasifkhalid amanasifkhalid requested a review from Copilot May 27, 2025 15:54
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request implements graph-based loop inversion in the JIT, aiming to improve IR transformation behavior and diagnostics.

  • In fgopt.cpp, the logic now returns true to indicate IR modifications in edge cases, and error handling for stmt cloning has been shifted to an assertion.
  • In fgdiagnostic.cpp, a more descriptive debug message is added for loop exits with non-loop predecessors.
  • In compiler.h, the API is updated to use a more specific function (optTryInvertWhileLoop) with a refined parameter type.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/coreclr/jit/fgopt.cpp Updated control flow for IR modification detection and error handling
src/coreclr/jit/fgdiagnostic.cpp Enhanced loop diagnostic messaging
src/coreclr/jit/compiler.h Updated API for loop inversion with adjusted parameter type
Comments suppressed due to low confidence (2)

src/coreclr/jit/fgopt.cpp:2713

  • Changing the response for non-RelOp conditions from returning false to returning true alters the behavior of IR modifications. Verify that this change is intentional and that downstream code correctly handles the modified IR state.
if (!condTree->OperIsCompare())

src/coreclr/jit/compiler.h:7123

  • The API change from 'optInvertWhileLoop(BasicBlock* block)' to 'optTryInvertWhileLoop(FlowGraphNaturalLoop* loop)' requires ensuring that all call sites and semantic expectations are updated accordingly.
bool optTryInvertWhileLoop(FlowGraphNaturalLoop* loop);

@amanasifkhalid
Copy link
Member Author

In this iteration, I've kept the current phase ordering. Here are some metrics from win-x64 benchmarks.run_pgo (we don't have aspnet collected yet):

Base:

  • Loops found: 79281
  • Loops inverted: 33526
  • Loops cloned: 2925
  • Loops unrolled: 287
  • Loops IV widened: 8556
  • Widened IVs: 8559
  • Unused IVs removed: 15687
  • Downward-counted loops: 7484
  • Loops strength-reduced: 6401
  • RBO: 90909
  • Jumps threaded: 21734

Diff:

  • Loops found: 74986 (-4835)
  • Loops inverted: 39581 (+6055)
  • Loops cloned: 3023 (+98)
  • Loops unrolled: 287 (+0)
  • Loops IV widened: 8495 (-61)
  • Widened IVs: 8499 (-60)
  • Unused IVs removed: 15647 (-40)
  • Downward-counted loops: 7474 (-10)
  • Loops strength-reduced: 6389 (-12)
  • RBO: 90173 (-736)
  • Jumps threaded: 21705 (-29)

Diffs had large size increases locally, most of it coming from the size cost of loop inversion.

@amanasifkhalid
Copy link
Member Author

It's worth noting that switching over to the graph-based implementation isn't enough to close the gap for array vs list iteration de-abstraction:

| Method                                      | Mean     | Error   | StdDev  |
|-------------------------------------------- |---------:|--------:|--------:|
| foreach_member_array                        | 186.0 ns | 0.65 ns | 0.58 ns |
| foreach_member_array_via_local              | 185.6 ns | 0.54 ns | 0.50 ns |
| foreach_member_array_via_interface          | 184.0 ns | 0.20 ns | 0.18 ns |
| foreach_member_array_via_interface_property | 185.5 ns | 0.18 ns | 0.17 ns |
| foreach_member_list                         | 326.6 ns | 0.69 ns | 0.61 ns |
| foreach_member_list_via_interface           | 331.0 ns | 1.55 ns | 1.45 ns |

For this case, we need flowgraph simplification to run before loop inversion so that one of the loop condition's branches becomes an exit branch. So we're going to need #115850 at some point.

@amanasifkhalid
Copy link
Member Author

cc @dotnet/jit-contrib, @AndyAyersMS @jakobbotsch PTAL. Diffs are still large, though it seems less to do with runaway cloning this time. Both tweaking and moving loop inversion incurs large diffs, so I'm not sure how we want to proceed, if at all...

@amanasifkhalid
Copy link
Member Author

For this case, we need flowgraph simplification to run before loop inversion

Looking ahead, here are the diffs on win-x64 for moving inversion past flow opts, with this PR as the baseline:

Diffs are based on 2,721,962 contexts (1,064,836 MinOpts, 1,657,126 FullOpts).

MISSED contexts: base: 1,259 (0.05%), diff: 1,305 (0.05%)

Overall (+1,107,814 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 12,276,568 +4,300 +3.97%
benchmarks.run_pgo.windows.x64.checked.mch 65,200,060 +14,981 -0.28%
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch 11,715,990 +4,168 +4.76%
coreclr_tests.run.windows.x64.checked.mch 418,485,548 +663,013 +65.98%
libraries.crossgen2.windows.x64.checked.mch 38,576,226 +128,569 +21.34%
libraries.pmi.windows.x64.checked.mch 58,333,870 +17,067 -0.75%
libraries_tests.run.windows.x64.Release.mch 378,600,645 +27,051 +0.40%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 155,300,136 +237,977 +15.20%
realworld.run.windows.x64.checked.mch 11,739,437 +2,315 -4.38%
smoke_tests.nativeaot.windows.x64.checked.mch 5,086,720 +8,373 +5.64%
FullOpts (+1,107,814 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 12,275,866 +4,300 +3.97%
benchmarks.run_pgo.windows.x64.checked.mch 45,557,709 +14,981 -0.28%
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch 11,715,312 +4,168 +4.76%
coreclr_tests.run.windows.x64.checked.mch 129,252,446 +663,013 +65.98%
libraries.crossgen2.windows.x64.checked.mch 38,574,575 +128,569 +21.34%
libraries.pmi.windows.x64.checked.mch 58,221,085 +17,067 -0.75%
libraries_tests.run.windows.x64.Release.mch 174,050,484 +27,051 +0.40%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 144,253,898 +237,977 +15.20%
realworld.run.windows.x64.checked.mch 11,514,558 +2,315 -4.38%
smoke_tests.nativeaot.windows.x64.checked.mch 5,085,577 +8,373 +5.64%

Much of these size increases are driven by loop cloning (like in #115850). With cloning and unrolling disabled:

Diffs are based on 2,720,931 contexts (1,064,836 MinOpts, 1,656,095 FullOpts).

MISSED contexts: base: 2,297 (0.08%), diff: 2,336 (0.09%)

Base JIT options: JitCloneLoops=0;JitNoUnroll=1

Diff JIT options: JitCloneLoops=0;JitNoUnroll=1

Overall (+195,450 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 12,174,624 +3,509 +4.30%
benchmarks.run_pgo.windows.x64.checked.mch 64,589,923 +13,983 -0.26%
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch 11,615,950 +3,316 +5.21%
coreclr_tests.run.windows.x64.checked.mch 417,753,606 +59,262 +60.78%
libraries.crossgen2.windows.x64.checked.mch 38,491,688 +1,743 +10.48%
libraries.pmi.windows.x64.checked.mch 58,019,372 +6,301 -2.29%
libraries_tests.run.windows.x64.Release.mch 375,006,014 +22,383 +0.26%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 154,938,477 +78,238 +12.88%
realworld.run.windows.x64.checked.mch 11,662,301 +2,535 -5.00%
smoke_tests.nativeaot.windows.x64.checked.mch 5,049,163 +4,180 +0.32%
FullOpts (+195,450 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 12,173,922 +3,509 +4.30%
benchmarks.run_pgo.windows.x64.checked.mch 44,947,572 +13,983 -0.26%
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch 11,615,272 +3,316 +5.21%
coreclr_tests.run.windows.x64.checked.mch 128,520,504 +59,262 +60.78%
libraries.crossgen2.windows.x64.checked.mch 38,490,037 +1,743 +10.48%
libraries.pmi.windows.x64.checked.mch 57,906,587 +6,301 -2.29%
libraries_tests.run.windows.x64.Release.mch 170,455,853 +22,383 +0.26%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 143,892,239 +78,238 +12.88%
realworld.run.windows.x64.checked.mch 11,437,422 +2,535 -5.00%
smoke_tests.nativeaot.windows.x64.checked.mch 5,048,020 +4,180 +0.32%

@amanasifkhalid
Copy link
Member Author

As for this PR, diffs without cloning/unrolling are mostly unchanged:

Diffs are based on 2,720,993 contexts (1,064,836 MinOpts, 1,656,157 FullOpts).

MISSED contexts: base: 1,341 (0.05%), diff: 2,297 (0.08%)

Base JIT options: JitCloneLoops=0;JitNoUnroll=1

Diff JIT options: JitCloneLoops=0;JitNoUnroll=1

Overall (+1,214,442 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 12,155,711 +19,901 +13.17%
benchmarks.run_pgo.windows.x64.checked.mch 64,284,082 +309,378 +0.41%
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch 11,597,856 +18,700 +12.80%
coreclr_tests.run.windows.x64.checked.mch 417,633,534 +120,314 +2.42%
libraries.crossgen2.windows.x64.checked.mch 38,466,716 +25,360 +11.35%
libraries.pmi.windows.x64.checked.mch 57,978,442 +45,539 +11.43%
libraries_tests.run.windows.x64.Release.mch 374,462,113 +612,497 -1.68%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 154,907,244 +37,758 +12.10%
realworld.run.windows.x64.checked.mch 11,653,598 +11,805 +14.89%
smoke_tests.nativeaot.windows.x64.checked.mch 5,035,973 +13,190 +9.75%
FullOpts (+1,214,442 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 12,155,009 +19,901 +13.17%
benchmarks.run_pgo.windows.x64.checked.mch 44,641,731 +309,378 +0.41%
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch 11,597,178 +18,700 +12.80%
coreclr_tests.run.windows.x64.checked.mch 128,400,432 +120,314 +2.42%
libraries.crossgen2.windows.x64.checked.mch 38,465,065 +25,360 +11.35%
libraries.pmi.windows.x64.checked.mch 57,865,657 +45,539 +11.43%
libraries_tests.run.windows.x64.Release.mch 169,911,952 +612,497 -1.68%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 143,861,006 +37,758 +12.10%
realworld.run.windows.x64.checked.mch 11,428,719 +11,805 +14.89%
smoke_tests.nativeaot.windows.x64.checked.mch 5,034,830 +13,190 +9.75%

@jakobbotsch
Copy link
Member

cc @dotnet/jit-contrib, @AndyAyersMS @jakobbotsch PTAL. Diffs are still large, though it seems less to do with runaway cloning this time. Both tweaking and moving loop inversion incurs large diffs, so I'm not sure how we want to proceed, if at all...

What I am unsure about is if all these inversions are doing good things or not. One thing you can do is try to quirk the new inversions away and then slowly remove the quirks one by one to have a better chance of evaluating if things are looking good.

@amanasifkhalid
Copy link
Member Author

@AndyAyersMS here are the diffs for with/without size restrictions. It doesn't bring the size increases of this PR down by all that much, but the PerfScore improvements look promising:

Diffs are based on 2,721,532 contexts (1,064,836 MinOpts, 1,656,696 FullOpts).

MISSED contexts: base: 1,259 (0.05%), diff: 1,748 (0.06%)

Overall (-7,948 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 12,261,732 -635 -2.76%
benchmarks.run_pgo.windows.x64.checked.mch 64,935,836 -19,213 -0.92%
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch 11,700,772 -207 -1.47%
coreclr_tests.run.windows.x64.checked.mch 418,314,137 +48,834 -3.52%
libraries.crossgen2.windows.x64.checked.mch 38,576,614 -5,031 -7.25%
libraries.pmi.windows.x64.checked.mch 58,182,489 -2,011 -8.17%
libraries_tests.run.windows.x64.Release.mch 377,926,977 -33,954 -0.68%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 155,122,002 -245 +1.40%
realworld.run.windows.x64.checked.mch 11,672,433 +1,776 -0.61%
smoke_tests.nativeaot.windows.x64.checked.mch 5,086,720 +2,738 -10.45%
FullOpts (-7,948 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 12,261,030 -635 -2.76%
benchmarks.run_pgo.windows.x64.checked.mch 45,293,485 -19,213 -0.92%
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch 11,700,094 -207 -1.47%
coreclr_tests.run.windows.x64.checked.mch 129,081,035 +48,834 -3.52%
libraries.crossgen2.windows.x64.checked.mch 38,574,963 -5,031 -7.25%
libraries.pmi.windows.x64.checked.mch 58,069,704 -2,011 -8.17%
libraries_tests.run.windows.x64.Release.mch 173,376,816 -33,954 -0.68%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 144,075,764 -245 +1.40%
realworld.run.windows.x64.checked.mch 11,447,554 +1,776 -0.61%
smoke_tests.nativeaot.windows.x64.checked.mch 5,085,577 +2,738 -10.45%

@AndyAyersMS
Copy link
Member

@AndyAyersMS here are the diffs for with/without size restrictions. It doesn't bring the size increases of this PR down by all that much, but the PerfScore improvements look promising:

These numbers look much smaller than the numbers above. Is this just the delta vs no size restrictions?

@AndyAyersMS
Copy link
Member

Can you try a few different values for the size restriction? It could be most of the loops we're currently missing tend to be smaller (eg foreach)?

Curious what we see at say 100, 200, 300, or maybe can you plot the distribution of sizes?

@amanasifkhalid
Copy link
Member Author

Is this just the delta vs no size restrictions?

Yes, those diffs are against graph-based loop inversion without size restrictions.

Curious what we see at say 100, 200, 300, or maybe can you plot the distribution of sizes?

Sure, let me get something together.

@amanasifkhalid
Copy link
Member Author

amanasifkhalid commented May 30, 2025

@AndyAyersMS I tried the sizes you suggested, and there seems to be a noticeable drop-off between 200 and 100 nodes (about 100KB shaved off), at least in benchmarks.run_pgo:

image
image
image

I think we now have an aspnet collection; if you'd like, I can collect the same metrics for it.

@AndyAyersMS
Copy link
Member

I think we now have an aspnet collection; if you'd like, I can collect the same metrics for it.

Sure, if it's not too much trouble.

Also for the above maybe look at even smaller sizes? (say 80, 60)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Loop inversion should consider top-tested loops
3 participants