DRAFT: Adaptive evac with surge #28955

kdnilsen · 2025-12-22T14:47:33Z

This is a proof of concept implementation that passed tests long ago. It has not been updated. I am creating this DRAFT pull request to allow me to study the changes that had been implemented at the time. I plan to divide the contributions of this branch into three distinct pull requets:

phase-accounting
adaptive-evac-ratio
worker-surge

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Warnings

⚠️ Patch contains a binary file (src/java.base/share/classes/jdk/internal/icu/impl/data/icudt76b/nfc.nrm)
⚠️ Patch contains a binary file (src/java.base/share/classes/jdk/internal/icu/impl/data/icudt76b/nfkc.nrm)
⚠️ Patch contains a binary file (src/java.base/share/classes/jdk/internal/icu/impl/data/icudt76b/ubidi.icu)
⚠️ Patch contains a binary file (src/java.base/share/classes/jdk/internal/icu/impl/data/icudt76b/uprops.icu)
⚠️ Patch contains a binary file (src/java.desktop/share/classes/java/awt/doc-files/BorderLayout-1.gif)
⚠️ Patch contains a binary file (src/java.desktop/share/classes/java/awt/doc-files/FlowLayout-1.gif)
⚠️ Patch contains a binary file (src/java.desktop/share/classes/java/awt/doc-files/GridBagLayout-1.gif)
⚠️ Patch contains a binary file (src/java.desktop/share/classes/java/awt/doc-files/GridBagLayout-2.gif)
⚠️ Patch contains a binary file (src/java.desktop/share/classes/java/awt/doc-files/GridLayout-1.gif)
⚠️ Patch contains a binary file (src/java.desktop/share/classes/java/awt/doc-files/GridLayout-2.gif)
⚠️ Patch contains a binary file (src/java.desktop/share/classes/javax/swing/beaninfo/images/JAppletColor16.gif)
⚠️ Patch contains a binary file (src/java.desktop/share/classes/javax/swing/beaninfo/images/JAppletColor32.gif)
⚠️ Patch contains a binary file (src/java.desktop/share/classes/javax/swing/beaninfo/images/JAppletMono16.gif)
⚠️ Patch contains a binary file (src/java.desktop/share/classes/javax/swing/beaninfo/images/JAppletMono32.gif)
⚠️ Patch contains a binary file (src/java.desktop/share/classes/javax/swing/doc-files/JRootPane-1.gif)
⚠️ Patch contains a binary file (test/hotspot/jtreg/runtime/ClassFile/JsrRewritingTestCase.jar)
⚠️ Patch contains a binary file (test/hotspot/jtreg/runtime/ClassFile/testcase.jar)
⚠️ Patch contains a binary file (test/jdk/javax/net/ssl/HttpsURLConnection/crisubn.jks)
⚠️ Patch contains a binary file (test/jdk/javax/net/ssl/HttpsURLConnection/trusted.jks)
⚠️ Patch contains a binary file (test/jdk/javax/rmi/ssl/keystore)
⚠️ Patch contains a binary file (test/jdk/javax/rmi/ssl/truststore)
⚠️ Patch contains a binary file (test/jdk/jdk/internal/loader/URLClassPath/testclasses.jar)
⚠️ Patch contains a binary file (test/jdk/sun/net/www/protocol/https/HttpsClient/dnsstore)
⚠️ Patch contains a binary file (test/jdk/sun/net/www/protocol/https/HttpsClient/ipstore)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28955/head:pull/28955
$ git checkout pull/28955

Update a local copy of the PR:
$ git checkout pull/28955
$ git pull https://git.openjdk.org/jdk.git pull/28955/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28955

View PR using the GUI difftool:
$ git pr show -t 28955

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28955.diff

This enables mixed evacuations to be more productive more quickly.

Before this commit, release build made reference to a variable that is only present in debug builds.

At init update refs, we transfer remaining memory from Collector and OldCollector partitions in order to allow more mutator allocations to succeed during update-refs. A conditional transfer is not reliable because soft_max_capacity() constraints may cause the transfer to fail.

Added tag jdk-25+13 for changeset 11a37c8

Added tag jdk-25+14 for changeset a347ecd

At the end of each GC cycle, recompute the live memory in each mixed-evac candidate region.

Code compiles but lots of jtreg failures...

Now passes hotspot jtreg tests

With the ability to surge old gc workers, we can reduce the default minimum time slide. Under maximum surge, it will be as if we are able to complete 150 ms of work within this 50 ms time slice.

Highlights: 1. Exclude surged phase metrics from history-based predictions of future phase times. 2. Attenuate surge levels. Be more willing to surge in small ways, less willing to surge in big ways. 3. Decrease minimum time slice for old to 80 ms from 100 ms. But we allow surging of old within this 80 ms, unlike when 100 ms time slice was established as default. 4. Do not predict_gc_time() in should_surge_phase() unless the phase is mark.

Previous configuration appears to be stable. But we're still observing suboptimal behavior. In particular, on the Extremem medium workload, we see three back-to-back old collections, where each is delayed until the previous old has finished its mixed evacuations and the coalesce-and-fill efforts that follow it. Independently, a study of specjbb 2015 suggests that G1 and ZGC are both much more relaxed about how much they allow Old-gen to expand without any effort to collect it. And this works well for their specjbb scores. So we want to be a bit more lazy about old collections. ShenandoahIgnoreOldGrowthBelowPercentage defaults to 40 (not 10) ShenandoahDoNotIgnoreGrowthAfterYoungCycles defaults to 500 (not 50) ShenandoahOldGrowthPercent defaults to 25 (not 12.5) On the same Extremem medium workload, we observe that when we are trying "desperately" to make progress on both old-gen and young-gen collections concurrently, the old is being starved and not making quick enough progress because it only gets 80 ms time slices between young cycles and it is not surging its own efforts. So I want to allow old to surge more quickly (not attenuate its surges), and I think it best to allow young to run more quickly when it is under duress (by not attenuating its surges). The intention is that if young can run more quickly, there will be an increase of idle time between young cycles, during which old can get more than its minimal time slice. The hope is that the combination of allowing the important and urgent old collection to start earlier (because it doesn't have to wait for a previous unnecessary and redundant old collection to finish) and more aggressive surging of workers once we start old will allow old to finish in a more timely and effective manner.

Be less eager to promote by shared allocation. Note that failure to promote is "not" harmless because we have not reserved for evacuation of these objects in young reserve. But it is slightly easier (usually) to redirect an evacuation to an existing mutator region than to flip an empty mutator region.

This configuration has demonstrated good results on the Extremem medium workload for one run. Will be testing more runs momentarily. 1. Use non-conservative (without adjustment by stdev) gc time prediction for OldEvacRatio computations. Otherwise, the predictions are so conservative as to "scare" the heuristic away from attempting mixed evacuations. We still use conservative calculations for triggering GC and for deciding whether to surge. 2. Calculate live at last old mark using SATB protocol for live at start of last mark so that we exclude all the floating garbage created during concurrent old marking. 3. Use adaptive old-gc time slice, starting with 150 ms and increasing by 50 ms after every 4 old-gen mark increments up to 10% of young cycle time.. 4. Feed anticipated promotion reserve into calculations of OldEvacRatio. 5. Add a new parameter ShenandoanMinimumOldGrowthHeapPercent, default value 5. This allows greater old-gen triggering sensitivity when old-gen is very large, without requiring excessive old triggers during initialization (while old-gen may be small).

A behavior manifest in this development branch is that there is greater variation in collection set evacuation workloads. When we happen to have very large evacuation workloads, more of the evacuation effort is handled by mutator threads, causing the available reserves to be scattered between a larger number of GCLABs and PLABs. The end results is that we are much more likely to experience OOM during evac. Plan to restore original default values when we implement a new OOM-during-evac protocol that does not require upgrade to Full GC.

…ger threshold

kdnilsen · 2025-12-22T14:48:23Z

src/hotspot/share/gc/shenandoah/heuristics/shenandoahAdaptiveHeuristics.hpp

 #include "gc/shenandoah/shenandoahSharedVariables.hpp"
 #include "memory/allocation.hpp"
 #include "utilities/numberSeq.hpp"



These changes belong to phase-accounting

bridgekeeper · 2025-12-22T14:49:04Z

👋 Welcome back kdnilsen! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

kdnilsen · 2025-12-22T14:49:22Z

src/hotspot/share/gc/shenandoah/heuristics/shenandoahAdaptiveHeuristics.hpp

  virtual bool is_diagnostic()   { return false; }
  virtual bool is_experimental() { return false; }

+  virtual void record_phase_duration(ShenandoahMajorGCPhase p, double x, double duration) override;


These changes belong to phase-accounting

openjdk · 2025-12-22T14:50:11Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2025-12-22T14:52:13Z

@kdnilsen this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout adaptive-evac-with-surge
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk · 2025-12-22T14:52:55Z

@kdnilsen The following labels will be automatically applied to this pull request:

hotspot-gc
shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

kdnilsen · 2025-12-22T15:02:53Z

src/hotspot/share/gc/shenandoah/heuristics/shenandoahAdaptiveHeuristics.hpp

  virtual void choose_collection_set_from_regiondata(ShenandoahCollectionSet* cset,
                                                     RegionData* data, size_t size,
                                                     size_t actual_free);



This is declared by shenandoahHeuristics, overrides in shenandoahAdaptiveHeuristics. ShenandoahAdaptiveHeuristics uses this to set the start time for the mark phase.

Add override declaration.

This change belongs to phase-accounting

kdnilsen · 2025-12-22T15:08:31Z