-
Notifications
You must be signed in to change notification settings - Fork 6.2k
DRAFT: Adaptive evac with surge #28955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This enables mixed evacuations to be more productive more quickly.
Before this commit, release build made reference to a variable that is only present in debug builds.
At init update refs, we transfer remaining memory from Collector and OldCollector partitions in order to allow more mutator allocations to succeed during update-refs. A conditional transfer is not reliable because soft_max_capacity() constraints may cause the transfer to fail.
Added tag jdk-25+13 for changeset 11a37c8
Added tag jdk-25+14 for changeset a347ecd
At the end of each GC cycle, recompute the live memory in each mixed-evac candidate region.
Code compiles but lots of jtreg failures...
Now passes hotspot jtreg tests
With the ability to surge old gc workers, we can reduce the default minimum time slide. Under maximum surge, it will be as if we are able to complete 150 ms of work within this 50 ms time slice.
Highlights: 1. Exclude surged phase metrics from history-based predictions of future phase times. 2. Attenuate surge levels. Be more willing to surge in small ways, less willing to surge in big ways. 3. Decrease minimum time slice for old to 80 ms from 100 ms. But we allow surging of old within this 80 ms, unlike when 100 ms time slice was established as default. 4. Do not predict_gc_time() in should_surge_phase() unless the phase is mark.
Previous configuration appears to be stable. But we're still observing suboptimal behavior. In particular, on the Extremem medium workload, we see three back-to-back old collections, where each is delayed until the previous old has finished its mixed evacuations and the coalesce-and-fill efforts that follow it. Independently, a study of specjbb 2015 suggests that G1 and ZGC are both much more relaxed about how much they allow Old-gen to expand without any effort to collect it. And this works well for their specjbb scores. So we want to be a bit more lazy about old collections. ShenandoahIgnoreOldGrowthBelowPercentage defaults to 40 (not 10) ShenandoahDoNotIgnoreGrowthAfterYoungCycles defaults to 500 (not 50) ShenandoahOldGrowthPercent defaults to 25 (not 12.5) On the same Extremem medium workload, we observe that when we are trying "desperately" to make progress on both old-gen and young-gen collections concurrently, the old is being starved and not making quick enough progress because it only gets 80 ms time slices between young cycles and it is not surging its own efforts. So I want to allow old to surge more quickly (not attenuate its surges), and I think it best to allow young to run more quickly when it is under duress (by not attenuating its surges). The intention is that if young can run more quickly, there will be an increase of idle time between young cycles, during which old can get more than its minimal time slice. The hope is that the combination of allowing the important and urgent old collection to start earlier (because it doesn't have to wait for a previous unnecessary and redundant old collection to finish) and more aggressive surging of workers once we start old will allow old to finish in a more timely and effective manner.
Be less eager to promote by shared allocation. Note that failure to promote is "not" harmless because we have not reserved for evacuation of these objects in young reserve. But it is slightly easier (usually) to redirect an evacuation to an existing mutator region than to flip an empty mutator region.
This configuration has demonstrated good results on the Extremem medium workload for one run. Will be testing more runs momentarily. 1. Use non-conservative (without adjustment by stdev) gc time prediction for OldEvacRatio computations. Otherwise, the predictions are so conservative as to "scare" the heuristic away from attempting mixed evacuations. We still use conservative calculations for triggering GC and for deciding whether to surge. 2. Calculate live at last old mark using SATB protocol for live at start of last mark so that we exclude all the floating garbage created during concurrent old marking. 3. Use adaptive old-gc time slice, starting with 150 ms and increasing by 50 ms after every 4 old-gen mark increments up to 10% of young cycle time.. 4. Feed anticipated promotion reserve into calculations of OldEvacRatio. 5. Add a new parameter ShenandoanMinimumOldGrowthHeapPercent, default value 5. This allows greater old-gen triggering sensitivity when old-gen is very large, without requiring excessive old triggers during initialization (while old-gen may be small).
A behavior manifest in this development branch is that there is greater variation in collection set evacuation workloads. When we happen to have very large evacuation workloads, more of the evacuation effort is handled by mutator threads, causing the available reserves to be scattered between a larger number of GCLABs and PLABs. The end results is that we are much more likely to experience OOM during evac. Plan to restore original default values when we implement a new OOM-during-evac protocol that does not require upgrade to Full GC.
| #include "gc/shenandoah/shenandoahSharedVariables.hpp" | ||
| #include "memory/allocation.hpp" | ||
| #include "utilities/numberSeq.hpp" | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes belong to phase-accounting
|
👋 Welcome back kdnilsen! A progress list of the required criteria for merging this PR into |
| virtual bool is_diagnostic() { return false; } | ||
| virtual bool is_experimental() { return false; } | ||
|
|
||
| virtual void record_phase_duration(ShenandoahMajorGCPhase p, double x, double duration) override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes belong to phase-accounting
|
❗ This change is not yet ready to be integrated. |
|
@kdnilsen this pull request can not be integrated into git checkout adaptive-evac-with-surge
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
| virtual void choose_collection_set_from_regiondata(ShenandoahCollectionSet* cset, | ||
| RegionData* data, size_t size, | ||
| size_t actual_free); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is declared by shenandoahHeuristics, overrides in shenandoahAdaptiveHeuristics. ShenandoahAdaptiveHeuristics uses this to set the start time for the mark phase.
Add override declaration.
This change belongs to phase-accounting
| return _phase_stats[ShenandoahMajorGCPhase::_final_roots].predict_at_without_stdev((double) anticipated_pip_words); | ||
| } | ||
|
|
||
| uint ShenandoahYoungHeuristics::should_surge_phase(ShenandoahMajorGCPhase phase, double now) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This belongs in worker-surge
worker-surge depends on phase-accounting
| return MIN3(evac_slack_spiking, evac_slack_avg, evac_min_threshold); | ||
| } | ||
|
|
||
| void ShenandoahYoungHeuristics::adjust_old_evac_ratio(size_t old_cset_regions, size_t young_cset_regions, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block of code belongs in adjust-old-evac-ratio
|
|
||
| ShenandoahYoungHeuristics::ShenandoahYoungHeuristics(ShenandoahYoungGeneration* generation) | ||
| : ShenandoahGenerationalHeuristics(generation) { | ||
| : ShenandoahGenerationalHeuristics(generation), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These new initializers belong to phase-accounting
| size_t cur_young_garbage = add_preselected_regions_to_collection_set(cset, data, size); | ||
|
|
||
| choose_young_collection_set(cset, data, size, actual_free, cur_young_garbage); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe most of this code can be discarded. Need to carefully read through the changes.
I believe I want to preserve only the following, as part of phase-accounting:
- Rather than expediting mixed evacuations, use the phase-accounting prediction of GC time to trigger mixed evacuation at the appropriate time.
- Enhance this subsequently, when we have adaptive-old-evac-ratio
| // Checks that an old cycle has run for at least ShenandoahMinimumOldTimeMs before allowing a young cycle. | ||
| if (ShenandoahMinimumOldTimeMs > 0) { | ||
| #undef KELVIN_START_YOUNG | ||
| #ifdef KELVIN_START_YOUNG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is some code here that relates to adjustable minimum time for old marking. Make this part of the worker-surge. (Increasing the time slice for old may depend on being able to surge the workers during the "short-changed young" time slice.
| if (ShenandoahAdaptiveHeuristics::should_start_gc()) { | ||
| return true; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm skeptical of this code. Test and probably delete. (There have been lots of improvements to old triggering heuristics since this proof of concept was last demonstrated.)
This is a proof of concept implementation that passed tests long ago. It has not been updated. I am creating this DRAFT pull request to allow me to study the changes that had been implemented at the time. I plan to divide the contributions of this branch into three distinct pull requets:
Progress
Warnings
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28955/head:pull/28955$ git checkout pull/28955Update a local copy of the PR:
$ git checkout pull/28955$ git pull https://git.openjdk.org/jdk.git pull/28955/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28955View PR using the GUI difftool:
$ git pr show -t 28955Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28955.diff