
Dynamically bypass selector polling if no I/O events are present #4377


Open · wants to merge 12 commits into base: series/3.6.x

Conversation

@djspiewak (Member) commented Apr 13, 2025

With this change, every time we park a thread we check whether the underlying polling system needsPoll, a hint which is intended to be implemented by checking whether any events have been registered with that polling system. If it returns false, we don't bother hitting the poller at all and instead park the worker thread the old-fashioned way. This has some concurrent coordination implications which make the whole thing a bit more complex and introduce a tiny spin wait, but it is otherwise generally okay.
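
For illustration, here is a minimal sketch of that parking decision (the PollingSystem trait and the park helper below are simplified stand-ins for the real runtime internals, not the actual WorkerThread code):

import java.util.concurrent.locks.LockSupport

// hypothetical stand-in for the real polling system interface
trait PollingSystem {
  type Poller
  def needsPoll(poller: Poller): Boolean
  def poll(poller: Poller, nanos: Long): Boolean
}

object ParkSketch {
  def park(system: PollingSystem)(poller: system.Poller): Unit =
    if (system.needsPoll(poller)) {
      // events are registered: block in the polling system so that I/O
      // completion (or a timeout) can wake this worker
      system.poll(poller, -1L)
      ()
    } else {
      // nothing registered: skip the poller entirely and suspend the thread
      // the old-fashioned way
      LockSupport.park()
    }
}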

This should bring thread suspension performance for applications which are not using PollingSystem back very close to what it was in 3.5.x. The needsPoll call does have some overhead, and the more complex suspension state does have a cost, but it shouldn't be that severe. I'm traveling so I haven't had a chance to run benchmarks yet but that's next on my todo list.

Fixes #4328

@djspiewak (Member, Author)

Released as 3.6.1-25-d807be0 if anyone would like to test something

@djspiewak djspiewak linked an issue Apr 13, 2025 that may be closed by this pull request
@djspiewak djspiewak added this to the v3.6.next milestone Apr 13, 2025
@djspiewak (Member, Author)

DispatcherBenchmark is the biggest change, regressing 25%. parTraverse regressed slightly, but so did runnableSchedulingScalaGlobal (which is more of a concurrency baseline), and parTraverse regressed by less, so let's call that one a tentative win. traverse (a non-concurrency baseline) is equivalent. All the main WSTP benchmarks are essentially within the margin of error.

I'd like to see the real-world numbers on this one, but the microbenchmarks seem to give their blessing, insofar as they go.

Before

[info] Benchmark                                             (cpuTokens)   (size)   Mode  Cnt     Score     Error    Units
[info] DispatcherBenchmark.scheduling                                N/A     1000  thrpt   20   169.751 ± 100.207  ops/min
[info] ParallelBenchmark.parTraverse                               10000     1000  thrpt   20   543.307 ±   1.551    ops/s
[info] ParallelBenchmark.traverse                                  10000     1000  thrpt   20    50.845 ±   0.015    ops/s
[info] WorkStealingBenchmark.alloc                                   N/A  1000000  thrpt   20     9.510 ±   0.049  ops/min
[info] WorkStealingBenchmark.manyThreadsSchedulingBenchmark          N/A  1000000  thrpt   20    12.188 ±   0.822  ops/min
[info] WorkStealingBenchmark.runnableScheduling                      N/A  1000000  thrpt   20  1529.989 ±  38.601  ops/min
[info] WorkStealingBenchmark.runnableSchedulingScalaGlobal           N/A  1000000  thrpt   20  1746.552 ±   9.345  ops/min
[info] WorkStealingBenchmark.scheduling                              N/A  1000000  thrpt   20    13.742 ±   0.398  ops/min

After

Note that the global EC is 4% slower in this run, but traverse is essentially within the margin of error. That indicates that we would generally expect anything concurrency-related to have a lower baseline in this run than in the previous one. Good old EC2!

[info] Benchmark                                             (cpuTokens)   (size)   Mode  Cnt     Score    Error    Units
[info] DispatcherBenchmark.scheduling                                N/A     1000  thrpt   20   125.216 ± 15.753  ops/min
[info] ParallelBenchmark.parTraverse                               10000     1000  thrpt   20   529.290 ± 26.988    ops/s
[info] ParallelBenchmark.traverse                                  10000     1000  thrpt   20    50.864 ±  0.008    ops/s
[info] WorkStealingBenchmark.alloc                                   N/A  1000000  thrpt   20     9.499 ±  0.036  ops/min
[info] WorkStealingBenchmark.manyThreadsSchedulingBenchmark          N/A  1000000  thrpt   20    12.800 ±  0.604  ops/min
[info] WorkStealingBenchmark.runnableScheduling                      N/A  1000000  thrpt   20  1539.262 ± 13.460  ops/min
[info] WorkStealingBenchmark.runnableSchedulingScalaGlobal           N/A  1000000  thrpt   20  1683.373 ± 25.941  ops/min
[info] WorkStealingBenchmark.scheduling                              N/A  1000000  thrpt   20    13.744 ±  0.501  ops/min

val polling = st eq ParkedSignal.ParkedPolling
val simple = st eq ParkedSignal.ParkedSimple

if ((polling || simple) && signal.compareAndSet(st, ParkedSignal.Interrupting)) {
Member:

The semantics here are different from the old implementation: getAndSet involves a loop, but there is no longer a loop here.

In the (admittedly unlikely) scenario that a worker thread transitions from ParkedPolling to ParkedSimple or vice versa while evaluating this condition, we would prematurely give up on attempting to wake up this thread. We might consider looping if the compareAndSet fails.
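
A hypothetical sketch of that looping alternative (the ParkedSignal values below are stand-ins, not the actual WorkerThread state):

import java.util.concurrent.atomic.AtomicReference

object InterruptSketch {
  sealed trait ParkedSignal
  case object ParkedPolling extends ParkedSignal
  case object ParkedSimple extends ParkedSignal
  case object Interrupting extends ParkedSignal
  case object Unparked extends ParkedSignal

  // returns the parked state we claimed, or None if the thread is not parked
  @annotation.tailrec
  def tryInterrupt(signal: AtomicReference[ParkedSignal]): Option[ParkedSignal] = {
    val st = signal.get()
    val parked = (st eq ParkedPolling) || (st eq ParkedSimple)
    if (!parked)
      None // not parked (or already being interrupted): nothing to claim
    else if (signal.compareAndSet(st, Interrupting))
      Some(st) // we won the race and know exactly how the thread is parked
    else
      tryInterrupt(signal) // the state flipped under us (e.g. polling -> simple): retry
  }
}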

Member (Author):

I think that might not be worth it. Going from ParkedPolling to ParkedSimple would require that, between the get() above and the CAS, the other thread wake itself up (or be awakened by another notify), run through its whole state machine, pick up a fiber, execute that fiber through the whole machinery, run out of work, go through the whole "check all the things for work" pass (which includes doing the very thing we're trying to notify it to do here!), and then finally go back and park again, all in the space of just these few lines.

Aside from the fact that this race condition is almost inconceivable, it effectively just results in a loss of performance since we give up on the thread and go wake another one. The work item we're trying to notify on would actually get picked up (before the target thread resuspends), and we'll ultimately wake the same number of threads.

Compare this to having an actual CAS loop, which would require more complexity here and, I think, an extra conditional jump, despite the fact that it's effectively never going to be hit and wouldn't matter even if it did. So I think I'd rather just eat the CAS failure.

Member:

Aside from the fact that this race condition is almost inconceivable, it effectively just results in a loss of performance since we give up on the thread and go wake another one.

Sure ... I am just worried about relying on the existence of another thread. For example, if there is exactly one worker thread in the runtime, could we end up in a deadlock?

I have thought through it a few times and I have not come up with a scenario yet that could deadlock, but I don't feel confident.

Member (Author):

If there's only one worker then that worker is either awake already (in which case we can't have entered this conditional) or it's parked and we're notifying it (in which case all the normal rules apply).

@armanbilge (Member) commented Apr 15, 2025

or it's parked and we're notifying it (in which case all the normal rules apply).

Right, so if all the normal rules are applying ...

Aside from the fact that this race condition is almost inconceivable, it effectively just results in a loss of performance since we give up on the thread and go wake another one.

What happens if we fail the CAS due to the race condition, we don't loop, and then we don't go to the next thread because there is no next thread? We've given up on our one-and-only thread.

Member:

Your words precisely capture my feelings about this:

Now, I think you're probably right that nothing is lost here since the thread will both check the external queue and attempt to steal before going back to sleep, but it's hard for me to 100% convince myself that this is safe in all circumstances. This is very similar to the multi-consumer queue cases, which are genuinely insanely subtle and we've been bitten quite a bit with lost notifications. Speaking more generally, we tend to always bias notifications in favor of over-notifying rather than under-notifying, since the worst case in the former is a bit less performance, while the worst case in the latter is a deadlock.

@djspiewak (Member, Author)

So my interpretation of the results from #4328 is that this is indeed an improvement, and we likely have some more work to do on Selector constant factor overhead.

@durban (Contributor) commented Apr 14, 2025

I'm probably missing something, but do we really need the ParkedSignal.Interrupting state? What does the spinlock actually protect? (Spurious wakeups must already be handled.)

(Btw, I've restarted the "windows-latest, 2.13.15, temurin@17, ciJVM" job, because it timed out. But I'm wondering if it was a real failure.)

@armanbilge (Member) left a comment

I'm probably missing something, but do we really need the ParkedSignal.Interrupting state? What does the spinlock actually protect? (Spurious wakeups must already be handled.)

@durban My understanding is this protects us against the case where the interruptor reads that the worker thread is in the "polling" or "simple" state, but by the time it invokes the appropriate interrupt action, the worker thread is parked in a different state (and thus the interrupt is ineffective).


Comment on lines 755 to 761
// just create a bit of chaos with timers and async completion
val sleeps = 0.until(10).map(i => IO.sleep((i * 10).millis)).toList

val latch = IO.deferred[Unit].flatMap(d => d.complete(()).start *> d.get)
val latches = 0.until(10).map(_ => latch).toList

val test = (sleeps ::: latches).parSequence.parReplicateA_(100)
Member:

I think we need a bit more chaos: specifically, exercising the external queue somehow.

Also, I wonder if we should have a version of this test where the runtime has exactly one worker; that's how we've caught various bugs before.

Member (Author):

I'm not averse to running a version of the test with only one worker. I'll try tickling the external queue first.

@durban (Contributor) commented Apr 14, 2025

Okay, so the way to wake someone up is (without this PR) essentially:

if (signal.getAndSet(false)) system.interrupt(...)

With the 2 different parking states it could be:

val old = signal.getAndSet(Unparked)
// HERE
if (old eq ParkedPolling) system.interrupt(...)
else if (old eq ParkedSimple) LockSupport.unpark(...)

And yes, it could happen that at the point marked with HERE above the target thread wakes up, then goes back to sleep in a different way. But is that a problem? We wanted to wake it up, and... it did! (And sure, then we might make a wrong unpark call, but that should be harmless(?).)

(Again, I might be missing something, I just started thinking about this...)

def parkLoop(needsPoll: Boolean): Boolean = {
if (done.get()) {
false
} else {
// Park the thread until further notice.
val start = System.nanoTime()
metrics.incrementPolledCount()

Shouldn't this metric be incremented only if needsPoll == true?

Member (Author):

Good catch! Yes.
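
For concreteness, a sketch of the guarded increment implied by this exchange (assuming the parkLoop(needsPoll: Boolean) shape quoted above; not the actual commit):

// only count a poll when the worker actually blocks on the poller
if (needsPoll) {
  metrics.incrementPolledCount()
}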

@djspiewak (Member, Author)

@durban I think we need Interrupting for two reasons.

First, when the WorkerThread is awakened by either a timer or I/O event, it must ensure that WorkStealingThreadPool#doneSleeping() is invoked. This is distinct from the situation where the thread is awakened by the pool itself, when doneSleeping() is invoked directly from the notifying side. Checking for Interrupting allows us to disambiguate this situation and ensure that exactly one call is performed, regardless of whether we were awakened by the kernel or the pool. In a sense, this is actually a three-way race condition, and the kernel is one of the legs.

The second reason here is the situation @armanbilge outlined: the pool decides to awaken a thread which is parked polling, that thread wakes on its own, does its thing, goes all the way through its state machine, and then parks again without polling before the pool has the ability to wake it. In this situation, a notification ends up being "lost". Now, I think you're probably right that nothing is lost here since the thread will both check the external queue and attempt to steal before going back to sleep, but it's hard for me to 100% convince myself that this is safe in all circumstances. This is very similar to the multi-consumer queue cases, which are genuinely insanely subtle and we've been bitten quite a bit with lost notifications. Speaking more generally, we tend to always bias notifications in favor of over-notifying rather than under-notifying, since the worst case in the former is a bit less performance, while the worst case in the latter is a deadlock.

I'm not saying we shouldn't try to reason through this, and if we can find a way to remove the Interrupting state we absolutely should do it, but I couldn't figure out a satisfying way around the first problem, and the second problem still seems scary even if not immediately obviously fatal.
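
As a rough model of the first point (reusing the ParkedSignal stand-ins from the sketch earlier in this thread; this is only an illustration of the described handshake, not the actual implementation), the kernel-woken side and the notifying side race on a single CAS, and whoever wins takes responsibility for doneSleeping():

object DoneSleepingSketch {
  import java.util.concurrent.atomic.AtomicReference
  import InterruptSketch._

  // called by a worker that was woken by the kernel (a timer or an I/O event)
  def onKernelWakeup(
      signal: AtomicReference[ParkedSignal],
      doneSleeping: () => Unit): Unit =
    if (signal.compareAndSet(ParkedPolling, Unparked)) {
      // nobody notified us, so the single doneSleeping() call is ours to make
      doneSleeping()
    } else {
      // a notifier won the race: it moved us to Interrupting and performs
      // doneSleeping() on its side, so we must not call it again; spin briefly
      // until the notifier finishes (the "tiny spin wait" from the description)
      while (signal.get() eq Interrupting) Thread.onSpinWait()
    }
}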

@iRevive (Contributor) commented Apr 17, 2025

I tried 3.6.1-25-d807be0; here are the results:

CPU Usage

[CPU usage graph]

Although the graph shows the mean value as 20%, startup inflates the number; a more realistic figure is ~13% while the system is idling.

Events per worker

[graph: blocked/parked/polled events per worker]

The sum of all events (blocked, parked, polled) across all workers.
For instance, with the SelectorSystem, CE 3.6.1 polls 7,690,942 times per 5 minutes, whereas CE 3.6.1-25-d807be0 polls 11,766 times.

Ah, the poll number should be even lower, if I get it right: #4377 (comment).

@durban (Contributor) commented Apr 17, 2025

@djspiewak Yeah, I haven't looked at doneSleeping yet; that makes sense. Also, I completely understand being cautious.


Something else: https://github.com/typelevel/cats-effect/actions/runs/14459016996/job/40548044386?pr=4377#step:22:4843 This is the second time I've seen this branch hang in MutexSpec. I fear this could be something real...

@djspiewak djspiewak closed this May 24, 2025
@djspiewak djspiewak reopened this May 24, 2025
@djspiewak (Member, Author)

This does feel like something real, but it's definitely kinda nuts. It also surprises me that MutexSpec is reproducing it rather than one of the Deferred variants, which are normally where we find this sort of thing.
