
The Polling System CPU overhead #4328

Open
DeviLab opened this issue Mar 28, 2025 · 16 comments · May be fixed by #4377

@DeviLab

DeviLab commented Mar 28, 2025

The new polling system looks interesting for cases where libraries actually use it, but when I updated my app to CE 3.6 I saw a significant CPU increase (more than 70%). According to the profiler it is caused by EPoll-related calls from the Cats Effect code. When I overrode the polling system to SleepSystem in the main class, the problem went away and CPU consumption was back to normal:

import cats.effect.unsafe.{PollingSystem, SleepSystem}
// ...

object Main extends IOApp {
  override protected def pollingSystem: PollingSystem = SleepSystem
  // ...
}

There are still a lot of apps and libraries which use polling mechanisms directly and can't be re-implemented to use the new feature. It means that users may see performance degradation just because they bumped Cats Effect to 3.6. Do you have any ideas how this issue can be addressed? Does every app which doesn't rely on the polling system require such an explicit change?

@djspiewak
Member

djspiewak commented Mar 28, 2025

I would consider this to be a bug.

Is this on JVM or Native? I would assume JVM? We do have the capability to gate polling based on whether any events have actually been registered by higher level code, but we don't actually do that yet. That is likely the solution here. In the meantime, the workaround (explicitly swapping to SleepSystem) feels correct, albeit annoying.

@djspiewak
Member

djspiewak commented Mar 28, 2025

Actually correction: we did implement the event gating for the SelectorSystem, though only for the hot path variant (not the parking). Do you have profiling results which indicate which call is bearing all the load? In particular, are you seeing increased overhead in parkLoop or directly in run (in WorkerThread)? Also is it at all possible that parts of your app are using fs2-io but other parts aren't? So then you could be in a situation where you're using the polling system for a few small things but not the major stuff.

@DeviLab
Author

DeviLab commented Mar 28, 2025

Yes, it's JVM. My profiling results point at various methods which call some EPoll API and consume more CPU than in 3.5, but nothing related to parkLoop. For example, WorkerThread.parkUntilNextSleeper calls EPollSelectorImpl.processEvents and consumes extra CPU:

[profiler screenshots: 3.6 vs 3.5]

We do use fs2-io as a part of http4s-ember-client, but it's an old version (3.11), which doesn't support the polling system yet.

@djspiewak
Member

Yeah it's definitely the park loop. Need to think about this a bit. The tricky thing here is that waking up a thread becomes an exercise in statefulness if we make this bit contingent on outstanding events. I think that's definitely the right thing to do though.

@iRevive
Contributor

iRevive commented Mar 28, 2025

Could you check poller/worker-thread metrics by any chance?

They are available via MBean out of the box.
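
For anyone who wants to dump those beans without wiring up a metrics exporter, here is a minimal sketch using the standard JMX API from inside the running app. The cats.effect.unsafe.metrics object-name domain is an assumption on my part; check jconsole/VisualVM for the exact names your runtime registers.

import java.lang.management.ManagementFactory
import javax.management.ObjectName
import scala.jdk.CollectionConverters._

object CeMBeanDump {
  // Call this from inside the application's own JVM (or adapt it to a remote JMX connection).
  def dump(): Unit = {
    val server = ManagementFactory.getPlatformMBeanServer
    // Assumed domain for the runtime's MBeans; verify the exact pattern in a JMX console.
    val pattern = new ObjectName("cats.effect.unsafe.metrics:*")
    server.queryNames(pattern, null).asScala.foreach { name =>
      val attrs  = server.getMBeanInfo(name).getAttributes.map(_.getName)
      val values = attrs.map(a => s"$a=${server.getAttribute(name, a)}")
      println(s"$name -> ${values.mkString(", ")}")
    }
  }
}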

@armanbilge
Member

@DeviLab sorry if I missed it, what is your Linux version and your JVM version?

@DeviLab
Author

DeviLab commented Mar 31, 2025

bash-5.1$ uname -a
Linux preprod-***-7ddb768f67-kp9xt 5.14.0-427.37.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Sep 24 08:06:42 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux
bash-5.1$ java -version
java version "21.0.6" 2025-01-21 LTS
Java(TM) SE Runtime Environment (build 21.0.6+8-LTS-188)
Java HotSpot(TM) 64-Bit Server VM (build 21.0.6+8-LTS-188, mixed mode, sharing)

@DeviLab
Author

DeviLab commented Mar 31, 2025

@iRevive, I can't check it via MBean, but I checked them directly in IORuntime: all poller metrics are 0.

@iRevive
Contributor

iRevive commented Apr 2, 2025

I've observed increased CPU usage, too.

Service A (has traffic). CE 3.6 was deployed at 09:05.

[CPU usage graphs]

Service B (nearly zero traffic). CE 3.6 was deployed at 12:50.

[CPU usage graphs]

[Worker thread graphs]

[Local queue graphs]

[Timer heap graph]

The following change brought CPU usage back to normal:

override protected def pollingSystem: PollingSystem = SleepSystem

@djspiewak
Member

@iRevive First off, you have lovely dashboards. Second, am I reading this right that when you deployed 3.6, suddenly a ton of actual work materialized on the work queues? Something is really, really odd then. I can't imagine what work could have just come out of nowhere; this is different from the selector overhead.

I wonder if it's worth grabbing a fiber dump just to eyeball what's even going on during the idle time?

@iRevive
Contributor

iRevive commented Apr 2, 2025

Second, am I reading this right that when you deployed 3.6, suddenly a ton of actual work materialized on the work queues?

We weren't using IO Runtime metrics before 3.6. But I will try to grab a fiber dump.

@djspiewak
Member

Had a chat with @armanbilge last night and I'm fairly certain the optimal way to resolve this is by conditionally parking using more efficient methods and then dealing with the stateful race condition on the interrupter side. It's very tricky and adds meaningful complexity but it's not at all impossible, and it should mean that applications which don't touch the PollingSystem end up seeing zero overhead.

djspiewak self-assigned this Apr 13, 2025
djspiewak added this to the v3.6.0 milestone Apr 13, 2025
@djspiewak
Member

@DeviLab Would you be able to test with 3.6.1-25-d807be0 by any chance? You can remove your override def pollingSystem workaround. You should see results (with this snapshot) that are very close to 3.5.x, though I wouldn't be surprised if there's a slight step back relative to that baseline.
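
(For anyone else who wants to try a hash-suffixed build like this, a minimal build.sbt sketch; it assumes the build is published to the Sonatype snapshots repository, which may not match how these builds are actually distributed:)

// build.sbt (sketch; the snapshot repository is an assumption)
resolvers ++= Resolver.sonatypeOssRepos("snapshots")
libraryDependencies += "org.typelevel" %% "cats-effect" % "3.6.1-25-d807be0"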

@DeviLab
Author

DeviLab commented Apr 14, 2025

I tested your fix. Initially I saw almost no difference between the fix and vanilla 3.6 without the overridden pollingSystem. Then I looked at your PR and realized that this was probably because we recently updated fs2 to 3.12 (which has the polling system integration). I rolled back fs2 to 3.11 and it worked perfectly. So the fix works, but only when the polling system is not used at all. In our case we have http4s-ember-client (which uses fs2-io under the hood) in the project, but it's not so heavily loaded: as long as fs2 doesn't rely on the polling system from CE, only one thread is used to call EPoll.wait, but when I set pollingSystem to SelectorSystem, every WorkerThread begins doing so. It would be nice to have a mechanism which would make polling not so aggressive.

[profiler screenshot]

@armanbilge
Member

armanbilge commented Apr 14, 2025

Thank you so much for testing and sharing results!

In our case we have http4s-ember-client (which uses fs2-io under the hood) in the project, but it's not so heavily loaded ... but when I set pollingSystem to SelectorSystem, every WorkerThread begins doing so. It would be nice to have a mechanism which would make polling not so aggressive.

That's interesting. In Daniel's #4377, a WorkerThread only transitions to polling (instead of ordinary sleeping) if there is outstanding I/O on that thread, and it will transition back once the I/O is completed. In other words, it's already not very aggressive, because each thread decides individually and dynamically whether to use polling every time it sleeps.

So if all the worker threads are using polling all the time, then it's because there are lots of I/O events happening all the time.
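
To make that gating concrete, here is a heavily simplified, illustrative sketch of the idea. It is not the actual WorkerThread code from #4377; the names (GatedParker, Poller, Sleeper, outstandingEvents) are invented for illustration.

// Simplified illustration of per-thread gating between polling and plain sleeping.
// This is NOT the real cats-effect WorkerThread implementation.
final class GatedParker(poller: Poller, sleeper: Sleeper) {
  // I/O events registered on this thread and not yet completed
  private[this] var outstandingEvents: Int = 0

  def registerEvent(): Unit = outstandingEvents += 1
  def completeEvent(): Unit = outstandingEvents -= 1

  // Called when the worker runs out of fibers to execute.
  def park(timeoutNanos: Long): Unit =
    if (outstandingEvents > 0)
      poller.poll(timeoutNanos)   // must poll: completions arrive only via the selector/epoll
    else
      sleeper.sleep(timeoutNanos) // nothing outstanding: cheap timed park, no selector involved
}

trait Poller  { def poll(timeoutNanos: Long): Unit }
trait Sleeper { def sleep(timeoutNanos: Long): Unit }

The real implementation additionally has to deal with the wake-up race Daniel mentioned above: another thread that wants to unpark a worker has to know which way it parked.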

@djspiewak
Member

Two of the core assumptions of the integrated runtime are that 1) spreading data across all CPU cores is good, and 2) context shifts are super expensive. So it doesn't actually take that many I/O events to end up in a situation where every thread needs to poll, because the alternative would have been either asymmetric load (in continuation handling) or more context shifts (bouncing to another CPU after event completion).

You can see some hints of this phenomenon in the graphs actually. I would guess that ember-client isn't really fully saturating the cores, and some of our parking is probably simple (non-polling), which is why we see some irregularity in the CPU load. One plausible theory here is that we may be hitting a bit of an uncanny valley in your application: just enough polling system use to force us to take the penalty of selector use, but just enough non-polling system I/O that the penalty might not pay for itself. I understand you're still using Blaze for the server?

I would love to understand whether the end-to-end latency and throughput metrics are regressed going from fs2 3.11 to 3.12, or if this is just something which is showing up in the CPU utilization. To be clear, we know that Selector is really inefficient. There may be some ways we can improve that, but the endgame here is to bypass it entirely and go directly to epoll, which should resolve much of this incremental overhead and make the uncanny valley a lot narrower.
