The Polling System CPU overhead #4328
I would consider this to be a bug. Is this on JVM or Native? I would assume JVM? We do have the capability to gate polling based on whether any events have actually been registered by higher level code, but we don't actually do that yet. That is likely the solution here. In the meantime, the workaround is explicitly swapping to `SleepSystem`.
Actually, correction: we did implement the event gating for the …
Yes, it's JVM. My profiling results point at various methods which call some epoll API and consume more CPU than in 3.5, but nothing related to fs2. We do use fs2-io as part of http4s-ember-client, but it's an old version (3.11), which doesn't support the polling system yet.
Yeah, it's definitely the park loop. Need to think about this a bit. The tricky thing here is that waking up a thread becomes an exercise in statefulness if we make this bit contingent on outstanding events. I think that's definitely the right thing to do though.
Could you check poller/worker-thread metrics by any chance? They are available via MBean out of the box.
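For reference, a minimal sketch of reading those metrics through plain JMX. It assumes the runtime registers its samplers under the `cats.effect.unsafe.metrics` domain; that domain name (and the `CeMetricsDump` helper itself) is an assumption for illustration, so adjust the `ObjectName` pattern to whatever your JMX console shows:

```scala
import java.lang.management.ManagementFactory
import javax.management.ObjectName
import scala.jdk.CollectionConverters._
import scala.util.Try

object CeMetricsDump {
  // Call this from inside the running application: the MBeans only exist
  // while the Cats Effect runtime is live in the same JVM.
  def dump(): Unit = {
    val server = ManagementFactory.getPlatformMBeanServer
    val names  = server.queryNames(new ObjectName("cats.effect.unsafe.metrics:*"), null).asScala

    names.foreach { name =>
      println(name)
      server.getMBeanInfo(name).getAttributes.foreach { attr =>
        val value = Try(server.getAttribute(name, attr.getName)).getOrElse("<unreadable>")
        println(s"  ${attr.getName} = $value")
      }
    }
  }
}
```

A JMX console such as JConsole or VisualVM would show the same attributes interactively under that domain.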
@DeviLab sorry if I missed it, what is your Linux version and your JVM version?
@iRevive, I can't check it via MBean, but I can check it directly in …
I've observed increased CPU usage, too.

Service A (has traffic): the CE 3.6 version was deployed at 09:05.

Service B (nearly zero traffic): the CE 3.6 version was deployed at 12:50.

The following change brought CPU usage back to normal: `override protected def pollingSystem: PollingSystem = SleepSystem`
@iRevive First off, you have lovely dashboards. Second, am I reading this right that when you deployed 3.6, suddenly a ton of actual work materialized on the work queues? Something is really, really odd then. I can't imagine what work could have just come out of nowhere; this is different from the selector overhead. I wonder if it's worth grabbing a fiber dump just to eyeball what's even going on during the idle time?
We weren't using IO Runtime metrics before 3.6. But I will try to grab a fiber dump. |
Had a chat with @armanbilge last night and I'm fairly certain the optimal way to resolve this is by conditionally parking using more efficient methods and then dealing with the stateful race condition on the interrupter side. It's very tricky and adds meaningful complexity, but it's not at all impossible, and it should mean that applications which don't touch the polling system avoid this overhead entirely.
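To make the idea concrete, here is a rough, purely illustrative sketch (not the actual Cats Effect implementation): the worker publishes *how* it is about to park through an atomic, so the waking thread knows whether a cheap `LockSupport.unpark` is enough or whether it has to interrupt the selector. The `Worker` class, the `registeredEvents` callback, and the state encoding are all invented for the example:

```scala
import java.nio.channels.Selector
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.locks.LockSupport

final class Worker(selector: Selector, registeredEvents: () => Int) extends Thread {
  // 0 = running, 1 = parked via LockSupport, 2 = blocked in Selector.select()
  private[this] val parkMode = new AtomicInteger(0)

  def wakeUp(): Unit =
    parkMode.getAndSet(0) match {
      case 1 => LockSupport.unpark(this) // cheap wakeup, no selector syscall
      case 2 => selector.wakeup()        // must interrupt the blocking select()
      case _ => ()                       // worker wasn't parked
    }

  private[this] def park(): Unit =
    if (registeredEvents() == 0) {
      parkMode.set(1)
      // Re-check after publishing the mode; a real implementation would also
      // re-check for newly submitted work here so a concurrent wakeUp isn't lost.
      if (registeredEvents() == 0) LockSupport.park() else parkMode.set(0)
    } else {
      parkMode.set(2)
      selector.select() // polls for I/O events and parks in a single call
      parkMode.set(0)
    }

  override def run(): Unit =
    while (!Thread.interrupted()) {
      // ... run fibers from the local queue, then park when idle:
      park()
    }
}
```

The interesting part is exactly the statefulness described above: the waker cannot know in advance which way the worker parked, so that decision has to be made observable before the worker actually blocks.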
@DeviLab Would you be able to test with a build containing the fix?
I tested your fix. Initially I saw almost no difference between the fix and vanilla 3.6 without the overridden `pollingSystem`.
Thank you so much for testing and sharing results!
That's interesting. In Daniel's #4377, a worker thread only polls when it actually has events registered. So if all the worker threads are using polling all the time, then it's because there are lots of I/O events happening all the time.
Two of the core assumptions of the integrated runtime are that 1) spreading data across all CPU cores is good, and 2) context shifts are super expensive. So it doesn't actually take that many I/O events to end up in a situation where every thread needs to poll, because the alternative would have been either asymmetric load (in continuation handling) or more context shifts (bouncing to another CPU after event completion).

You can see some hints of this phenomenon in the graphs actually. I would guess that ember-client isn't really fully saturating the cores, and some of our parking is probably simple (non-polling), which is why we see some irregularity in the CPU load. One plausible theory here is that we may be hitting a bit of an uncanny valley in your application: just enough polling system use to force us to take the penalty of selector use, but just enough non-polling-system I/O that the penalty might not pay for itself.

I understand you're still using Blaze for the server? I would love to understand whether the end-to-end latency and throughput metrics are regressed going from fs2 3.11 to 3.12, or if this is just something which is showing up in the CPU utilization. To be clear, we know that selector use carries a real cost; the question is whether it's paying for itself here.
The new polling system looks interesting for the cases when there are libraries which use it, but when I updated my app to CE 3.6, I faced a significant CPU increase (more than 70%). According to the profiler it is caused by epoll-related calls from the Cats Effect code. When I overrode the polling system to `SleepSystem` in the main class, the problem was gone and the CPU consumption was back to normal.

There are still a lot of apps and libraries which use the polling mechanisms directly and can't be re-implemented to use the new feature. It means that users can see some performance degradation just because they bumped Cats Effect to 3.6. Do you have any ideas how this issue can be addressed? Does every app which doesn't rely on the polling system require such an explicit change?
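For concreteness, here is a minimal, self-contained sketch of that workaround, assuming the `IOApp` hook shown above and the `cats.effect.unsafe.{PollingSystem, SleepSystem}` import paths from Cats Effect 3.6 (double-check them against your version):

```scala
import cats.effect.{IO, IOApp}
import cats.effect.unsafe.{PollingSystem, SleepSystem}

object Main extends IOApp.Simple {
  // Opt the whole app out of the integrated selector-based poller; worker
  // threads fall back to plain timed parking, roughly the pre-3.6 behaviour.
  override protected def pollingSystem: PollingSystem = SleepSystem

  def run: IO[Unit] =
    IO.println("running without the integrated polling system")
}
```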