[🐛 Bug]: java.lang.OutOfMemoryError #2528

Open
Doofus100500 opened this issue Dec 24, 2024 · 39 comments

@Doofus100500
Contributor

Doofus100500 commented Dec 24, 2024

What happened?

Getting an OOM (java.lang.OutOfMemoryError) in the event bus container.
image

Command used to start Selenium Grid with Docker (or Kubernetes)

helm

Relevant log output

{"class": "EventBusCommand","log-level": "INFO","log-message": "Started Selenium EventBus 4.26.0 (revision 69f9e5e): https:\u002f\u002f10.232.86.222:5557","log-name": "org.openqa.selenium.grid.commands.EventBusCommand","log-time-local": "2024-12-14T07:31:37.796Z","log-time-utc": "2024-12-14T07:31:37.796Z","method": "execute"}
Exception in thread "iothread-2" java.lang.OutOfMemoryError: Cannot reserve 8192 bytes of direct buffer memory (allocated: 501211210, limit: 501219328)
    at java.base/java.nio.Bits.reserveMemory(Bits.java:178)
    at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:121)
    at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:332)
    at zmq.io.coder.DecoderBase.<init>(DecoderBase.java:46)
    at zmq.io.coder.Decoder.<init>(Decoder.java:71)
    at zmq.io.coder.v2.V2Decoder.<init>(V2Decoder.java:18)
    at zmq.io.StreamEngine.handshake(StreamEngine.java:805)
    at zmq.io.StreamEngine.inEvent(StreamEngine.java:386)
    at zmq.io.IOObject.inEvent(IOObject.java:85)
    at zmq.poll.Poller.run(Poller.java:275)
    at java.base/java.lang.Thread.run(Thread.java:840)
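
The numbers in the exception can be checked by hand: the limit of 501219328 bytes is exactly 478 MiB, and only 501219328 - 501211210 = 8118 bytes remained, which is less than the 8192 bytes requested. The pool in question is direct (off-heap) buffer memory, not the Java heap. It is capped by -XX:MaxDirectMemorySize and, as far as I know, falls back to the maximum heap size when that flag is unset. Below is a minimal sketch of that arithmetic, plus a way to read the direct pool usage of a running JVM through the standard BufferPoolMXBean. The class name DirectMemoryCheck and the simplified canReserve rule are assumptions for illustration; the real check in java.nio.Bits is more involved.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper, not part of Selenium or the JDK.
class DirectMemoryCheck {

    // Simplified model: a reservation fits only if it does not exceed the remaining headroom.
    static boolean canReserve(long allocated, long limit, long requested) {
        return requested <= limit - allocated;
    }

    // Bytes used by the JVM's "direct" buffer pool, or -1 if that pool is not reported.
    static long directBytesUsed() {
        List<BufferPoolMXBean> pools =
            ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long allocated = 501_211_210L; // "allocated" from the log line above
        long limit = 501_219_328L;     // "limit" from the log line above

        System.out.println("headroom bytes: " + (limit - allocated));
        System.out.println("can reserve 8192: " + canReserve(allocated, limit, 8192));

        // Values for the JVM running this sketch, not for the Grid component.
        System.out.println("max heap bytes: " + Runtime.getRuntime().maxMemory());
        System.out.println("direct pool bytes used: " + directBytesUsed());
    }
}
```

If the default really is in effect, raising -Xmx (or setting -XX:MaxDirectMemorySize explicitly) would also raise the direct buffer cap, though that would only postpone the error if something is leaking.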

Operating System

k8s

Docker Selenium version (image tag)

4.26.0-20241101

Selenium Grid chart version (chart version)

0.37.1


@VietND96
Member

It looks like the actual memory usage did not reach the range set by the resource requests and limits.
In the latest change, I added a default SE_JAVA_OPTS for all components (in the server ConfigMap, which all components reference) that sets -Xmx and -Xms for the Selenium server JVM.

SE_JAVA_OPTS: "-XX:+UseG1GC -Xmx1024m -Xms256m -XX:MaxGCPauseMillis=1000 -Djdk.httpclient.keepalive.timeout=300 -Djdk.httpclient.maxstreams=10000"

Can you check whether it helps?

@VietND96
Member

@joerg1985, do you have any comment on this?

@Doofus100500
Contributor Author

> -Xmx1024m -Xms256m

This is extremely low as a default for all components. In my opinion, these parameters need to be configurable for each component individually. Under load, consumption increases significantly.

@VietND96
Member

I think you can override the global value via extraEnvironmentVariables in each component.

@Doofus100500
Contributor Author

But this is not reflected in the chart for the eventBus and other distributed components

@VietND96
Member

Oh really? Can you share an example of the YAML values you are setting?

@Doofus100500
Contributor Author

For example, to address the issue with the event-bus mentioned in this issue, I added the following through k9s:

- name: SE_JAVA_OPTS  
  value: -Xmx2g

@VietND96
Member

I just checked: in the chart config, all distributed components refer to the same setting for extra env vars, components.extraEnvironmentVariables.

@Doofus100500
Contributor Author

That’s exactly what I’m saying. I want to set appropriate parameters for each component individually, rather than, for example, setting -Xmx16g for all of them.

@VietND96
Member

Yes, I understand the problem now. I will add that config for each component, instead of only the common one.

@VietND96
Member

Have you observed anything else that you think should also be fixed in chart 0.38.3?

@Doofus100500
Contributor Author

Unfortunately, I haven’t even looked into it yet. If I find anything, I’ll definitely come back in the future.

VietND96 added a commit that referenced this issue Dec 26, 2024
@joerg1985
Member

@VietND96 I had a short look at the code of EventBusCommand, and from reading it (without debugging) I would expect a leak in the /status call. It adds a listener but never removes it. I will put this on my todo list.
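
The leak pattern described here, a listener added on every call and never removed, can be sketched as follows. TinyBus and the two status methods are hypothetical stand-ins, not Selenium's EventBus API or the actual fix; the point is only that each leaky call leaves one more entry behind.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical illustration of a listener leak; not Selenium code.
class ListenerLeakSketch {

    // Tiny stand-in for an event bus that keeps its listeners in a list.
    static class TinyBus {
        private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

        void addListener(Consumer<String> listener) { listeners.add(listener); }
        void removeListener(Consumer<String> listener) { listeners.remove(listener); }
        int listenerCount() { return listeners.size(); }
    }

    // Leaky variant: registers a listener on every call and never removes it.
    static void statusLeaky(TinyBus bus) {
        bus.addListener(event -> { });
    }

    // Fixed variant: the listener is removed once the call is done.
    static void statusFixed(TinyBus bus) {
        Consumer<String> listener = event -> { };
        bus.addListener(listener);
        try {
            // A real handler would wait for the reply event here.
        } finally {
            bus.removeListener(listener);
        }
    }

    public static void main(String[] args) {
        TinyBus leaky = new TinyBus();
        TinyBus fixed = new TinyBus();
        for (int i = 0; i < 1000; i++) {
            statusLeaky(leaky);
            statusFixed(fixed);
        }
        System.out.println("leaky listeners: " + leaky.listenerCount());
        System.out.println("fixed listeners: " + fixed.listenerCount());
    }
}
```

Each retained listener is small, so growth of this kind only becomes visible after a large number of calls.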

@joerg1985
Member

The leaking listeners have been fixed in SeleniumHQ/selenium@269a7f6, but I am not sure this is the root cause here: only a few bytes are leaked for each call to /status, so the grid would have to be up for several days before this becomes visible.

@Doofus100500
Contributor Author

Doofus100500 commented Dec 28, 2024

Actually, in our case, we expect the grid (except for the pods with browsers) to always be operational. Could you please also check the other components for leaks?
[4 images]

@joerg1985
Member

@Doofus100500 I think the best approach would be to create a heap histogram with jmap and share it here.
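
With the JDK tools available in the container, jmap -histo:live <pid> (or jcmd <pid> GC.class_histogram) prints such a histogram. For completeness, a HotSpot JVM can also produce the same kind of table for itself through the DiagnosticCommand MBean. The sketch below assumes a HotSpot-based JDK 11 or newer and uses a made-up class name (InProcessHistogram). It only inspects the JVM it runs in, so for the Grid components running jmap or jcmd against the component's process remains the practical route.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper; relies on the HotSpot-specific DiagnosticCommand MBean.
class InProcessHistogram {

    // Returns the class histogram of the current JVM as text.
    static String classHistogram() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("com.sun.management:type=DiagnosticCommand");
            // The operation takes a single String[] of extra arguments; null means none.
            return (String) server.invoke(
                name,
                "gcClassHistogram",
                new Object[] { null },
                new String[] { String[].class.getName() });
        } catch (Exception e) {
            throw new IllegalStateException("class histogram not available on this JVM", e);
        }
    }

    public static void main(String[] args) {
        // Print only the header and the largest entries.
        classHistogram().lines().limit(15).forEach(System.out::println);
    }
}
```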

@Doofus100500
Contributor Author

Unfortunately, I will only be able to take care of this after the 9th.

@VietND96
Member

VietND96 commented Jan 2, 2025

Via #2546, I added a way to get a heap dump on OutOfMemoryError (HeapDumpOnOutOfMemoryError), or on demand when the container is terminated or stopped. The dumps are written to /opt/selenium/logs; mount a volume at that directory in the container to persist the output files.

@joerg1985
Member

@Doofus100500 please wait for the next release before testing, this might be the fix for your issue: SeleniumHQ/selenium#15011

@Doofus100500
Contributor Author

Hi @VietND96, have you considered using -XX:MaxRAMPercentage and -XX:MinRAMPercentage instead of -Xmx and -Xms? It seems like a good solution for the general configuration in:

SE_JAVA_OPTS: "-XX:+UseG1GC -Xmx1024m -Xms256m -XX:MaxGCPauseMillis=1000 -Djdk.httpclient.keepalive.timeout=300 -Djdk.httpclient.maxstreams=10000"

@Doofus100500
Contributor Author

I’m just unsure what percentage to set for MaxRAMPercentage, could you help me with that?

@VietND96
Member

Hi, I am not sure about this one either. I will try to understand it and let you know if I find something.

@VietND96
Member

I read something related: https://stackoverflow.com/questions/75025893/is-jvm-heap-memory-option-xxmaxrampercentage-only-valid-for-dockerized-applic

> When you run the application in a dedicated container, together with a known set of programs or no other programs at all, you most probably want to specify the maximum amount of memory in relation to the container’s memory, so when you want to change the available memory, you only have to reconfigure the container instead of needing to adapt all programs’ start configurations

With docker-selenium, each component (Hub/Router/Distributor/SessionQueue/SessionMap/EventBus) runs in a dedicated container with a single program, so it can be allowed to use the maximum amount with -XX:MaxRAMPercentage=100.
For the Node component, the browser also consumes memory besides the program, so let it use half with -XX:MaxRAMPercentage=50.
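
To make the trade-off concrete, here is a rough sketch of the arithmetic behind -XX:MaxRAMPercentage: the maximum heap is approximately that percentage of the container memory limit, and whatever is left has to cover everything else (direct buffers, metaspace, thread stacks and, on a Node, the browser). The 2 GiB limit and the HeapBudget class are hypothetical, and the JVM may align the real heap size slightly differently.

```java
// Hypothetical helper showing the percentage arithmetic only; not JVM or Selenium code.
class HeapBudget {

    // Approximate max heap for a given container limit and MaxRAMPercentage value.
    static long maxHeapBytes(long containerLimitBytes, double maxRamPercentage) {
        return (long) (containerLimitBytes * maxRamPercentage / 100.0);
    }

    // Memory left inside the container limit for everything that is not Java heap.
    static long nonHeapHeadroomBytes(long containerLimitBytes, double maxRamPercentage) {
        return containerLimitBytes - maxHeapBytes(containerLimitBytes, maxRamPercentage);
    }

    public static void main(String[] args) {
        long limit = 2L * 1024 * 1024 * 1024; // assumed 2 GiB container limit
        for (double pct : new double[] { 50.0, 75.0, 100.0 }) {
            System.out.println(
                "MaxRAMPercentage=" + pct
                + " heap=" + maxHeapBytes(limit, pct)
                + " headroom=" + nonHeapHeadroomBytes(limit, pct));
        }
    }
}
```

At 100 percent the headroom is zero, which matters for this issue because the error reported above concerns direct buffer memory rather than heap.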

@joerg1985
Member

@VietND96 the JVM should detect the container environment and adjust these values automatically, see https://bugs.openjdk.org/browse/JDK-8146115 for details.

@VietND96
Member

@joerg1985, yes, but in a few of the graph screenshots above, the OOM happened while the memory actually consumed had not reached the range allowed by requests and limits. What is your view?

@joerg1985
Member

There are separate limits for the different JVM memory areas, so setting MaxRAMPercentage might not help here. When it is set to 100%, the heap can take all the memory, but what about the other memory areas? They also need some memory.

I don't think we need to fine-tune the memory management; we need to find the root cause of the leak.
It might already have been fixed, so let's wait for @Doofus100500's feedback with version 4.28.0.

@Doofus100500
Contributor Author

I’m currently experiencing issues with 4.28 and have opened an issue: #2655

@Doofus100500
Contributor Author

Doofus100500 commented Feb 25, 2025

Updated to 0.40.0 (4.29.0-20250222)

[5 images]

@Doofus100500
Contributor Author

@joerg1985 Hi, here’s the heap histogram from the distributor, and I’m also attaching a screenshot from Grafana.

Image

heap_histogram.txt

@joerg1985
Member

@Doofus100500 I had a look at the heap histogram and I am wondering why things like okhttp3 are loaded. As far as I know, this should not be on the classpath. Are you adding things to the classpath, or is there a special config, e.g. a CustomLocatorHandler?

@Doofus100500
Contributor Author

Doofus100500 commented Mar 12, 2025

@joerg1985 I’m not adding anything specific myself; I’m deploying with the chart I shared here. Could this be related to the fact that I’m enabling TLS? @VietND96, could you please take a look?

@VietND96
Member

It looks like CustomLocatorHandler comes from Node.java, where new CustomLocatorHandler(this, registrationSecret, customLocators) is created during Node registration.
I guess a lot of Node registrations are hitting the Distributor: the setup here is autoscaling with one-time Nodes, so if it runs for a long time and serves thousands of sessions, the number of registration events is equally large.
@joerg1985, do you have any idea how to optimize this?

@Doofus100500
Contributor Author

Would a hotfix be to exclude the use of registrationSecret?

@joerg1985
Member

> @joerg1985 I’m not adding anything specific myself, i’m deploying with the chart I shared here. Could this be related to the fact that I’m enabling TLS? @VietND96 , could you please take a look?

I think the easiest way to check this is to disable it temporarily. I would also suggest setting useHttp2 to false and lowering the upstreamKeepalive settings for testing.

@Doofus100500
Contributor Author

@joerg1985 I tried disabling TLS, but okhttp3 is still present.

> I would also suggest to set useHttp2 to false and lower the upstreamKeepalive settings for testing.

I don’t quite understand how this can help.

@VietND96 Disabling registrationSecret doesn’t help with the memory leak either.

@joerg1985
Member

> @joerg1985 I tried disabling TLS, but okhttp3 is still present.
>
> > I would also suggest to set useHttp2 to false and lower the upstreamKeepalive settings for testing.
>
> I don’t quite understand how this can help.
>
> @VietND96 Disabling registrationSecret doesn’t help with the memory leak either.

We do not use any HTTP/2 features at all, so there should be no benefit, but there are a lot of potential issues.

@Doofus100500
Contributor Author

@joerg1985 Hi, I currently have publishing set up through a NodePort on an external balancer. Here’s the configuration:

        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
        client_max_body_size 0;
        proxy_http_version 1.1;
        proxy_next_upstream_tries 5;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Host $host;
        proxy_set_header X-Scheme $scheme;
        proxy_set_header X-Real-IP $remote_addr_from_proxy;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-Port $server_port;
        proxy_set_header X-Forwarded-Host $host;
        proxy_set_header X-Forwarded-For $remote_addr_from_proxy;
        proxy_set_header X-Request-ID $request_id;
        proxy_set_header Traceparent $otel_traceparent;

Does proxy_http_version 1.1; exclude the use of HTTP/2?

@Doofus100500
Contributor Author

Doofus100500 commented Apr 10, 2025

@joerg1985 @VietND96 Hi, updating to selenium-grid-0.42.0 (4.31.0-20250404) and disabling useHttp2 (useHttp2: false in the chart) did not have any effect. I've attached a screenshot of the resource consumption; the growth on the graph is where I launched many sessions. Do you have any ideas on what else to try?

Image
