Quorum queues can enter a state they cannot recover from due to a timeout #13828
Comments
@matthew-s-walker the sooner you can, the better. That way when a fix is made, a means to test it is readily available.
@kjnilsson - would having a script to reproduce this issue speed up your resolution, or is it only needed to test the fix?
Because it will be very environment-specific, I don't think a script would help. In short, we just need to handle the timeout and roll back the queue creation like we do for other errors.
Thanks for the quick responses, all. Apologies for the delay; it's been very busy. You're right on the money spotting the IOPS issue - I can confirm that the queue creation stress test ran at ~35 queue creations/deletions per second for 2 days straight without any issue after switching the volumes over to GP3. I'll note again that this is significantly higher queue attrition than our production system sees; I was just stating the absolute worst-case scenario and the configuration required to actually reproduce the issue outside of our production environment. Typical queue creation rates are <1/s.

I suspect that we will wait until whichever release this fix eventually goes into before switching back to quorum queues, as we wouldn't want to risk a similar repeat incident. Unfortunately, non-highly-available queues are no longer acceptable for our use case - the repercussions of the node failures that we've experienced recently are too great.

@kjnilsson In terms of testing any fixes, it's pretty easy for me to get our system back into the original configuration, and I'd be very happy to run the test against a pre-release container image or do manual container builds if that process is straightforward. I imagine it's possible that there are other areas which need extra timeout handling as well. Please let me know if you end up prioritising this ticket and would like me to do that.
I am clarifying with @kjnilsson where specifically we want those timeouts to be handled.
So the place that needs to handle an error and delete (roll back) the QQ cluster member placement is https://github.com/rabbitmq/ra/blob/ae8cbf2de8325d9665d3789b1d4817f5ddee60cb/src/ra.erl#L396, and to make that more feasible it should return {ok, Started, NotStarted} | {error, timeout} instead of throwing an error. rabbitmq/ra#539 describes the changes in Ra.
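For illustration, here is a minimal sketch of a caller consuming that return shape, assuming the Ra system name, cluster name, machine config and server ids are built the same way RabbitMQ builds them today. The module name and rollback strategy below are hypothetical and are not the actual rabbit_quorum_queue change:

```erlang
-module(qq_placement_sketch).
-export([place/4]).

%% Illustrative only: not the real rabbit_quorum_queue code path.
%% RaSystem, ClusterName, Machine and ServerIds are assumed to be
%% constructed by the caller.
place(RaSystem, ClusterName, Machine, ServerIds) ->
    case ra:start_cluster(RaSystem, ClusterName, Machine, ServerIds) of
        {ok, _Started, _NotStarted} ->
            ok;
        {error, timeout} ->
            %% The return proposed in rabbitmq/ra#539: instead of the error
            %% being thrown, the caller can roll back the placement by
            %% force-deleting any member that may have been started.
            _ = [(catch ra:force_delete_server(RaSystem, Id)) || Id <- ServerIds],
            {error, timeout};
        {error, Reason} ->
            %% Other errors (e.g. cluster_not_formed) get the same rollback.
            _ = [(catch ra:force_delete_server(RaSystem, Id)) || Id <- ServerIds],
            {error, Reason}
    end.
```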
Closes #539. References rabbitmq/rabbitmq-server#13828.
rabbitmq/ra#540 is in, we just need to produce a new release.
Ra was bumped to a release that includes this change.
Moving back to an issue because @kjnilsson has identified something to address regardless of the quorum queue churn (a workload they were never designed for).
Originally filed by @matthew-s-walker.
Discussed in #13827
Originally posted by matthew-s-walker April 29, 2025
Describe the bug
Hi,
Firstly, I want to thank you for your work on RabbitMQ. It has been a rock solid core component of our system for many years.
We migrated all of our queues to the quorum queue type recently but have unfortunately encountered stability problems in our production environment.
Our system creates temporary queues, often up to 50 within a second or so, totalling roughly 20,000 per day.
After migrating, we found that within a few hours some queues (typically several created at similar times) will go into a state where:
The issue either occurs immediately after/during creation or within 2-3 minutes of creation.
We can reproduce the behaviour on the following versions of RabbitMQ, but the errors logged by the servers are different in at least 4.1.0:
On 4.0.1 and below, we receive various "badmatch"/"timeout" errors, which I can provide if wanted.
Our cluster setup is:
Typical cluster load is < 1000 total queues, < 500 total messages per second. The vast majority of messages are < 4KiB.
The issue reproduces with:
Here is an example of a queue going into a bad state with 4.1.0 (I am happy to provide logs from earlier versions as well):
server 0:
server 1:
server 2:
I have attempted to create a reproducer program, but unfortunately I'm currently struggling to trigger the issue with non-proprietary code.
The issue also does not reproduce by simply creating huge numbers of queues; it seems very timing-dependent.
Reproduction steps
The script that I'm unfortunately unable to release at the moment attempts to simulate our system's behaviour:
Please note the above is significantly higher load than our production system is subjected to.
With this script I am usually able to get queues into this state within a few hours.
I was also unable to reproduce it under a local Kind cluster, so it may be necessary to simulate network and disk latency.
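Since the original script cannot be shared, here is a rough sketch of the kind of churn it generates, written against the Erlang AMQP client. The host, queue names, burst size and timings are placeholders rather than the reporter's actual reproduction script:

```erlang
-module(qq_churn_sketch).
-include_lib("amqp_client/include/amqp_client.hrl").
-export([run/1]).

%% Declare and delete bursts of short-lived quorum queues, loosely matching
%% the "up to 50 queue creations per second" pattern described in the report.
run(Bursts) ->
    {ok, Conn} = amqp_connection:start(#amqp_params_network{host = "localhost"}),
    {ok, Ch} = amqp_connection:open_channel(Conn),
    [burst(Ch, N) || N <- lists:seq(1, Bursts)],
    amqp_connection:close(Conn).

burst(Ch, N) ->
    Queues = [declare(Ch, N, I) || I <- lists:seq(1, 50)],
    timer:sleep(1000),
    [#'queue.delete_ok'{} = amqp_channel:call(Ch, #'queue.delete'{queue = Q})
     || Q <- Queues],
    ok.

declare(Ch, N, I) ->
    Q = list_to_binary(io_lib:format("temp-~b-~b", [N, I])),
    %% Quorum queues must be durable and are selected via the
    %% x-queue-type argument.
    #'queue.declare_ok'{} =
        amqp_channel:call(Ch, #'queue.declare'{
            queue     = Q,
            durable   = true,
            arguments = [{<<"x-queue-type">>, longstr, <<"quorum">>}]}),
    Q.
```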
Expected behavior
Queues eventually recover from this state or the client receives an error/disconnect and can try again later.
Additional context
No response