Quorum queues can enter a state they cannot recover from due to a timeout #13828

Closed

michaelklishin opened this issue Apr 29, 2025 · 7 comments

@michaelklishin
Collaborator

michaelklishin commented Apr 29, 2025

Moving this back to an issue because @kjnilsson has identified something to address regardless of the quorum queue churn involved (a workload quorum queues were never designed for).

Originally filed by @matthew-s-walker.

Discussed in #13827

Originally posted by matthew-s-walker April 29, 2025

Describe the bug

Hi,

Firstly, I want to thank you for your work on RabbitMQ. It has been a rock solid core component of our system for many years.

We migrated all of our queues to the quorum queue type recently but have unfortunately encountered stability problems in our production environment.
Our system creates temporary queues, often up to 50 within a second or so, totalling roughly 20,000 per day.

After migrating, we found that within a few hours some queues (typically several created at similar times) will go into a state where:

  • Publishing to the queue fails
  • Consumers of the queue believe they are consuming but are not shown as consuming in the management UI, or they receive internal errors when attempting to consume.

The issue occurs either immediately after/during creation or within 2-3 minutes of creation.

We can reproduce the behaviour on the following versions of RabbitMQ, but the errors logged by the servers are different in at least 4.1.0:

  • 3.12.14
  • 3.13.7
  • 4.0.1
  • 4.1.0

On 4.0.1 and below, we receive various "badmatch"/"timeout" errors, which I can provide if wanted.

Our cluster setup is:

  • Kubernetes (EKS) with the cluster and topology operators
  • 3 nodes, 2 CPUs allocated, 4Gi memory
  • 10Gi EBS volumes (GP2)

Typical cluster load is < 1000 total queues, < 500 total messages per second. The vast majority of messages are < 4KiB.

The issue reproduces with:

  • Both continuous_membership_reconciliation enabled and disabled
  • channel_operation_timeout set to 60000 and set to the default
  • 3 node and single node clusters

Here is an example of a queue going into a bad state with 4.1.0 (I am happy to provide logs from earlier versions as well):

2025-04-23 10:18:48.261339+00:00 [debug] <0.5779956.0> Will start up to 3 replicas for quorum queue queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido' with leader on node '[email protected]', initial machine version 5
2025-04-23 10:18:48.628726+00:00 [info] <0.5779956.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': delivery_limit not set, defaulting to 20
2025-04-23 10:18:48.629252+00:00 [info] <0.5779956.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': delivery_limit not set, defaulting to 20
2025-04-23 10:18:48.629809+00:00 [info] <0.5779956.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': delivery_limit not set, defaulting to 20
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> ** Generic server <0.5779956.0> terminating
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> ** Last message in was {'$gen_cast',
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                            {method,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                                {'queue.declare',0,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                                    <<"91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9">>,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                                    false,true,false,false,false,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                                    [{<<"x-expires">>,signedint,7200000},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                                     {<<"x-queue-type">>,longstr,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                                      <<"quorum">>}]},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                                none,noflow}}
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> ** When Server state == {ch,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                          {conf,running,rabbit_framing_amqp_0_9_1,1,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                           <0.5779924.0>,<0.5779953.0>,<0.5779924.0>,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                           <<"10.47.104.51:3215 -> 10.47.88.201:5672">>,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                           {user,<<"xxxxxxx">>,[],
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                            [{rabbit_auth_backend_oauth2,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                              #Fun<rabbit_auth_backend_oauth2.5.36627292>}]},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                           <<"lido">>,<<>>,<0.5779934.0>,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                           [{<<"connection.blocked">>,bool,false},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                            {<<"consumer_cancel_notify">>,bool,true}],
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                           none,0,1800000,#{},infinity,1000000000},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                          {lstate,<0.5779955.0>,false},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                          none,1,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                          {0,[],[]},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                          undefined,#{},#{},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                          {state,fine,5000,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                           #Ref<0.1001428202.3157000194.108089>},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                          true,1,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                          {rabbit_confirms,undefined,#{}},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                          [],[],none,flow,[],
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                          {rabbit_queue_type,#{}},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                          #Ref<0.1001428202.3157000194.108085>,false}
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> ** Reason for termination ==
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0> ** {{exception,partition_parallel_timeout},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>     [{erpc,call,5,[{file,"erpc.erl"},{line,264}]},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>      {rabbit_queue_type_util,erpc_call,5,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                              [{file,"rabbit_queue_type_util.erl"},{line,81}]},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>      {rabbit_quorum_queue,start_cluster,1,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                           [{file,"rabbit_quorum_queue.erl"},{line,283}]},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>      {rabbit_channel,handle_method,6,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                      [{file,"rabbit_channel.erl"},{line,2437}]},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>      {rabbit_channel,handle_method,3,
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                      [{file,"rabbit_channel.erl"},{line,1564}]},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>      {rabbit_channel,handle_cast,2,[{file,"rabbit_channel.erl"},{line,615}]},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>      {gen_server2,handle_msg,2,[{file,"gen_server2.erl"},{line,1056}]},
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,329}]}]}
2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>   crasher:
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     initial call: rabbit_channel:init/1
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     pid: <0.5779956.0>
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     registered_name: []
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     exception exit: {{exception,partition_parallel_timeout},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                      [{erpc,call,5,[{file,"erpc.erl"},{line,264}]},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                       {rabbit_queue_type_util,erpc_call,5,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                           [{file,"rabbit_queue_type_util.erl"},{line,81}]},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                       {rabbit_quorum_queue,start_cluster,1,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                           [{file,"rabbit_quorum_queue.erl"},{line,283}]},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                       {rabbit_channel,handle_method,6,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                           [{file,"rabbit_channel.erl"},{line,2437}]},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                       {rabbit_channel,handle_method,3,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                           [{file,"rabbit_channel.erl"},{line,1564}]},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                       {rabbit_channel,handle_cast,2,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                           [{file,"rabbit_channel.erl"},{line,615}]},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                       {gen_server2,handle_msg,2,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                           [{file,"gen_server2.erl"},{line,1056}]},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                       {proc_lib,init_p_do_apply,3,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                           [{file,"proc_lib.erl"},{line,329}]}]}
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>       in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     ancestors: [<0.5779951.0>,<0.5779948.0>,<0.5779909.0>,<0.5779925.0>,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                   <0.874.0>,<0.873.0>,<0.872.0>,<0.870.0>,<0.869.0>,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                   rabbit_sup,<0.216.0>]
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     message_queue_len: 5
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     messages: [{<0.5780268.0>,true},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                   {'DOWN',#Ref<0.1001428202.3157000194.108819>,process,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                           <0.5780268.0>,normal},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                   emit_stats,emit_gen_server2_stats,tick]
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     links: [<0.5779951.0>]
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     dictionary: [{rand_seed,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                       {#{max => 288230376151711743,type => exsplus,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                          next => #Fun<rand.5.40079776>,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                          jump => #Fun<rand.3.40079776>},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                        [161353397928511601|287143983225069820]}},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                   {permission_cache,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                       [{{resource,<<"lido">>,queue,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                             <<"91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9">>},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                         #{},configure}]},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                   {channel_operation_timeout,60000},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                   {permission_cache_can_expire,true},
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                   {process_name,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                       {rabbit_channel,
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                           {<<"10.47.104.51:3215 -> 10.47.88.201:5672">>,1}}}]
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     trap_exit: true
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     status: running
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     heap_size: 10958
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     stack_size: 29
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>     reductions: 54575
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>   neighbours:
2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>     supervisor: {<0.5779951.0>,rabbit_channel_sup}
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>     errorContext: child_terminated
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>     reason: {{exception,partition_parallel_timeout},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>              [{erpc,call,5,[{file,"erpc.erl"},{line,264}]},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>               {rabbit_queue_type_util,erpc_call,5,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                                       [{file,"rabbit_queue_type_util.erl"},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                                        {line,81}]},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>               {rabbit_quorum_queue,start_cluster,1,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                                    [{file,"rabbit_quorum_queue.erl"},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                                     {line,283}]},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>               {rabbit_channel,handle_method,6,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                               [{file,"rabbit_channel.erl"},{line,2437}]},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>               {rabbit_channel,handle_method,3,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                               [{file,"rabbit_channel.erl"},{line,1564}]},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>               {rabbit_channel,handle_cast,2,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                               [{file,"rabbit_channel.erl"},{line,615}]},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>               {gen_server2,handle_msg,2,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                            [{file,"gen_server2.erl"},{line,1056}]},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>               {proc_lib,init_p_do_apply,3,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                         [{file,"proc_lib.erl"},{line,329}]}]}
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>     offender: [{pid,<0.5779956.0>},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                {id,channel},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                {mfargs,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                    {rabbit_channel,start_link,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                        [1,<0.5779924.0>,<0.5779953.0>,<0.5779924.0>,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                         <<"10.47.104.51:3215 -> 10.47.88.201:5672">>,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                         rabbit_framing_amqp_0_9_1,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                         {user,<<"xxxxxxx">>,[],
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                             [{rabbit_auth_backend_oauth2,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                                  #Fun<rabbit_auth_backend_oauth2.5.36627292>}]},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                         <<"lido">>,
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                         [{<<"connection.blocked">>,bool,false},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                          {<<"consumer_cancel_notify">>,bool,true}],
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                         <0.5779934.0>,<0.5779955.0>]}},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                {restart_type,transient},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                {significant,true},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                {shutdown,70000},
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>                {child_type,worker}]
2025-04-23 10:19:49.737747+00:00 [error] <0.5779951.0>
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>     supervisor: {<0.5779951.0>,rabbit_channel_sup}
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>     errorContext: shutdown
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>     reason: reached_max_restart_intensity
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>     offender: [{pid,<0.5779956.0>},
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                {id,channel},
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                {mfargs,
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                    {rabbit_channel,start_link,
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                        [1,<0.5779924.0>,<0.5779953.0>,<0.5779924.0>,
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                         <<"10.47.104.51:3215 -> 10.47.88.201:5672">>,
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                         rabbit_framing_amqp_0_9_1,
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                         {user,<<"xxxxxxx">>,[],
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                             [{rabbit_auth_backend_oauth2,
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                                  #Fun<rabbit_auth_backend_oauth2.5.36627292>}]},
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                         <<"lido">>,
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                         [{<<"connection.blocked">>,bool,false},
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                          {<<"consumer_cancel_notify">>,bool,true}],
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                         <0.5779934.0>,<0.5779955.0>]}},
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                {restart_type,transient},
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                {significant,true},
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                {shutdown,70000},
2025-04-23 10:19:49.742143+00:00 [error] <0.5779951.0>                {child_type,worker}]

server 0:

574895:2025-04-23 10:18:48.261339+00:00 [debug] <0.5779956.0> Will start up to 3 replicas for quorum queue queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido' with leader on node '[email protected]', initial machine version 5
574919:2025-04-23 10:18:48.628726+00:00 [info] <0.5779956.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': delivery_limit not set, defaulting to 20
574920:2025-04-23 10:18:48.629252+00:00 [info] <0.5779956.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': delivery_limit not set, defaulting to 20
574921:2025-04-23 10:18:48.629809+00:00 [info] <0.5779956.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': delivery_limit not set, defaulting to 20
574939:2025-04-23 10:18:49.378959+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': ra_log:init recovered last_index_term {0,0} snapshot_index_term {-1,-1}, last_written_index_term {0,0}
574959:2025-04-23 10:18:49.726729+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': post_init -> recover in term: 0 machine version: 5, last applied 0
574960:2025-04-23 10:18:49.726869+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovering state machine version 5:5 from index 0 to 0
574961:2025-04-23 10:18:49.727319+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovery of state machine version 5:5 from index 0 to 0 took 0ms
574962:2025-04-23 10:18:49.727468+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': scanning for cluster changes 1:0
574963:2025-04-23 10:18:49.727641+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recover -> recovered in term: 0 machine version: 5, last applied 0
574964:2025-04-23 10:18:49.727818+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovered -> follower in term: 0 machine version: 5, last applied 0
592793:2025-04-23 10:19:49.734078+00:00 [error] <0.5779956.0>                                    <<"91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9">>,
592873:2025-04-23 10:19:49.734973+00:00 [error] <0.5779956.0>                             <<"91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9">>},
595996:2025-04-23 10:19:54.884806+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': leader call - leader not known. Command will be forwarded once leader is known.
669352:2025-04-23 10:23:14.944921+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': leader call - leader not known. Command will be forwarded once leader is known.
678457:2025-04-23 10:24:49.023934+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': follower received unhandled msg: {ra_log_event,{segments,[{...}],[...]}}
713621:2025-04-23 10:26:29.950377+00:00 [error] <0.5914313.0>  operation basic.consume caused a connection exception internal_error: "timed out consuming from quorum queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': {'lido_91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9',\n                                                                                                                                            '[email protected]'}"
714189:2025-04-23 10:26:34.982961+00:00 [debug] <0.5780275.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': leader call - leader not known. Command will be forwarded once leader is known.

server 1:

628440:2025-04-23 10:18:49.202017+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': ra_log:init recovered last_index_term {0,0} snapshot_index_term {-1,-1}, last_written_index_term {0,0}
628461:2025-04-23 10:18:49.262588+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': post_init -> recover in term: 0 machine version: 5, last applied 0
628462:2025-04-23 10:18:49.266395+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovering state machine version 5:5 from index 0 to 0
628463:2025-04-23 10:18:49.266568+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovery of state machine version 5:5 from index 0 to 0 took 0ms
628464:2025-04-23 10:18:49.267539+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': scanning for cluster changes 1:0
628465:2025-04-23 10:18:49.267793+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recover -> recovered in term: 0 machine version: 5, last applied 0
628466:2025-04-23 10:18:49.268106+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovered -> follower in term: 0 machine version: 5, last applied 0
677717:2025-04-23 10:22:04.886201+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': leader call - leader not known. Command will be forwarded once leader is known.
754750:2025-04-23 10:25:20.161779+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': follower received unhandled msg: {ra_log_event,{segments,[{...}],[...]}}
754983:2025-04-23 10:25:24.949112+00:00 [debug] <0.5721677.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': leader call - leader not known. Command will be forwarded once leader is known.

server 2:

648735:2025-04-23 10:19:50.911784+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': ra_log:init recovered last_index_term {0,0} snapshot_index_term {-1,-1}, last_written_index_term {0,0}
648803:2025-04-23 10:19:51.159007+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': post_init -> recover in term: 0 machine version: 5, last applied 0
648815:2025-04-23 10:19:51.167261+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovering state machine version 5:5 from index 0 to 0
648840:2025-04-23 10:19:51.329633+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovery of state machine version 5:5 from index 0 to 0 took 0ms
648938:2025-04-23 10:19:51.367773+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': scanning for cluster changes 1:0
648946:2025-04-23 10:19:51.397766+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recover -> recovered in term: 0 machine version: 5, last applied 0
648955:2025-04-23 10:19:51.398296+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': recovered -> follower in term: 0 machine version: 5, last applied 0
662995:2025-04-23 10:20:59.884392+00:00 [debug] <0.5797920.0> queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': leader call - leader not known. Command will be forwarded once leader is known.
704973:2025-04-23 10:23:09.886627+00:00 [error] <0.5830650.0>  operation basic.consume caused a connection exception internal_error: "timed out consuming from quorum queue '91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9' in vhost 'lido': {'lido_91ee5486-d86a-4a43-9be9-63689f2b55fa.request-405493a5-cb34-4fc2-aeda-f2f7df97abf9',\n                                                                                                                                            '[email protected]'}"

I have attempted to create a reproducer program, but unfortunately I'm currently struggling to trigger the issue with non-proprietary code.
The issue also does not reproduce by just creating huge numbers of queues; it seems very timing-dependent.

Reproduction steps

The script that I'm unfortunately unable to release at the moment attempts to simulate our system's behaviour:

  • In parallel (16x concurrency):
    • Create 5-50 quorum queues (random count), each using a separate connection, and wait for all declarations to complete
    • Consume from all the queues, also on separate connections
    • Using a single connection, publish a message to all of the queues over a 5 second period
    • Consumers disconnect after receiving a message
    • If a consumer does not receive a message within 2 minutes, then we have encountered the broken state
    • Delete all the queues, also using separate connections
    • Sleep for 0 to 5 seconds, then repeat

Please note the above is significantly higher load than our production system is subjected to.

With this script I am usually able to get queues into this state within a few hours.
I was also unable to reproduce it under a local Kind cluster, so it may be necessary to simulate network and disk latency.
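
For illustration only, here is a minimal sketch of this kind of declare/publish/delete churn loop, written against the Erlang AMQP 0-9-1 client (amqp_client). It is not the reporter's proprietary script: it uses a single connection, a fixed queue count, and no separate consumers, and the module and queue names are made up.

%% A minimal sketch of the quorum queue churn pattern described above.
%% Assumptions: localhost broker, default vhost/credentials, 50 queues per
%% iteration; the real workload uses separate connections and consumers.
-module(qq_churn_sketch).
-export([run/1]).

-include_lib("amqp_client/include/amqp_client.hrl").

run(Iterations) ->
    {ok, Conn} = amqp_connection:start(#amqp_params_network{host = "localhost"}),
    {ok, Ch} = amqp_connection:open_channel(Conn),
    lists:foreach(fun(I) -> iteration(Ch, I) end, lists:seq(1, Iterations)),
    amqp_connection:close(Conn),
    ok.

iteration(Ch, I) ->
    Names = [iolist_to_binary(["qq-churn-", integer_to_list(I), "-",
                               integer_to_list(N)]) || N <- lists:seq(1, 50)],
    %% declare durable quorum queues with the same x-expires as in the report
    [#'queue.declare_ok'{} =
         amqp_channel:call(Ch, #'queue.declare'{
             queue     = Name,
             durable   = true,
             arguments = [{<<"x-queue-type">>, longstr, <<"quorum">>},
                          {<<"x-expires">>, signedint, 7200000}]})
     || Name <- Names],
    %% publish one small message to each queue via the default exchange
    [amqp_channel:cast(Ch, #'basic.publish'{routing_key = Name},
                       #amqp_msg{payload = <<"ping">>}) || Name <- Names],
    %% delete everything again, then pause up to 5 seconds before repeating
    [#'queue.delete_ok'{} =
         amqp_channel:call(Ch, #'queue.delete'{queue = Name}) || Name <- Names],
    timer:sleep(rand:uniform(5000)).

Running several such loops concurrently, against a cluster with constrained disk IOPS, would approximate the 16x-parallel workload described above.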

Expected behavior

Queues eventually recover from this state or the client receives an error/disconnect and can try again later.

Additional context

No response

@lukebakken
Collaborator

The script that I'm unfortunately unable to release at the moment

@matthew-s-walker the sooner you can, the better. That way when a fix is made, a means to test it is readily available.

That said, we can leave this as an issue, since we can handle the partition_parallel timeout better to avoid leaving a queue record and stuck queue servers behind.

@kjnilsson - would having a script to reproduce this issue speed up your resolution? Or is it just necessary to test this issue?

@kjnilsson
Contributor

Because it will be very environment-specific, I don't think a script would help. In short, we just need to handle the timeout and roll back the queue creation like we do for other errors.

@matthew-s-walker

Thanks for the quick responses all. Apologies for the delay, it's been very busy.

You're right on the money spotting the IOPS issue: I can confirm that the queue creation stress test ran at ~35 queue creations/deletions per second for 2 days straight without any issues after switching the volumes over to GP3.

I'll note again that this is significantly higher queue attrition than our production system experiences; I was just stating the absolute worst-case scenario and the configuration required to actually reproduce the issue outside of our production environment. Typical queue creation rates are <1/s.

I suspect that we will wait for whichever release this fix eventually goes into before switching back to quorum queues, as we wouldn't want to risk a similar repeat incident. Unfortunately, non-highly-available queues are no longer acceptable for our use case - the repercussions of the node failures that we've experienced recently are too great.

@kjnilsson In terms of testing any fixes, it's pretty easy for me to get our system back into the original configuration, and I'd be very happy to run the test against a pre-release container image or do manual container builds if that process is straightforward. I imagine it's possible that there are other areas which need extra timeout handling as well. Please let me know if you end up prioritising this ticket and would like me to do that.

@michaelklishin
Collaborator Author

I am clarifying with @kjnilsson where specifically we want those timeouts to be handled.

@michaelklishin
Collaborator Author

michaelklishin commented May 14, 2025

So the place that needs to handle an error and delete (roll back) QQ cluster member placement is https://github.com/rabbitmq/ra/blob/ae8cbf2de8325d9665d3789b1d4817f5ddee60cb/src/ra.erl#L396, and to make that more feasible, the ra_lib:partition_parallel/2 return value should change to something like

{ok, Started, NotStarted} | {error, timeout}

instead of throwing an error.
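
A hypothetical caller-side sketch of that contract is below; the argument order of ra_lib:partition_parallel/2, the start fun, and the start_member/1 and delete_member/1 helpers are illustrative assumptions, not actual ra.erl code.

%% Hypothetical handling of the proposed return shape; start_member/1 and
%% delete_member/1 are assumed helpers used only for illustration.
start_replicas(ServerIds) ->
    StartFun = fun(ServerId) -> ok =:= start_member(ServerId) end,
    case ra_lib:partition_parallel(StartFun, ServerIds) of
        {ok, Started, []} ->
            {ok, Started};
        {ok, Started, NotStarted} ->
            %% partial start: roll back the members that did start so no
            %% queue record or stuck queue server processes are left behind
            lists:foreach(fun delete_member/1, Started),
            {error, {not_started, NotStarted}};
        {error, timeout} ->
            {error, timeout}
    end.

The point is that a partial or timed-out start becomes an ordinary error value the caller can roll back from, rather than a partition_parallel_timeout exception that crashes the channel process, as seen in the logs above.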

rabbitmq/ra#539 describes the changes in Ra.

@michaelklishin
Collaborator Author

rabbitmq/ra#540 is in; we just need to produce a new Ra release and bump it.

@michaelklishin added this to the 4.1.1 milestone on May 19, 2025
@michaelklishin
Collaborator Author

Ra was bumped to 2.16.9 in #13913.
