Quorum queues can enter a state they cannot recover from due to a timeout #13828
Comments
@matthew-s-walker the sooner you can, the better. That way when a fix is made, a means to test it is readily available.
@kjnilsson - would having a script to reproduce this issue speed up your resolution, or is it only needed to test the fix?
Because it will be very environment-specific, I don't think a script would help. In short, we just need to handle the timeout and roll back the queue creation like we do for other errors.
Thanks for the quick responses, all. Apologies for the delay; it's been very busy. You're right on the money spotting the IOPS issue - I can confirm that the queue creation stress test ran at ~35 queue creations/deletions per second for 2 days straight without any issue after switching the volumes over to GP3. I'll note again that this is significantly higher queue attrition than our production system sees; I was just stating the absolute worst-case scenario and the configuration required to actually reproduce the issue outside of our production environment. Typical queue creation rates are <1/s.

I suspect that we will wait until whichever release this fix eventually goes into before switching back to quorum queues, as we wouldn't want to risk a similar repeat incident. Unfortunately, non-highly-available queues are no longer acceptable for our use case - the repercussions of the node failures that we've experienced recently are too great.

@kjnilsson In terms of testing any fixes, it's pretty easy for me to get our system back into the original configuration, and I'd be very happy to run the test against a pre-release container image or do manual container builds if that process is straightforward. I imagine it's possible that there are other areas which need extra timeout handling as well. Please let me know if you end up prioritising this ticket and would like me to do that.
I am clarifying with @kjnilsson where specifically we want those timeouts to be handled.
So the place that needs to handle an error and delete (roll back) the QQ cluster member placement is https://github.com/rabbitmq/ra/blob/ae8cbf2de8325d9665d3789b1d4817f5ddee60cb/src/ra.erl#L396, and to make that more feasible it should return {ok, Started, NotStarted} | {error, timeout} instead of throwing an error. rabbitmq/ra#539 describes the changes in Ra.
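For illustration, here is a minimal sketch of a caller consuming that return shape, assuming the Ra system name, cluster name, machine config and server ids are built the same way RabbitMQ builds them today. The module name and rollback strategy below are hypothetical and are not the actual rabbit_quorum_queue change:

```erlang
-module(qq_placement_sketch).
-export([place/4]).

%% Illustrative only: not the real rabbit_quorum_queue code path.
%% RaSystem, ClusterName, Machine and ServerIds are assumed to be
%% constructed by the caller.
place(RaSystem, ClusterName, Machine, ServerIds) ->
    case ra:start_cluster(RaSystem, ClusterName, Machine, ServerIds) of
        {ok, _Started, _NotStarted} ->
            ok;
        {error, timeout} ->
            %% The return proposed in rabbitmq/ra#539: instead of the error
            %% being thrown, the caller can roll back the placement by
            %% force-deleting any member that may have been started.
            _ = [(catch ra:force_delete_server(RaSystem, Id)) || Id <- ServerIds],
            {error, timeout};
        {error, Reason} ->
            %% Other errors (e.g. cluster_not_formed) get the same rollback.
            _ = [(catch ra:force_delete_server(RaSystem, Id)) || Id <- ServerIds],
            {error, Reason}
    end.
```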
Closes #539. References rabbitmq/rabbitmq-server#13828.
rabbitmq/ra#540 is in, we just need to produce a new release.
Ra was bumped to a release that includes this change.
Moving back to an issue because @kjnilsson has identified something to address regardless of the quorum queue churn (a workload they were never designed for).
Originally filed by @matthew-s-walker.
Discussed in #13827
Originally posted by matthew-s-walker April 29, 2025
Describe the bug
Hi,
Firstly, I want to thank you for your work on RabbitMQ. It has been a rock solid core component of our system for many years.
We migrated all of our queues to the quorum queue type recently but have unfortunately encountered stability problems in our production environment.
Our system creates temporary queues, often up to 50 within a second or so, totalling roughly 20,000 per day.
After migrating, we found that within a few hours some queues (typically several created at similar times) will go into a state where:
The issue either occurs immediately after/during creation or within 2-3 minutes of creation.
We can reproduce the behaviour on the following versions of RabbitMQ, but the errors logged by the servers are different in at least 4.1.0:
On 4.0.1 and below, we receive various "badmatch"/"timeout" errors, which I can provide if wanted.
Our cluster setup is:
Typical cluster load is < 1000 total queues, < 500 total messages per second. The vast majority of messages are < 4KiB.
The issue reproduces with:
Here is an example of a queue going into a bad state with 4.1.0 (I am happy to provide logs from earlier versions as well):
server 0:
server 1:
server 2:
I have attempted to create a reproducer program, but unfortunately I'm currently struggling to trigger the issue with non-proprietary code.
The issue also does not reproduce by simply creating huge numbers of queues; it seems very timing-dependent.
Reproduction steps
The script that I'm unfortunately unable to release at the moment attempts to simulate our system's behaviour:
Please note the above is significantly higher load than our production system is subjected to.
With this script I am usually able to get queues into this state within a few hours.
I was also unable to reproduce it under a local Kind cluster, so it may be necessary to simulate network and disk latency.
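Since the original script cannot be shared, here is a rough sketch of the kind of churn it generates, written against the Erlang AMQP client. The host, queue names, burst size and timings are placeholders rather than the reporter's actual reproduction script:

```erlang
-module(qq_churn_sketch).
-include_lib("amqp_client/include/amqp_client.hrl").
-export([run/1]).

%% Declare and delete bursts of short-lived quorum queues, loosely matching
%% the "up to 50 queue creations per second" pattern described in the report.
run(Bursts) ->
    {ok, Conn} = amqp_connection:start(#amqp_params_network{host = "localhost"}),
    {ok, Ch} = amqp_connection:open_channel(Conn),
    [burst(Ch, N) || N <- lists:seq(1, Bursts)],
    amqp_connection:close(Conn).

burst(Ch, N) ->
    Queues = [declare(Ch, N, I) || I <- lists:seq(1, 50)],
    timer:sleep(1000),
    [#'queue.delete_ok'{} = amqp_channel:call(Ch, #'queue.delete'{queue = Q})
     || Q <- Queues],
    ok.

declare(Ch, N, I) ->
    Q = list_to_binary(io_lib:format("temp-~b-~b", [N, I])),
    %% Quorum queues must be durable and are selected via the
    %% x-queue-type argument.
    #'queue.declare_ok'{} =
        amqp_channel:call(Ch, #'queue.declare'{
            queue     = Q,
            durable   = true,
            arguments = [{<<"x-queue-type">>, longstr, <<"quorum">>}]}),
    Q.
```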
Expected behavior
Queues eventually recover from this state or the client receives an error/disconnect and can try again later.
Additional context
No response