Cluster unresponsive during shutdown of broker #5175

michjans · 2022-07-07T14:36:27Z


Hi,
We are currently using RabbitMQ version 3.9.20 with Erlang 23.3.4.10 on CentOS 7.

We have a cluster of 3 brokers with (classic) mirroring enabled using: "ha-mode":"all","ha-sync-mode":"automatic". All queues are declared with: durable=true; exclusive=false; auto-delete=true.

When the system is up and running we have an average of 200 client connections and 111 queues and all queues are mirrored to all brokers (messages published: 15/s; received 99/s).

When we shut down one broker using "systemctl stop rabbitmq-server", we notice that the broker starts going down and all its client connections are immediately closed. The clients connect to the next available broker, but the other two brokers do not respond to subsequent AMQP requests, i.e. calls to Queue Declare from the clients to these brokers get no response.

This state lasts about 3 minutes. Using 'netstat' we still see established connections between the broker process (beam.smp) that is shutting down and the other 2 brokers. After 3 minutes a timeout in systemd finally kills the broker process (beam.smp), and only then do the other 2 brokers notice that the broker is down ("node rabbit@xxxx down: connection_closed" is printed in their logs).

At the same time the clients finally get a response OK in their call to Queue Declare, and the message processing continues.

When we increased the systemd timeout to 15 minutes, the broker eventually shut down by itself after 10 minutes (see time gap):

Jul 07 12:17:40 xxxx systemd[1]: Stopping RabbitMQ broker...
Jul 07 12:17:40 xxxx rabbitmqctl[23992]: Shutting down RabbitMQ node rabbit@xxxx running at PID 17485
Jul 07 12:27:45 xxxx rabbitmq-server[17485]: Gracefully halting Erlang VM
Jul 07 12:27:45 xxxx rabbitmqctl[23992]: Waiting for PID 17485 to terminate
Jul 07 12:27:54 xxxx rabbitmqctl[23992]: RabbitMQ node rabbit@xxxx running at PID 17485 successfully shut down
Jul 07 12:27:54 xxxx systemd[1]: Stopped RabbitMQ broker.

In the RabbitMQ log file we see this (see also the time gap between 2022-07-07 12:17:40 and 2022-07-07 12:27:40):

2022-07-07 12:17:40.948252+00:00 [info] <0.8049.0> Closing all connections in vhost '/' on node 'rabbit@xxxx' because the vhost is stopping
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>     supervisor: {<0.8036.0>,rabbit_amqqueue_sup}
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>     errorContext: shutdown_error
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>     reason: killed
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>     offender: [{pid,<0.8037.0>},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                {id,rabbit_amqqueue},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                {mfargs,
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                    {rabbit_prequeue,start_link,
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                        [{amqqueue,
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                             {resource,<<"/">>,queue,
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                                 <<"XXXXXXXXXXXXXXXXXXXXXXXXXX">>},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                             true,true,none,[],<12573.4617.0>,[],[],[],
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                             [{vhost,<<"/">>},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                              {name,<<"HA-TTL">>},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                              {pattern,<<".*">>},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                              {'apply-to',<<"queues">>},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                              {definition,
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                                  [{<<"expires">>,60000},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                                   {<<"ha-mode">>,<<"all">>},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                                   {<<"ha-sync-mode">>,<<"automatic">>},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                                   {<<"message-ttl">>,30000}]},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                              {priority,0}],
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                             undefined,
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                             [{<12573.4636.0>,<12573.4617.0>}],
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                             [],live,0,[],<<"/">>,
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                             #{user => <<"admin">>},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                             rabbit_classic_queue,#{}},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                         slave,<0.8035.0>]}},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                {restart_type,intrinsic},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                {shutdown,600000},
2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>                {child_type,worker}]
2022-07-07 12:27:40.950989+00:00 [info] <0.495.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@xxxx/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2022-07-07 12:27:40.956060+00:00 [info] <0.495.0> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@xxxx/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent' is stopped
2022-07-07 12:27:40.956492+00:00 [info] <0.491.0> Stopping message store for directory '/var/lib/rabbitmq/mnesia/rabbit@xxxx/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
2022-07-07 12:27:40.959690+00:00 [info] <0.491.0> Message store for directory '/var/lib/rabbitmq/mnesia/rabbit@xxxx/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient' is stopped
2022-07-07 12:27:40.962458+00:00 [info] <0.441.0> Management plugin: to stop collect_statistics.
2022-07-07 12:27:45.970881+00:00 [notice] <0.44.0> Application rabbit exited with reason: stopped
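
To make the time gap in the log above easier to spot, here is a minimal Python sketch that scans log lines for the largest pause between consecutive timestamps. The function names (`parse_ts`, `largest_gap`) are made up for illustration, and the sample lines are shortened copies of lines from the log above; it assumes every relevant line starts with a timestamp in the format shown there.

```python
from datetime import datetime


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    parts = line.split(" ", 2)
    if len(parts) < 2:
        return None
    try:
        # e.g. "2022-07-07 12:17:40.948252+00:00"
        return datetime.fromisoformat(parts[0] + " " + parts[1])
    except ValueError:
        return None


def largest_gap(lines):
    """Return (gap_seconds, before, after) for the longest pause, or None."""
    best = None
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a leading timestamp
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, prev, ts)
        prev = ts
    return best


# Shortened copies of three lines from the log above.
sample = [
    "2022-07-07 12:17:40.948252+00:00 [info] <0.8049.0> Closing all connections in vhost '/'",
    "2022-07-07 12:27:40.949197+00:00 [error] <0.8036.0>     reason: killed",
    "2022-07-07 12:27:45.970881+00:00 [notice] <0.44.0> Application rabbit exited with reason: stopped",
]

gap, before, after = largest_gap(sample)
print(f"largest gap: {gap:.0f} s between {before.time()} and {after.time()}")
```

On the sample lines this reports a pause of about 600 seconds. That appears to line up with the `{shutdown,600000}` value in the supervisor report above (Erlang supervisor shutdown timeouts are given in milliseconds) and with `reason: killed`, which would suggest the queue process did not stop within its shutdown timeout. This is an interpretation of the log, not a confirmed diagnosis.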

Can anybody explain why the shutdown of the broker takes so long that in the end the OS (systemd) just decides to kill the process?

When we were still using RabbitMQ 3.3.5 we didn't seem to have this problem, so can this have been introduced by a later version of RabbitMQ?

Any suggestions on how we can analyse this problem further?

Thanks!

michaelklishin · 2022-07-07T15:24:01Z

Maintainer

Unlike 3.3, modern RabbitMQ versions use a pair of message stores per virtual host. If you have many of them, they will take longer to shut down.

1 reply

michjans Jul 7, 2022
Author

Thanks for your answer. This helps us to find the cause of the slow shutdown.

The problem is that the shutdown of one broker blocks the message flow through the other brokers, whereas after a hard kill the other brokers take over almost immediately.

Any hints on how we could reduce the size of the message store? Does it depend only on the number of messages, or also on the number of queues, connections, etc.?

michaelklishin · 2022-07-07T15:29:05Z

Maintainer

Enabling debug logging might help see more of the shutdown steps, e.g. for individual message stores.
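
As a rough sketch of how that could be done at runtime, the Python snippet below builds the `rabbitmqctl set_log_level debug` command and only runs it if `rabbitmqctl` is found on the PATH. The helper names (`set_log_level_command`, `apply`) are hypothetical, and `rabbit@xxxx` is the redacted node name from the logs above. A persistent alternative would be setting `log.file.level = debug` in rabbitmq.conf.

```python
import shutil
import subprocess


def set_log_level_command(level="debug", node=None):
    """Build the rabbitmqctl invocation that changes the runtime log level."""
    cmd = ["rabbitmqctl"]
    if node:
        cmd += ["-n", node]  # target a specific node, e.g. rabbit@xxxx
    cmd += ["set_log_level", level]
    return cmd


def apply(cmd):
    """Run the command if rabbitmqctl is installed; return its exit code or None."""
    if shutil.which(cmd[0]) is None:
        return None  # rabbitmqctl not available on this machine
    return subprocess.run(cmd, check=False).returncode


# Print the command rather than running it, so it can be reviewed first.
print(" ".join(set_log_level_command("debug", node="rabbit@xxxx")))
```

Calling `apply(set_log_level_command())` on the broker host before `systemctl stop rabbitmq-server` should make the shutdown steps visible in the node's log. Remember to set the level back to `info` afterwards, since debug logging is verbose.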

0 replies
