
Conversation

@LoisSotoLopez
Contributor

@LoisSotoLopez LoisSotoLopez commented Jul 16, 2025

Proposed Changes

This PR implements the suggested solution for the issue described in discussion #13131

Currently, when a QQ is deleted and re-declared while one of its nodes is dead, this dead node won't be able to reconcile with the new queue.

In this PR we add the list of Ra UIds for the cluster members to the queue record, so that when a RabbitMQ node recovers a queue it can detect the situation described above and properly reconcile.
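
To illustrate the idea (a sketch only, not the PR's actual code; all names below are made up): on recovery, a node can compare the Ra UId it has on disk with the UId stored for it in the queue record, and treat a mismatch as a queue that was deleted and re-declared while the node was down.

    %% Hypothetical sketch; names are illustrative, not the PR's implementation.
    %% LocalUId     - the Ra UId this node has on disk for the queue's Ra server
    %% UIdsInRecord - the per-node UIds carried by the queue record (#{node() => uid()})
    maybe_reconcile(QName, LocalUId, UIdsInRecord) ->
        ExpectedUId = maps:get(node(), UIdsInRecord, undefined),
        case LocalUId of
            undefined   -> ok;             %% nothing on disk, nothing to clean up
            ExpectedUId -> ok;             %% same queue incarnation, recover as usual
            _Stale      -> {stale, QName}  %% deleted and re-declared while down:
                                           %% delete the stale data dir and rejoin
        end.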

Types of Changes

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

@kjnilsson
Contributor

Thanks @LoisSotoLopez - this looks like it will do what we discussed a while back. I feel unsure about adding another key to the queue type state for this, mainly because we'd have to keep uids and nodes in sync. It would be nicer if nodes turned from a list into a map, although even this is a bit controversial and could become a source of bugs. Let me consider it for a day or two.

Contributor

@kjnilsson kjnilsson left a comment

I don't like having two keys with similar information (nodes) that will need to be kept in sync. I think we need to move the current nodes list value to the #{node() => uid()} map format and handle the two formats in the relevant places, mostly in rabbit_queue_type. We do need a list of the member nodes in a few places, but we could add a convenience function, rabbit_queue_type:nodes/1, that takes a queue record and returns a list of member nodes. Internally it could just call rabbit_queue_type:info(Q, [members]) and extract the result from that. Then update all places where we explicitly use get_type_state to extract the member nodes.
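
A rough sketch of that convenience function (an illustration only, not the final implementation; it assumes rabbit_queue_type:info/2 returns a key/value list containing a members entry):

    %% Hypothetical sketch of the suggested rabbit_queue_type:nodes/1 helper.
    -spec nodes(amqqueue:amqqueue()) -> [node()].
    nodes(Q) ->
        Infos = rabbit_queue_type:info(Q, [members]),
        proplists:get_value(members, Infos, []).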

In addition, I think we need to put the use of nodes as a map behind a feature flag, to avoid new queue records with map-format nodes values being created in a mixed-versions cluster.
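
For example, the write path could pick the representation based on the flag, roughly like this (a sketch; the flag name quorum_queue_nodes_map is made up here):

    %% Hypothetical: only produce the new map representation once the
    %% feature flag is enabled cluster-wide, so mixed-version clusters
    %% keep writing the legacy list format.
    nodes_format() ->
        case rabbit_feature_flags:is_enabled(quorum_queue_nodes_map) of
            true  -> map;   %% new #{node() => uid()} representation
            false -> list   %% legacy [node()] representation
        end.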

@LoisSotoLopez
Contributor Author

@kjnilsson Thanks for the suggestions. Just wanted to let you know we are working on this. We had an incident to take care of this week, but I'll be pushing this PR forward next week.

@michaelklishin
Collaborator

@LoisSotoLopez we have to ask your employer to sign the Broadcom CLA before we can accept this contribution (or its future finished version).

It is about one page long, nothing particularly unusual or onerous.

@LoisSotoLopez
Contributor Author

LoisSotoLopez commented Aug 1, 2025

The commit below is just to show current progress. I have been struggling to understand why a few of the remaining tests fail.

@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch from 1074165 to 35ef780 Compare August 1, 2025 09:50
@michaelklishin
Collaborator

@LoisSotoLopez sorry, any updates or feedback on the new CLA? We cannot accept any contributions without a signed CLA at the moment, and this change won't qualify for a "trivial" one, even if those get exempt in the future.

We are trying to make it digital one way or another but there's a risk that the process will stay what it currently is (just a document to sign and email).

@LoisSotoLopez
Contributor Author

LoisSotoLopez commented Aug 6, 2025

> @LoisSotoLopez sorry, any updates or feedback on the new CLA? We cannot accept any contributions without a signed CLA at the moment, and this change won't qualify for a "trivial" one, even if those get exempt in the future.
>
> We are trying to make it digital one way or another but there's a risk that the process will stay what it currently is (just a document to sign and email).

Yes, sorry about not having this addressed already. We are on summer vacation right now, so the people who will be signing it for the whole company won't be able to do it until next week. I would do it myself but can't, due to an intellectual property clause in my contract that wasn't there the last time I signed the CLA.

Will try to get it signed asap.

@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch from f48962d to 5a5877a Compare August 13, 2025 12:13
@LoisSotoLopez
Contributor Author

Hi! The CLA should be signed now. This should be ready to be reviewed again.

@michaelklishin
Collaborator

@LoisSotoLopez I'm afraid I do not see any inbound emails with the signed document. Perhaps it was not sent to teamrabbitmq <@> gmail, as the CLA repo suggests but rather straight to opensource <@> broadcom.

@LoisSotoLopez
Contributor Author

> @LoisSotoLopez I'm afraid I do not see any inbound emails with the signed document. Perhaps it was not sent to teamrabbitmq <@> gmail, as the CLA repo suggests but rather straight to opensource <@> broadcom.

Should be there now. 🫡

@michaelklishin
Collaborator

@LoisSotoLopez thank you, we should be all set in terms of the CLA.

I'll defer merging to @kjnilsson. That can take another week or two.

@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch 2 times, most recently from 95cabda to 3293a24 Compare September 29, 2025 13:06
@LoisSotoLopez
Contributor Author

Just rebased and squashed this PR. Let me know if there's anything new I can do to move it forward.

@michaelklishin
Collaborator

@kjnilsson

@michaelklishin michaelklishin added this to the 4.3.0 milestone Sep 29, 2025
@michaelklishin
Collaborator

@LoisSotoLopez given that this changes how member nodes are represented in the queue state, it is too late for us to get it into 4.2.0, and the earliest release this PR can be considered for is 4.3.0, work on which will begin in earnest in November.

@gomoripeti
Contributor

It is responsible not to rush the release of something that changes the queue record.

Because 4.3 is six months away, I wonder if there would be any benefit to still including in 4.2 the part of the PR that moves get_name from the amqqueue module to rabbit_amqqueue. That change is just a refactoring, but it touches multiple modules, while the problematic part is restricted to rabbit_quorum_queue. I'm suggesting this because it might avoid some conflicts and make it easier to backport other changes to 4.2 once the whole of this PR is merged into main but not into 4.2.x. Or, if the whole of this PR is not merged for a long time, it would make it easier to rebase onto a future main.
Just an idea - what do you think?

Contributor

@kjnilsson kjnilsson left a comment

A few more changes; this feature is a bit trickier than I had initially expected. I am happy to take over if you feel like we're going back and forth too much. Just let me know.

@michaelklishin
Collaborator

michaelklishin commented Oct 1, 2025

Test failures are not flakes. The key part is:

{badmatch,[]}

[{rabbit_queue_type,get_nodes,1,[{file,"rabbit_queue_type.erl"},{line,428}]},
 {rabbit_amqp_management,queue_topology,1,[{file,"rabbit_amqp_management.erl"},{line,474}]},
 {rabbit_amqp_management,encode_queue,3,[{file,"rabbit_amqp_management.erl"},{line,438}]}

@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch 4 times, most recently from e92b694 to 27082d8 Compare January 7, 2026 11:29
@kjnilsson
Contributor

@LoisSotoLopez there are 5 failing tests according to CI - could you please take a look?

@michaelklishin
Collaborator

Those can be flakes we've addressed earlier, so I'd make sure that the PR is rebased first.

But just to compare, I've re-triggered those QQ tests.

@LoisSotoLopez
Contributor Author

Done. I couldn't identify it from the CI output, but it was the grow_queue test in quorum_queue_SUITE that was failing. Maybe message_ttl_policy too, but that one ended up succeeding (not sure whether it was flaky or whether it passed because of my changes to fix the other test case).

@LoisSotoLopez
Contributor Author

Hmm, tests pass locally even when running with the same command as CI: make -C deps/rabbit ct-quorum_queue RABBITMQ_METADATA_STORE=khepri NON_DETERMINISTIC=1. Not sure how to identify which one is failing from the CI, but maybe you have some idea of what might be causing the inconsistency.

@michaelklishin
Collaborator

michaelklishin commented Jan 8, 2026

@LoisSotoLopez not sure if you can see the CT logs archive in the Actions build but the gmake ct-quorum_queue failures are:

  • cluster_size_3:repair_metadata_nodes_list_to_map
  • cluster_size_3:repair_metadata_nodes_added_member
  • cluster_size_3:repair_metadata_nodes_removed_member
  • cluster_size_3:repair_metadata_nodes_added_removed_member

Here is one example:

=== Location: [{erpc,call,1380},
              {quorum_queue_SUITE,repair_metadata_nodes,1628},
              {test_server,ts_tc,1796},
              {test_server,run_test_case_eval1,1305},
              {test_server,run_test_case_eval,1237}]

=== Reason: {exception,
                 {badmap,
                     ['rmq-ct-cluster_size_3-1-21162@localhost',
                      'rmq-ct-cluster_size_3-2-21216@localhost',
                      'rmq-ct-cluster_size_3-3-21270@localhost']},
                 [{maps,keys,
                      [['rmq-ct-cluster_size_3-1-21162@localhost',
                        'rmq-ct-cluster_size_3-2-21216@localhost',
                        'rmq-ct-cluster_size_3-3-21270@localhost']],
                      [{error_info,#{module => erl_stdlib_errors}}]},
                  {quorum_queue_SUITE,
                      '-repair_metadata_nodes_list_to_map/1-fun-0-',1,
                      [{file,"test/quorum_queue_SUITE.erl"},{line,1585}]},
                  {rabbit_db_queue,update_in_khepri,3,
                      [{file,"src/rabbit_db_queue.erl"},{line,677}]},
                  {rabbit_amqqueue,update,2,[]}]}

Perhaps you want to call rabbit_data_coercion:to_map/1 on some of the values in the function in question.
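
For illustration, a small guard along these lines would also avoid the badmap while the nodes value can still be in the legacy list format (the helper name is made up):

    %% Hypothetical helper: accept both the legacy list format and the new
    %% #{node() => uid()} map format before extracting member nodes.
    member_nodes(Nodes) when is_map(Nodes)  -> maps:keys(Nodes);
    member_nodes(Nodes) when is_list(Nodes) -> Nodes.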

@LoisSotoLopez
Contributor Author

LoisSotoLopez commented Jan 9, 2026

The latest test failures seem unrelated:

  • Test plugins (rabbitmq_shovel): errors like Error: Reason: {thrown,{timeout_waiting_for_deliver1,[...]}} in the CI logs, and the following one in the RabbitMQ server logs:
2026-01-09 09:13:43.325473+00:00 [error] <0.746.0> Channel error on connection <0.647.0> (127.0.0.1:48722 -> 127.0.0.1:21000, vhost: '/', user: 'guest'), channel 1:
2026-01-09 09:13:43.325473+00:00 [error] <0.746.0> operation basic.publish caused a channel exception not_found: no exchange 'test_exchange' in vhost '/'
  • Test plugins (rabbitmq_trust_store): this error in the CI logs:
fatal: unable to access 'https://github.com/rabbitmq/inet_tcp_proxy/': Failed to connect to github.com port 443 after 133427 ms: Couldn't connect to server

> not sure if you can see the CT logs archive in the Actions build but ...

@michaelklishin yep, I can. Missed the zip files in the Actions Summary page for the PR.

@michaelklishin
Collaborator

Yeah, we again have accumulated a few flakes in main. I have restarted those failed jobs.

The Selenium one requires secret access that external contributors do not have.

LoisSotoLopez and others added 14 commits January 28, 2026 13:11
Co-authored-by: Péter Gömöri <gomoripeti@users.noreply.github.com>
I thought that asserting the right membership status on all nodes would
guarantee that the other status fields would be the expected ones but it
seems like that assumption was wrong.
- Only query node uids via RPC for missing members
- Update nodes in state to map if FF is enabled
- Properly return whether queue repaired or not

Co-authored-by: Péter Gömöri <gomoripeti@users.noreply.github.com>
- data dir should only be deleted if RaUid is not undefined
  (and it does not match the UID in metadata store)
- does not need to log separately if RaUid is undefined as that is
  already logged when restart_server returns with name_not_registered
  error
@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch from 0efdbd5 to 768e79c Compare January 28, 2026 12:12
@LoisSotoLopez
Contributor Author

LoisSotoLopez commented Jan 29, 2026

Regarding the latest test failures:

For the ct-quorum_queue error, which happened in the force_shrink_member_to_current_member test case of quorum_queue_SUITE: it seems that when re-growing the queue after shrinking it, and after successfully detecting that the number of replicas had been reduced to 1, the queue was not able to grow because non-voter nodes were detected.

I would expect non-voter nodes to be present, particularly the node we shrank the queue to, if it has had enough time to become leader.

Several points in this regard:

  • The node we just shrank the queue to is expected to be leader by the time we grow the queue, since we wait for messages to be ready, which I feel should be enough time for it to become leader (a possible wait helper is sketched after this list).
            rabbit_ct_broker_helpers:rpc(Config, 0, rabbit_quorum_queue,
                                         force_shrink_member_to_current_member, [<<"/">>, QQ]),

            wait_for_messages_ready([Server0], RaName, 3),
            queue_utils:assert_number_of_replicas(
              Config, Server0, <<"/">>, QQ, 1),

            %% grow queues back to all nodes
            [rpc:call(Server0, rabbit_quorum_queue, grow, [S, <<"/">>, <<".*">>, all]) || S <- [Server1, Server2]],
  • The logic behind getting nodes changed slightly but we are just taking the list of nodes from the queue state, so I don't think that has affected the test results.
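
If timing is indeed the issue, a test-side wait could make that expectation explicit. A sketch only, assuming ra:members/1 is queried against the remaining server (the helper name and retry values are made up):

    %% Hypothetical test helper: poll until the Ra leader is reported on the
    %% expected server node before growing the queue again.
    wait_for_leader(_Server, _RaName, 0) ->
        error(leader_not_elected);
    wait_for_leader(Server, RaName, Retries) ->
        case ra:members({RaName, Server}) of
            {ok, _Members, {_, Server}} ->
                ok;   %% the leader is running on Server
            _ ->
                timer:sleep(100),
                wait_for_leader(Server, RaName, Retries - 1)
        end.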

@michaelklishin do you think this was just a race condition? I ran the tests locally and they all passed, so...

Edit: OK, maybe those two were flaky ones.
