
Conversation

@LoisSotoLopez
Contributor

@LoisSotoLopez LoisSotoLopez commented Jul 16, 2025

Proposed Changes

This PR implements the suggested solution for the issue described in discussion #13131

Currently, when a QQ is deleted and re-declared while one of its nodes is dead, this dead node won't be able to reconcile with the new queue.

In this PR we add the list of Ra UIds for the cluster members to the queue record, so that when a RabbitMQ node recovers a queue it can detect the situation described above and properly reconcile.
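
To illustrate the idea (a sketch only, not the PR's actual code; all names below are made up): on recovery, a node can compare the Ra UId it has on disk with the UId stored for it in the queue record, and treat a mismatch as a queue that was deleted and re-declared while the node was down.

    %% Hypothetical sketch; names are illustrative, not the PR's implementation.
    %% LocalUId     - the Ra UId this node has on disk for the queue's Ra server
    %% UIdsInRecord - the per-node UIds carried by the queue record (#{node() => uid()})
    maybe_reconcile(QName, LocalUId, UIdsInRecord) ->
        ExpectedUId = maps:get(node(), UIdsInRecord, undefined),
        case LocalUId of
            undefined   -> ok;             %% nothing on disk, nothing to clean up
            ExpectedUId -> ok;             %% same queue incarnation, recover as usual
            _Stale      -> {stale, QName}  %% deleted and re-declared while down:
                                           %% delete the stale data dir and rejoin
        end.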

Types of Changes

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

@kjnilsson
Contributor

Thanks @LoisSotoLopez - this looks like it will do what we discussed a while back. I feel unsure about adding another key to the queue type state for this, mainly because we'd have to keep uids and nodes in sync. It would be nicer if nodes turned from a list into a map, although even this is a bit controversial and could become a source of bugs. Let me consider it for a day or two.

Contributor

@kjnilsson kjnilsson left a comment

I don't like having two keys with similar information (nodes) that will need to be kept in sync. I think we need to move the current nodes list value to the #{node() => uid()} map format and handle the two formats in the relevant places, mostly in rabbit_queue_type. We do need a list of the member nodes in a few places, but we could add a convenience function, rabbit_queue_type:nodes/1, that takes a queue record and returns a list of member nodes. Internally it could just call rabbit_queue_type:info(Q, [members]) and extract the result from that. Then update all places where we explicitly use get_type_state to extract the member nodes.
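
A rough sketch of that convenience function (an illustration only, not the final implementation; it assumes rabbit_queue_type:info/2 returns a key/value list containing a members entry):

    %% Hypothetical sketch of the suggested rabbit_queue_type:nodes/1 helper.
    -spec nodes(amqqueue:amqqueue()) -> [node()].
    nodes(Q) ->
        Infos = rabbit_queue_type:info(Q, [members]),
        proplists:get_value(members, Infos, []).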

In addition, I think we need to put the use of nodes as a map behind a feature flag, to avoid new queue records with map-format nodes values being created in a mixed-versions cluster.
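
For example, the write path could pick the representation based on the flag, roughly like this (a sketch; the flag name quorum_queue_nodes_map is made up here):

    %% Hypothetical: only produce the new map representation once the
    %% feature flag is enabled cluster-wide, so mixed-version clusters
    %% keep writing the legacy list format.
    nodes_format() ->
        case rabbit_feature_flags:is_enabled(quorum_queue_nodes_map) of
            true  -> map;   %% new #{node() => uid()} representation
            false -> list   %% legacy [node()] representation
        end.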

@LoisSotoLopez
Contributor Author

@kjnilsson Thanks for the suggestions. Just wanted to let you know we are working on this. We had an incident to take care of this week, but I'll be pushing this PR forward next week.

@michaelklishin
Collaborator

@LoisSotoLopez we have to ask your employer to sign the Broadcom CLA before we can accept this contribution (or its future finished version).

It is about one page long, nothing particularly unusual or onerous.

@LoisSotoLopez
Contributor Author

LoisSotoLopez commented Aug 1, 2025

The commit below is just to show current progress. I have been struggling to understand why a few of the remaining tests fail.

@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch from 1074165 to 35ef780 Compare August 1, 2025 09:50
@michaelklishin
Collaborator

@LoisSotoLopez sorry, any updates or feedback on the new CLA? We cannot accept any contributions without a signed CLA at the moment, and this change won't qualify for a "trivial" one, even if those get exempt in the future.

We are trying to make it digital one way or another but there's a risk that the process will stay what it currently is (just a document to sign and email).

@LoisSotoLopez
Contributor Author

LoisSotoLopez commented Aug 6, 2025

> @LoisSotoLopez sorry, any updates or feedback on the new CLA? We cannot accept any contributions without a signed CLA at the moment, and this change won't qualify for a "trivial" one, even if those get exempt in the future.
>
> We are trying to make it digital one way or another but there's a risk that the process will stay what it currently is (just a document to sign and email).

Yes, sorry about not having this addressed already. We are on summer vacation right now, so the people who will be signing it for the whole company won't be able to do it until next week. I would do it myself but can't, due to an intellectual property clause in my contract that wasn't there the last time I signed the CLA.

Will try to get it signed asap.

@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch from f48962d to 5a5877a Compare August 13, 2025 12:13
@LoisSotoLopez
Contributor Author

Hi! The CLA should be signed now. This should be ready to be reviewed again.

@michaelklishin
Collaborator

@LoisSotoLopez I'm afraid I do not see any inbound emails with the signed document. Perhaps it was not sent to teamrabbitmq <@> gmail, as the CLA repo suggests but rather straight to opensource <@> broadcom.

@LoisSotoLopez
Contributor Author

> @LoisSotoLopez I'm afraid I do not see any inbound emails with the signed document. Perhaps it was not sent to teamrabbitmq <@> gmail, as the CLA repo suggests but rather straight to opensource <@> broadcom.

Should be there now. 🫡

@michaelklishin
Collaborator

@LoisSotoLopez thank you, we should be all set in terms of the CLA.

I'll defer merging to @kjnilsson. That can take another week or two.

@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch 2 times, most recently from 95cabda to 3293a24 Compare September 29, 2025 13:06
@LoisSotoLopez
Contributor Author

Just rebased and squashed this PR. Let me know if there's anything new I can do to move it forward.

@michaelklishin
Collaborator

@kjnilsson

@michaelklishin michaelklishin added this to the 4.3.0 milestone Sep 29, 2025
@michaelklishin
Collaborator

@LoisSotoLopez given that this changes how member nodes are represented in the queue state, it is too late for us to get it into 4.2.0, and the earliest release this PR can be considered for is 4.3.0, work on which will begin in earnest in November.

@gomoripeti
Contributor

It is responsible not to rush the release of something that changes the queue record.

Because 4.3 is six months away, I wonder if there would be any benefit to still including in 4.2 the part of the PR that moves get_name from the amqqueue module to rabbit_amqqueue. That change is just a refactoring, but it touches multiple modules, while the problematic part is restricted to rabbit_quorum_queue. I'm suggesting this because it might avoid some conflicts and make it easier to backport other changes to 4.2 once the whole of this PR is merged into main but not into 4.2.x. Or, if the whole of this PR is not merged for a long time, it would make it easier to rebase onto a future main.
Just an idea - what do you think?

Contributor

@kjnilsson kjnilsson left a comment

A few more changes; this feature is a bit trickier than I had initially expected. I am happy to take over if you feel like we're going back and forth too much. Just let me know.

@michaelklishin
Collaborator

michaelklishin commented Oct 1, 2025

Test failures are not flakes. The key part is:

{badmatch,[]}

[{rabbit_queue_type,get_nodes,1,[{file,"rabbit_queue_type.erl"},{line,428}]},
 {rabbit_amqp_management,queue_topology,1,[{file,"rabbit_amqp_management.erl"},{line,474}]},
 {rabbit_amqp_management,encode_queue,3,[{file,"rabbit_amqp_management.erl"},{line,438}]}

@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch 4 times, most recently from e92b694 to 27082d8 Compare January 7, 2026 11:29
@kjnilsson
Contributor

@LoisSotoLopez there are 5 failing tests according to CI - could you please take a look?

@michaelklishin
Collaborator

Those can be flakes we've addressed earlier, so I'd make sure that the PR is rebased first.

But just to compare, I've re-triggered those QQ tests.

@LoisSotoLopez
Contributor Author

Done. I couldn't identify it from the CI output, but it was the grow_queue test in quorum_queue_SUITE that was failing. Maybe message_ttl_policy too, but that one ended up succeeding (not sure whether it was flaky or whether it passed because of my changes to fix the other test case).

@LoisSotoLopez
Contributor Author

Hmm, tests pass locally even when running with the same command as CI: make -C deps/rabbit ct-quorum_queue RABBITMQ_METADATA_STORE=khepri NON_DETERMINISTIC=1. Not sure how to identify which one is failing from the CI, but maybe you have some idea of what might be causing the inconsistency.

@michaelklishin
Collaborator

michaelklishin commented Jan 8, 2026

@LoisSotoLopez not sure if you can see the CT logs archive in the Actions build but the gmake ct-quorum_queue failures are:

  • cluster_size_3:repair_metadata_nodes_list_to_map
  • cluster_size_3:repair_metadata_nodes_added_member
  • cluster_size_3:repair_metadata_nodes_removed_member
  • cluster_size_3:repair_metadata_nodes_added_removed_member

Here is one example:

=== Location: [{erpc,call,1380},
              {quorum_queue_SUITE,repair_metadata_nodes,1628},
              {test_server,ts_tc,1796},
              {test_server,run_test_case_eval1,1305},
              {test_server,run_test_case_eval,1237}]

=== Reason: {exception,
                 {badmap,
                     ['rmq-ct-cluster_size_3-1-21162@localhost',
                      'rmq-ct-cluster_size_3-2-21216@localhost',
                      'rmq-ct-cluster_size_3-3-21270@localhost']},
                 [{maps,keys,
                      [['rmq-ct-cluster_size_3-1-21162@localhost',
                        'rmq-ct-cluster_size_3-2-21216@localhost',
                        'rmq-ct-cluster_size_3-3-21270@localhost']],
                      [{error_info,#{module => erl_stdlib_errors}}]},
                  {quorum_queue_SUITE,
                      '-repair_metadata_nodes_list_to_map/1-fun-0-',1,
                      [{file,"test/quorum_queue_SUITE.erl"},{line,1585}]},
                  {rabbit_db_queue,update_in_khepri,3,
                      [{file,"src/rabbit_db_queue.erl"},{line,677}]},
                  {rabbit_amqqueue,update,2,[]}]}

Perhaps you want to call rabbit_data_coercion:to_map/1 on some of the values in the function in question.
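
For illustration, a small guard along these lines would also avoid the badmap while the nodes value can still be in the legacy list format (the helper name is made up):

    %% Hypothetical helper: accept both the legacy list format and the new
    %% #{node() => uid()} map format before extracting member nodes.
    member_nodes(Nodes) when is_map(Nodes)  -> maps:keys(Nodes);
    member_nodes(Nodes) when is_list(Nodes) -> Nodes.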

@LoisSotoLopez
Contributor Author

LoisSotoLopez commented Jan 9, 2026

The latest test failures seem unrelated:

  • Test plugins (rabbitmq_shovel): errors like Error: Reason: {thrown,{timeout_waiting_for_deliver1,[...]}} in the CI logs, and the following one in the RabbitMQ server logs:
2026-01-09 09:13:43.325473+00:00 [error] <0.746.0> Channel error on connection <0.647.0> (127.0.0.1:48722 -> 127.0.0.1:21000, vhost: '/', user: 'guest'), channel 1:
2026-01-09 09:13:43.325473+00:00 [error] <0.746.0> operation basic.publish caused a channel exception not_found: no exchange 'test_exchange' in vhost '/'
  • Test plugins (rabbitmq_trust_store): this error in the CI logs:
fatal: unable to access 'https://github.com/rabbitmq/inet_tcp_proxy/': Failed to connect to github.com port 443 after 133427 ms: Couldn't connect to server

> not sure if you can see the CT logs archive in the Actions build but ...

@michaelklishin yep, I can. Missed the zip files in the Actions Summary page for the PR.

@michaelklishin
Collaborator

Yeah, we again have accumulated a few flakes in main. I have restarted those failed jobs.

The Selenium one requires secret access that external contributors do not have.

LoisSotoLopez and others added 14 commits January 28, 2026 13:11
Co-authored-by: Péter Gömöri <gomoripeti@users.noreply.github.com>
I thought that asserting the right membership status on all nodes would
guarantee that the other status fields would be the expected ones but it
seems like that assumption was wrong.
- Only query node uids via RPC for missing members
- Update nodes in state to map if FF is enabled
- Properly return whether queue repaired or not

Co-authored-by: Péter Gömöri <gomoripeti@users.noreply.github.com>
- data dir should only be deleted if RaUid is not undefined
  (and it does not match the UID in metadata store)
- does not need to log separately if RaUid is undefined as that is
  already logged when restart_server returns with name_not_registered
  error
@LoisSotoLopez LoisSotoLopez force-pushed the qq_uuid_in_metadata_store branch from 0efdbd5 to 768e79c Compare January 28, 2026 12:12
@LoisSotoLopez
Contributor Author

LoisSotoLopez commented Jan 29, 2026

Regarding the latest test failures:

For the ct-quorum_queue error, which happened in the force_shrink_member_to_current_member test case of quorum_queue_SUITE: it seems that when re-growing the queue after shrinking it, and after successfully detecting that the number of replicas had been reduced to 1, the queue was not able to grow because non-voter nodes were detected.

I would expect non-voter nodes to be present, particularly the node we shrank the queue to, if it has had enough time to become leader.

Several points in this regard:

  • The node we just shrank the queue to is expected to be leader by the time we grow the queue, since we wait for messages to be ready, which I feel should be enough time for it to become leader (a possible wait helper is sketched after this list).
            rabbit_ct_broker_helpers:rpc(Config, 0, rabbit_quorum_queue,
                                         force_shrink_member_to_current_member, [<<"/">>, QQ]),

            wait_for_messages_ready([Server0], RaName, 3),
            queue_utils:assert_number_of_replicas(
              Config, Server0, <<"/">>, QQ, 1),

            %% grow queues back to all nodes
            [rpc:call(Server0, rabbit_quorum_queue, grow, [S, <<"/">>, <<".*">>, all]) || S <- [Server1, Server2]],
  • The logic behind getting nodes changed slightly but we are just taking the list of nodes from the queue state, so I don't think that has affected the test results.
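
If timing is indeed the issue, a test-side wait could make that expectation explicit. A sketch only, assuming ra:members/1 is queried against the remaining server (the helper name and retry values are made up):

    %% Hypothetical test helper: poll until the Ra leader is reported on the
    %% expected server node before growing the queue again.
    wait_for_leader(_Server, _RaName, 0) ->
        error(leader_not_elected);
    wait_for_leader(Server, RaName, Retries) ->
        case ra:members({RaName, Server}) of
            {ok, _Members, {_, Server}} ->
                ok;   %% the leader is running on Server
            _ ->
                timer:sleep(100),
                wait_for_leader(Server, RaName, Retries - 1)
        end.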

@michaelklishin do you think this was just a race condition? I ran the tests locally and they all passed, so...

Edit: OK, maybe those two were flaky ones.
