User & Operator Roles #7028

illotum · 2023-01-24T20:45:46Z

illotum
Jan 24, 2023

Posting here to avoid sliding into the obscurity of Slack history:

Many companies choose to run a dedicated team of operators who only provision and maintain the infrastructure. Operators are supposed to guarantee the broker's stability and install guardrails for safe use, but they rarely know how the provisioned resources are consumed. Within our infrastructure, end users have access only to the HTTP API, while operators have access to the CLI and configuration.

The operator role, and the responsibilities it must bear (and take away from users), complicates several aspects of managing RabbitMQ for us. The adoption of quorum queues is one example. Namely:

Users cannot grow or shrink queues over HTTP API.
There are no safeguards against shrinking queues below the recommended minimum.
There are no automated mechanisms to restore the desired replication factor upon node replacement.

We are looking for opinions on, and approval of, two things: 1/ adding safeguards and mechanisms for QQs specifically, and 2/ cleanly separating the cloud operator role in some way. The latter is a far more open-ended problem and will perhaps be the bulk of our discussion. I see it being achievable by dedicating the 'admin' role within the broker to cloud operators, or by hoisting all operator responsibilities so that they are set directly from the config file.

Answered by kjnilsson

kjnilsson · 2023-01-25T09:08:11Z

kjnilsson
Jan 25, 2023
Maintainer

Users cannot grow or shrink queues over HTTP API.

This is a reasonable feature. PRs welcome. :)

There are no safeguards against shrinking queues below the recommended minimum.

I think operators need to be able to shrink queues below any minimum to ensure they can recover after "surprising" node removal events. E.g. in a 3-node cluster with a 3-node minimum you wouldn't be able to shrink down to two if a node disappears, never to come back. Unless you mean a minimum of 1, which is reasonable.
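To make the arithmetic behind this concrete, here is a minimal Python sketch. It is not RabbitMQ code, and the function names are hypothetical. It assumes the usual Raft majority rule, where a group of n members needs floor(n/2) + 1 of them available to make progress, and models the "minimum" as a simple guard on member removal.

```python
def majority(members: int) -> int:
    """Smallest number of members that forms a Raft majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be unavailable while a majority remains."""
    return members - majority(members)


def shrink_allowed(current: int, minimum: int) -> bool:
    """Hypothetical guard: permit removing one member only if at least
    `minimum` members would remain afterwards."""
    return current - 1 >= minimum


# A 3-member queue survives one lost member; a 2-member queue survives none.
print(tolerated_failures(3), tolerated_failures(2))  # 1 0

# With a minimum of 3, the dead member of a 3-member queue can never be removed.
print(shrink_allowed(3, minimum=3))  # False

# With a minimum of 1, the operator can still drop the dead member.
print(shrink_allowed(3, minimum=1))  # True
```

Under these assumptions, a strict minimum equal to the cluster size blocks exactly the recovery step described above, while a minimum of 1 only prevents removing the last member.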

There are no automated mechanisms to restore the desired replication factor upon node replacement.

@michaelklishin and I have discussed some options to do shrink and grow work as part of node addition / removal. So far we've focussed on asynchronous options as we already have a Ra machine (the stream coordinator) that grows and shrinks automatically by checking periodically if rabbit nodes have been removed and/or added.

Another tricky aspect of this is how we'd decide which queues should grow and onto which nodes. For a 3-node cluster this is trivial, of course, but in a larger cluster where QQs don't span all nodes we'd need additional heuristics to make membership change decisions.
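One possible heuristic, sketched in Python purely for illustration: prefer the nodes that currently host the fewest queue members. Nothing here comes from RabbitMQ or Ra; `pick_growth_targets` and its parameters are hypothetical names, and a real implementation would likely weigh more than a single load counter.

```python
def pick_growth_targets(cluster_nodes, queue_members, members_per_node, wanted):
    """Return the nodes a queue could grow onto, least loaded first.

    cluster_nodes    -- all nodes currently in the cluster
    queue_members    -- nodes that already host a member of this queue
    members_per_node -- node -> number of queue members it hosts (any queue)
    wanted           -- member count the queue should end up with
    """
    # Only nodes that do not already host this queue are candidates.
    candidates = [n for n in cluster_nodes if n not in queue_members]
    # Least loaded first; the node name breaks ties deterministically.
    candidates.sort(key=lambda n: (members_per_node.get(n, 0), n))
    missing = max(wanted - len(queue_members), 0)
    return candidates[:missing]


nodes = ["a", "b", "c", "d", "e"]
load = {"a": 10, "b": 12, "c": 7, "d": 3, "e": 7}
print(pick_growth_targets(nodes, {"a", "b"}, load, wanted=3))  # ['d']
```

In a 3-node cluster with 3-member queues the candidate list has at most one entry, which is the trivial case mentioned above. The heuristic only matters once queues span a subset of a larger cluster.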

michaelklishin · 2023-01-25T18:20:30Z

michaelklishin
Jan 25, 2023
Maintainer

I doubt there would be much opposition to 1 but it's not clear to me what specific suggestions you may already have.

As for 2, it's a fine line to draw because not everyone runs RabbitMQ as a service. However, exposing a fraction of operator-oriented features only via rabbitmq.conf and CLI tools and not the HTTP API makes sense.

illotum · 2023-01-25T18:47:40Z

illotum
Jan 25, 2023
Author

Happy to hear you're already thinking about it, @kjnilsson! I'll amend that: yes, the primary concern is declaring queues below the recommended safe minimum, and we can be more lenient on shrinking. But otherwise we'll proceed to working on it.

We had ideas about exposing AZ (or rather generic locality tagging) to the growth algorithm. If you or Michael give us hints on how to plug it into Ra, we'd be happy to help.
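As a rough illustration of what locality-aware growth could look like, here is a Python sketch. It is an assumption-laden toy, not a proposal for the Ra integration: `pick_by_locality`, the `node_tags` mapping, and the tag values are all hypothetical. Each new member goes to the candidate node whose tag is least represented among the members placed so far.

```python
from collections import Counter


def pick_by_locality(node_tags, queue_members, wanted):
    """Choose nodes for new members, spreading them across locality tags.

    node_tags     -- node -> locality tag (for example an availability zone)
    queue_members -- nodes that already host a member of this queue
    wanted        -- member count the queue should end up with
    """
    members = list(queue_members)
    chosen = []
    while len(members) + len(chosen) < wanted:
        placed = members + chosen
        # Count how many placed members each tag already holds.
        per_tag = Counter(node_tags[n] for n in placed if n in node_tags)
        candidates = [n for n in sorted(node_tags) if n not in placed]
        if not candidates:
            break  # the cluster has no more nodes to grow onto
        # Least represented tag wins; the node name breaks ties.
        chosen.append(min(candidates, key=lambda n: (per_tag[node_tags[n]], n)))
    return chosen


tags = {"n1": "az-a", "n2": "az-a", "n3": "az-b", "n4": "az-b", "n5": "az-c"}
print(pick_by_locality(tags, ["n1"], wanted=3))  # ['n3', 'n5']
```

Keeping the tag generic (a plain string per node) matches the "generic locality tagging" idea: the same function works for zones, racks, or regions.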

And for the last point, @michaelklishin, I agree on both counts. The change has to be opt-in, and hopefully granular. My recent work on default operator policies was a part of this, and I'll be looking into adding a switch to disable access to operator policies via HTTP.

SimonUnge · 2023-01-25T19:29:27Z

SimonUnge
Jan 25, 2023
Maintainer

@michaelklishin @kjnilsson Thank you for such quick feedback.

As Alex said, we have ideas about availability zone tagging as one mechanism to control which nodes a queue should grow onto. I started playing with the code for automatic 'growth', with a naive solution that works fairly well, but, as pointed out, the logic for load-balancing queue membership might be difficult to run on every node addition, so perhaps a periodic check would be the way to go?

One thing that struck me while looking into the current logic was that RabbitMQ does not really keep track of the queue group size, other than storing the x-quorum-initial-group-size queue argument and, in the queue state, a list of the current nodes (which does not take into account whether the nodes are online).

To get some kind of restore/automation mechanism for a queue to reach its desired member count, do you agree that RabbitMQ should store some kind of 'desired' queue group size, a number that would increase or decrease whenever grow or shrink is issued for a queue?
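A minimal Python sketch of that idea, to show how a stored 'desired' size would drive repair. The class and method names are hypothetical and do not reflect how RabbitMQ stores queue state. The assumption is that explicit grow and shrink operations adjust the desired size, while a node leaving the cluster does not.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class QuorumQueueRecord:
    name: str
    desired_size: int
    members: Set[str] = field(default_factory=set)

    def grow(self, node: str) -> None:
        """Explicit grow: add a member and raise the desired size."""
        if node not in self.members:
            self.members.add(node)
            self.desired_size += 1

    def shrink(self, node: str) -> None:
        """Explicit shrink: remove a member and lower the desired size."""
        if node in self.members:
            self.members.remove(node)
            self.desired_size -= 1

    def repair_plan(self, cluster_nodes: Set[str]) -> Tuple[List[str], int]:
        """Return (members on nodes that left the cluster, members to add)."""
        gone = self.members - cluster_nodes
        remaining = len(self.members) - len(gone)
        return sorted(gone), max(self.desired_size - remaining, 0)


q = QuorumQueueRecord("orders", desired_size=3, members={"a", "b", "c"})
# Node "c" was replaced by node "d": drop the stale member, add one new member.
print(q.repair_plan({"a", "b", "d"}))  # (['c'], 1)
```

Because the desired size survives node replacement in this model, a periodic check can compare it against the live membership and compute what is missing, rather than inferring intent from the initial group size.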

michaelklishin · 2023-01-25T19:39:06Z

michaelklishin
Jan 25, 2023
Maintainer

A certain mechanism for candidate node/replica selection is definitely needed. It would have to be a good fit for Ra but otherwise, I think there's quite a bit of support for some solution in this area on the core team.

michaelklishin · 2023-01-25T19:40:11Z

michaelklishin
Jan 25, 2023
Maintainer

Target QQ replica count is another thing we have discussed in the past. It makes sense to me. This can be configured cluster-wide, for example, or using a policy (this would make things both more dynamic and easier to adapt).
