-
Posting here so this doesn't slide into the obscurity of Slack history: many companies run a dedicated team of operators who only provision and maintain the infrastructure. Operators are supposed to guarantee the broker’s stability and install guardrails for safe use, but they rarely know how the provisioned resources are consumed. Within our infrastructure, end users have access only to the HTTP API, while operators have access to the CLI and configuration. The operator’s role, and the responsibilities it must bear (and take away from the users), complicates several aspects of managing RabbitMQ for us, for example the adoption of quorum queues. Namely:
We are looking for opinions on, and approval of, two things: 1/ adding safeguards and mechanisms for QQs specifically, and 2/ cleanly separating the cloud operator role in some way. The latter is a much more open-ended problem and will perhaps be the bulk of our discussion. I see it being achievable either by dedicating the ‘admin’ role within the broker to the cloud operators, or by hoisting all operator responsibilities so that they are set directly from the config file.
Replies: 6 comments
-
This is a reasonable feature. PRs welcome. :)
I think operators need to be able to shrink queues below any minimum to ensure they can recover after "surprising" node removal events. E.g. in a 3-node cluster with a 3-node minimum you wouldn't be able to shrink down to two if a node disappears, never to come back. Unless you mean a minimum of 1, which is reasonable.
@michaelklishin and I have discussed some options for doing shrink and grow work as part of node addition / removal. So far we've focussed on asynchronous options, as we already have a Ra machine (the stream coordinator) that grows and shrinks automatically by checking periodically whether rabbit nodes have been removed and/or added. Another tricky aspect of this is how we'd decide which queues should grow and onto which nodes. For a 3-node cluster this is trivial of course, but for a larger cluster where QQs don't span all nodes we'd need additional heuristics to make membership change decisions.
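To make the shrink concern concrete, here is a minimal Python sketch of a guardrail that enforces a minimum for regular users but lets operators go below it. The names `can_shrink` and `quorum` and the `is_operator` flag are hypothetical and not part of RabbitMQ; the only fact assumed is that a Raft group needs a majority of its members online.

```python
def quorum(members: int) -> int:
    """Majority needed for a Raft group with this many members."""
    return members // 2 + 1


def can_shrink(current_members: int, minimum: int, is_operator: bool) -> bool:
    """Hypothetical guardrail: may one member be removed from a quorum queue?"""
    if current_members <= 1:
        # Removing the last member would delete the queue, not shrink it.
        return False
    if is_operator:
        # Operators may go below the minimum to recover from a lost node.
        return True
    # Regular users must stay at or above the configured minimum.
    return current_members - 1 >= minimum


# 3-node cluster, 3-member minimum, one node gone for good:
print(can_shrink(3, minimum=3, is_operator=False))  # user request is refused
print(can_shrink(3, minimum=3, is_operator=True))   # operator can still recover
```

With a minimum of 1, the same check would never block a shrink that leaves at least one member.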
-
I doubt there would be much opposition to 1 but it's not clear to me what specific suggestions you may already have. As for 2, it's a fine line to draw because not everyone runs RabbitMQ as a service. However, exposing a fraction of operator-oriented features only via |
-
Happy to hear you're already thinking about it, @kjnilsson! I'll amend that, yeah: the primary concern is declaring queues below the recommended safe minimum; we can be more lenient on shrinking. But otherwise we'll proceed to working on it. We had ideas about exposing AZ (or rather generic locality tagging) to the growth algorithm; if you or Michael give us hints on how to plug it into Ra, we'd be happy to help. And on the last point, @michaelklishin, agreed on both counts. The change has to be opt-in, and hopefully granular. My recent work on default operator policies was a part of this, and I'll be looking into adding a switch to disable access to operator policies via HTTP.
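One way the locality-tagging idea could look, as a rough Python sketch rather than anything that exists in RabbitMQ or Ra: prefer the candidate node whose zone currently holds the fewest members of the queue, then break ties by how many replicas the node already hosts. The function name `pick_growth_node` and the `zone_of` and `load` mappings are assumptions made for illustration.

```python
from collections import Counter


def pick_growth_node(members, candidates, zone_of, load):
    """Pick a node for a new replica, spreading members across zones.

    members:    nodes that already host a member of this queue
    candidates: nodes currently in the cluster
    zone_of:    node -> locality tag (hypothetical, e.g. an AZ name)
    load:       node -> number of replicas the node already hosts
    """
    eligible = [n for n in candidates if n not in members]
    if not eligible:
        return None  # the queue already spans every candidate node
    zone_counts = Counter(zone_of[m] for m in members)
    # Least-represented zone first, then least-loaded node, then name for determinism.
    return min(eligible, key=lambda n: (zone_counts[zone_of[n]], load.get(n, 0), n))


zones = {"a1": "az-a", "a2": "az-a", "b1": "az-b", "b2": "az-b", "c1": "az-c"}
print(pick_growth_node(["a1", "b1"], list(zones), zones, load={}))  # c1
```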
-
@michaelklishin @kjnilsson Thank you for such quick feedback. As Alex said, we have ideas about availability zone tagging as one mechanism for controlling which nodes a queue should grow onto. I started playing with the code for automatic 'growth', with a naive solution that works fairly well, but as pointed out, the logic for load-balancing queue membership might be difficult to run on each node addition, so perhaps a periodic check would be the way to go? One thing that struck me while looking into the current logic was that RabbitMQ does not really keep track of the queue group size, other than storing the x-quorum-initial-group-size header and, in the queue state, a list of the current nodes (which does not take into account whether the nodes are online or not). To get some kind of restore/automation mechanism for a queue to reach its desired member count, do you agree that RabbitMQ should store some kind of 'desired' queue group size, a number that would increase/decrease whenever grow/shrink is issued for a queue?
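The 'desired' size idea can be sketched in a few lines of Python. This is only an illustration of the bookkeeping being proposed, not how RabbitMQ stores queue state: the `QueueGroup` class and its method names are hypothetical, and the desired size is assumed to be seeded from x-quorum-initial-group-size.

```python
from dataclasses import dataclass, field


@dataclass
class QueueGroup:
    desired_size: int                       # assumed to start at x-quorum-initial-group-size
    members: set = field(default_factory=set)

    def grow(self, node: str) -> None:
        """An explicit grow adds a member and raises the desired size."""
        if node not in self.members:
            self.members.add(node)
            self.desired_size += 1

    def shrink(self, node: str) -> None:
        """An explicit shrink removes a member and lowers the desired size."""
        if node in self.members:
            self.members.remove(node)
            self.desired_size -= 1

    def deficit(self, online: set) -> int:
        """How many replicas a periodic check would need to add."""
        return max(self.desired_size - len(self.members & online), 0)


q = QueueGroup(desired_size=3, members={"n1", "n2", "n3"})
print(q.deficit(online={"n1", "n2"}))  # 1: n3 is a member but not online
```

A periodic check could call `deficit` with the set of online nodes and only act when the result is non-zero, which keeps the membership decision separate from individual node addition events.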
-
A certain mechanism for candidate node/replica selection is definitely needed. It would have to be a good fit for Ra but otherwise, I think there's quite a bit of support for some solution in this area on the core team.
-
Target QQ replica count is another thing we have discussed in the past. It makes sense to me. This can be configured cluster-wide, for example, or using a policy (this would make things both more dynamic and easier to adapt).
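As a sketch of how the two configuration sources might combine, assuming a policy value takes precedence over a cluster-wide default and the result is capped by the number of nodes: the policy key `target-replica-count` and the function `resolve_target` are made-up names for illustration, not existing RabbitMQ settings.

```python
def resolve_target(policy: dict, cluster_default: int, cluster_size: int) -> int:
    """Resolve the target replica count for one queue.

    policy:          effective policy definition for the queue (may be empty)
    cluster_default: cluster-wide target, e.g. from the config file
    cluster_size:    number of nodes currently in the cluster
    """
    # "target-replica-count" is a hypothetical policy key.
    target = policy.get("target-replica-count", cluster_default)
    # A queue cannot have more members than there are nodes, nor fewer than one.
    return max(1, min(target, cluster_size))


print(resolve_target({}, cluster_default=3, cluster_size=5))                           # 3
print(resolve_target({"target-replica-count": 5}, cluster_default=3, cluster_size=5))  # 5
```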