
Conversation

@ShubhamRwt
Contributor

This PR aims to introduce the self-healing feature in Strimzi. This proposal incorporates all the comments and suggestions left on the old proposal #145. It aims to utilize the auto-rebalancing feature of Strimzi to introduce self-healing.


@tomncooper tomncooper left a comment


I did a first pass. I think this is a better proposal which is more in line with how Strimzi currently uses CC.

I think you need more detail on the interaction with the current auto-rebalancing and also a clearer description of the FSM states and their transitions. I found it hard to follow the sequence you are proposing.

For the notifier, I actually think we should stop users from using custom notifiers (we could make it conditional on the full mode being set or not). As we are creating K8s resources in response to detected anomalies, users can create alerting based on those if they need it. If users do need that, then we could provide implementations of the various notifiers which extend our notifier rather than the CC one.

Member

@scholzj scholzj left a comment


I think this is going in the right direction. But I think it needs to go a bit deeper:

  • We need to establish our own terminology and not take over the Cruise Control one. There is not really any self-healing and most of the anomalies are not really anomalies.
  • If I read this proposal right, you want to focus on when the cluster is out-of-balance. That is a great start. But perhaps that should not be called mode: full? Calling it full seems confusing - does it mean that full includes scale-up / scale-down? Also, I guess in the future we would add some actual self-healing to handle broken disks or brokers. That would probably create additional modes. So maybe mode: rebalance or mode: skew or something like that would make more sense?

@ppatierno
Member

@scholzj good to know that you like the track we are on now :-)

Regarding the "full" related naming, we were just reusing the underneath mode naming for the KafkaRebalance custom resource that will be used for fixing the anomaly (a rebalance which includes the entire cluster).
This is kind of similar with the usage off add-brokers and remove-brokers we are using when auto-rebalancing on scaling.
Said that, we can fine a better mode name at higher level but still using the "full" mode at KafkaRebalance level.
Not sure about mode "rebalance" as suggested because it would be weird within a "autoRebalance" field. The "skew" suggestion could sound better. But also what about something around "goal-violation" or "fix-goal-violation" if we are focusing on such anomaly right now. Anyway, naming is difficult so let's see what others think as well.

@scholzj
Member

scholzj commented Jul 17, 2025

Regarding the "full" related naming, we were just reusing the underneath mode naming for the KafkaRebalance custom resource that will be used for fixing the anomaly (a rebalance which includes the entire cluster).
This is kind of similar with the usage off add-brokers and remove-brokers we are using when auto-rebalancing on scaling.
Said that, we can fine a better mode name at higher level but still using the "full" mode at KafkaRebalance level.
Not sure about mode "rebalance" as suggested because it would be weird within a "autoRebalance" field. The "skew" suggestion could sound better. But also what about something around "goal-violation" or "fix-goal-violation" if we are focusing on such anomaly right now. Anyway, naming is difficult so let's see what others think as well.

I do not think this works here. KafkaRebalance is essentially an imperative API (although implemented through a declarative resource). You are sending a command to the CO to do a full rebalance.

The autoRebalance section in the Kafka CR is a declarative API. You are declaring how the CO should automatically react to some situations. add-brokers and remove-brokers work well in both as they are a command as well as an event description. full IMHO does not work that well in the declarative mode because, as I said, it can be easily interpreted as full == all available options (i.e. including scale-up or scale-down). That is where the idea of skew comes from, as from my understanding in this proposal we are reacting to skew -> the skew can be a CPU imbalance, disk imbalance, etc.
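For reference, here is a minimal sketch of the existing declarative shape in the Kafka CR, with a hypothetical entry added for the new mode being discussed (the `skew` name and the template reference are only illustrative, not the agreed naming):

```yaml
spec:
  cruiseControl:
    autoRebalance:
      # Existing declarative modes reacting to scaling events
      - mode: add-brokers
      - mode: remove-brokers
      # Hypothetical entry for "the cluster is out of balance";
      # the actual mode name is exactly what this thread is debating
      - mode: skew
        template:
          name: my-rebalance-template
```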

goal-violation sounds reasonable ... but I wonder if it is too generic. I assume that the future modes ... e.g. CC's suggestion to scale-up, scale-down, bad distribution across racks, broken disks or brokers ... those are also goal violations, aren't they? But you cannot solve these by creating a KafkaRebalance. So they will need their own modes as well. That is kind of the context in which I'm trying to see the mode names.

@ShubhamRwt ShubhamRwt changed the title Added proposal for self-healing feature in operator Added proposal for auto-rebalance on imbalanced cluster feature in operator Jul 24, 2025

@tomncooper tomncooper left a comment


Ok I did another pass. I have a few questions:

  • How are you going to distinguish anomaly CMs from different Kafka clusters in the same namespace? I know it is not recommended, but users do deploy multiple Kafka clusters in the same NS.
  • You need to deal with GC'ing all these anomaly CMs in the case where a rebalance is ongoing. Do you delete them? Do you have some kind of timeout based on the detection interval?
  • It is not clear what you mean by scale up/down auto-rebalances being queued up. I assume you mean generated KafkaRebalance CRs, but it is not clear.

@nickgarvey

Chiming in as an end user - glad to see this proposal! We have been debating internally if we want to have a cronjob to issue rebalances; this is a lot better. In particular, the model of using CruiseControl's anomaly detection while issuing the rebalances through KafkaRebalance CRs seems like it will fit perfectly into our workflows.

I see discussion on how to represent the anomalies. Any solution here is fine for us, I envision we will mostly be interacting with the KafkaRebalance CR and not much with anything else.

An area that could be more explicit is the right way to stop all rebalances and not issue any more. Rebalance operations often saturate bandwidth, either disk or network, and cause major latency during producing. We often find ourselves needing to cancel them as we scale and learn our limits. It looks like we might be able to delete mode: skew on the CruiseControl CR to stop automatic rebalances, but it could be clearer.

Thanks for putting this together, excited to see this.

@ppatierno
Member

ppatierno commented Aug 19, 2025

@nickgarvey Thanks for the feedback! Usually you are able to stop the current rebalancing by applying the stop annotation on the KafkaRebalance (of course the current batch has to finish first). With auto-rebalancing, the KafkaRebalance is owned by the operator and not by the user. That's good feedback anyway, because there is no clear way for the user to stop an auto-rebalancing in progress. I think you could apply the stop annotation on the KafkaRebalance resource, but you can't delete it due to a finalizer. Then you should delete the corresponding mode within the spec.cruiseControl.autoRebalance.mode field to avoid re-triggering. It's something to think about.
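For reference, a minimal sketch of what stopping the operator-owned rebalance could look like using the existing `strimzi.io/rebalance` annotation (the resource name here is only illustrative):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  # Hypothetical name of the auto-created, operator-owned KafkaRebalance
  name: my-cluster-auto-rebalancing-imbalance
  annotations:
    # Asks the operator to stop the ongoing rebalance;
    # the current batch of partition movements still has to finish first
    strimzi.io/rebalance: stop
```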

Signed-off-by: ShubhamRwt <[email protected]>

@tomncooper tomncooper left a comment


I have had another pass, mostly just suggestions around language and grammar.

However, you need a much clearer description of what happens when anomalies are detected when a rebalance is ongoing (either auto or manually triggered). Why is that an issue, what scenarios could arise, how does having the timestamps help avoid that. Also, what exactly happens to the anomaly list, when is it updated, when is it cleared etc.


Signed-off-by: ShubhamRwt <[email protected]>
@ShubhamRwt
Contributor Author

Hello @scholzj @ppatierno @tinaselenge @kyguy @tomncooper @katheris @im-konge @see-quick @Frawless. I have addressed and resolved all the comments on the proposal. The proposal is ready for review again. It would be great if you could find some time to have another pass.


@tomncooper tomncooper left a comment


The section on the StrimziCruiseControlNotifier, particularly the CM format, needs review. It is not clear what the unfixable anomalies/goals list is actually going to contain. You also need to make clear what happens if all the detected anomalies are unfixable.

Contributor

@tinaselenge tinaselenge left a comment


Thank you for the update Shubham. It's been a while since the last time I read the proposal, and I can see quite a lot has changed. I like the proposed solution to use the auto-rebalancing and let the operator handle the healing.

Overall I'm happy with the proposal, but I think there are a few parts that need to be ironed out with more clarity.

Member

@katheris katheris left a comment


Thanks @ShubhamRwt. I've added some formatting suggestions and a couple of clarifying questions.

The only other area I wondered about was whether we should consider making use of the maintenance windows. So, for example, not triggering a rebalance unless we are within a maintenance window. What do you think @scholzj ?

Signed-off-by: ShubhamRwt <[email protected]>
Contributor

@tinaselenge tinaselenge left a comment


Overall, it looks really good to me. I left a couple of questions to clarify my understanding. Thanks!

Member

@katheris katheris left a comment


Thanks for addressing my previous comments @ShubhamRwt. I added one more related to how the user can clear the unfixable list in the ConfigMap.

I also saw a bunch of discussion about FSM and want to confirm my understanding. So from what I understand, the RebalanceOnImbalance state is basically only reachable from Idle. This is because when a new anomaly is detected, if there is a rebalance currently ongoing it might fix the anomaly, so we don't ever queue an imbalance-related rebalance, is that right?

Then if an imbalance related rebalance is ongoing and a scale up or scale down is needed we queue those in the same way we do today.

Am I understanding all of that right?

@ShubhamRwt
Contributor Author

ShubhamRwt commented Nov 26, 2025

Yes @katheris, you are right for the case where RebalanceOnImbalance is only reachable from the Idle state. If an imbalance-related rebalance is ongoing and a scale-up or scale-down is requested, we will stop the imbalance rebalance (just like we do today for a scale-up in case a scale-down is requested). So if a RebalanceOnImbalance is ongoing and any scale-up operation happens, we will stop the imbalance auto-rebalance and proceed first with the scaling operation. The order should be: scale-down operation, then scale-up, and then the imbalance rebalance. I will make this clearer in the FSM diagram explanation.

Contributor

@tinaselenge tinaselenge left a comment


Thanks @ShubhamRwt for addressing the comments. I am happy with the proposal, +1 non-binding

Member

@ppatierno ppatierno left a comment


@ShubhamRwt we are almost there ... keep up the nice work on it. I left a few comments, but once they are addressed I will approve the proposal.

Comment on lines +343 to +344
Once the warning is issued, the operator will clear the `unfixableAnomalies` list.
In case the user doesn't fix the anomaly, the `unfixableAnomalies` list will be populated again with the unfixable goal violation, and it will keep happening until the user fixes the anomaly manually.
Member


I still don't think this is right: on subsequent reconciliations, how will the operator know whether to keep the warning in the status? Shouldn't we clear the list of unfixable anomalies at the same time as clearing the fixable anomalies, so after a rebalance has happened? That way the operator knows on each reconciliation whether to leave the warning there or to remove it.

Member


I see your point @katheris.
First, let me highlight that when an anomaly has unfixable goals, there is no rebalance going to happen, so we can't clear the list "after a rebalance has happened".
At the same time, we still need to clear the anomaly from the list because, unfortunately, when the same anomaly is detected by Cruise Control later on, it will get assigned a different ID. So if we don't delete the previous one (which is actually the same anomaly but with a different ID), we'll get the ConfigMap growing indefinitely.
Right now I can't see a good solution other than not adding the warning to the Kafka CR and instead just logging in the Kafka operator that an anomaly was ignored because it is not fixable.

Member


@ppatierno I know that a rebalance won't be triggered for the unfixable anomalies, however for things like disk failures it's somewhat likely that a rebalance would be triggered by the user after the disk is replaced, so if we are going to tie clearing the list to anything then a rebalance seems an ok choice.

In terms of the ID, thanks, that's useful information I wasn't aware of. So I guess it will be the same type of violation but with a different ID? Could we just list the violation in the warning and in the ConfigMap, and when we detect a new violation remove the old one?

That still doesn't fix the problem of the warning being constantly added and removed.


Will the Warnings be removed from the Kafka CR status? I can see how we need to prevent warnings for the same violation piling up in the status.

Could we add some logic to dedup the warnings? So if a warning based on DiskDistributionGoal has been issued in the last N hours, don't post it again.
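For illustration, a deduplicated warning could end up as a single condition in the Kafka CR status keyed on the violated goal rather than the anomaly ID; this is only a rough sketch, and the reason string and values below are assumptions:

```yaml
status:
  conditions:
    - type: Warning
      status: "True"
      # Hypothetical reason: one condition per violated goal rather than per anomaly ID
      reason: UnfixableGoalViolation
      message: "The detected anomaly has unfixable DiskDistributionGoal goal."
      lastTransitionTime: "2024-12-17T09:26:04Z"
```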

Member


> so if we are going to tie clearing the list to anything then a rebalance seems an ok choice.

What if the user for any reason doesn't start a rebalancing? The operator won't clear the list, and CC will continue to report the same anomaly but with a different ID, adding it to the CM, which will grow indefinitely.

> Could we just list the violation in the warning and in the ConfigMap, and when we detect a new violation remove the old one?

I can imagine the following scenario. Anomaly ID = 1 brought goal violations A and B (not fixable). Then CC detects violation B again, but together with another goal violation C, and all are reported with anomaly ID = 2. I don't think the operator should delete anomaly ID = 1 from the ConfigMap just because the same violation B was reported by anomaly ID = 2 as well. We would lose violation A. Unless it was fixed? 🤔 It's really tricky.

## Motivation

Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the imbalances on your own.
Member


What is very time-consuming about this for larger clusters? Isn't the process exactly the same for small and large clusters when you use KafkaRebalance? It just takes longer between the steps. But that does not mean the process itself is more time-consuming.

* `CHECK` - Check the anomaly at a later time
* `IGNORE` - Ignore the anomaly

The default notifier enabled by Cruise Control is the `NoopNotifer`.
Member


Suggested change
- The default notifier enabled by Cruise Control is the `NoopNotifer`.
+ The default notifier enabled by Cruise Control is the `NoopNotifier`.

We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
We will not enable the other anomaly detection classes, related to goal violations, that would require manual interventions at the infrastructure level such as disk or broker failures.
For the latter case, it would be better to spin up new disks or to fix the issue with the broker(s) directly instead of just moving the partition replicas away from them through rebalancing.
Therefore, given the interventions required, it would be non-trivial for the Strimzi Operator to fix these failures automatically.
Member


I do not think it is a question of triviality. The main reason to spin it off into a separate proposal is that the solution for them is NOT running a rebalance.

If the user specifies a rebalance template then the goals mentioned in the rebalance template will be validated against the anomaly detection goals being used by Cruise Control.
The goals in the rebalance template should be either the same or be a subset of the anomaly detection goals being used by Cruise Control.
This is to ensure rebalances are only being run with goals that Cruise Control is checking for anomalies.
If the goals mentioned in the templates are not a subset of the configured anomaly detection goals then we will add a warning condition in the Kafka CR regarding the failed validation.
Member


So, what is the point of the template if the user is effectively unable to configure the goals?
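For context, the rebalance configuration template referenced above is itself a `KafkaRebalance` resource; a minimal sketch, assuming the existing template annotation, of how a user would pin the goals that would then be validated against the anomaly detection goals:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance-template
  annotations:
    # Marks this resource as a configuration template rather than an actual rebalance request
    strimzi.io/rebalance-template: "true"
spec:
  # Goals the generated rebalances would use; per the quoted proposal text,
  # these would have to line up with the configured anomaly detection goals
  goals:
    - DiskCapacityGoal
    - ReplicaCapacityGoal
```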

The `KafkaRebalance` has 4 modes: `full`, `add-broker`, `remove-broker` and `remove-disks` mode.
The `imbalance` mode will be mapped to the `full` mode in the generated KafkaRebalance resource, which means that the generated `KafkaRebalance` custom resource will have the mode set to `full`; within the Strimzi rebalancing operator this means calling the Cruise Control API to run a rebalancing taking all brokers into account.

The generated `KafkaRebalance` custom resource will be called `<my-cluster-name>-auto-rebalancing-imbalance-<anomalyId>`, where the `<my-cluster-name>` part comes from the `metadata.name` in the `Kafka` custom resource, `imbalance` refers to applying the rebalance to all the brokers, and the `<anomalyId>` would be retrieved from the notifier.
Member


What does the anomaly ID look like? It is important to be able to judge the maximal length of the names ...


This is a good point, there is a 253 character limit and a long cluster name (I have seen some very long ones in the wild) combined with the anomalyID could conceivably push you over that limit?

Also, I know it is unlikely but it is possible for there to be multiple anomalies in the ConfigMap. In that case which one is used for the naming?

Ultimately, is the anomalyID in the name actually helpful? Can the user use it for anything? As I understand it, we will only ever have one auto-generated KR CR of type imbalance at any one time (or could one stick around, perhaps after an error?), so why not a fixed suffix after the Kafka cluster name?
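For a rough sense of scale, assuming the anomaly ID is the UUID that Cruise Control reports (see the anomaly detector state pasted later in this thread), the generated name would carry roughly 64 characters of suffix on top of the cluster name:

```yaml
metadata:
  # "-auto-rebalancing-imbalance-" (28 characters) + UUID anomaly ID (36 characters)
  name: my-cluster-auto-rebalancing-imbalance-c2071b83-b011-4924-8fa9-8d3cb0b2ebb9
```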

In case we get an anomaly which has both fixable goals and unfixable goals, we will still `IGNORE` that anomaly.
This is because Cruise Control won't be able to generate a proposal that can fix all the configured goals for rebalance.

The created ConfigMap will persist when `imbalance` mode is enabled for auto-rebalance and will be deleted if the user decides to remove `imbalance` mode from `spec.kafka.cruiseControl.autoRebalance`.
Member


Wait, above you said it would be deleted before the rebalance starts 🤔

Contributor Author


Above I only said that the fixable anomalies list will be cleared. I don't see where I mentioned that the ConfigMap would be deleted.

Member


Well, clearly that was not really how I understood the text. That also does not solve the issue of conflicts.

Contributor Author


Can you point me to the text you are referring to? I can fix it if it sounds confusing.

Comment on lines +239 to +241
The advantages of using a single ConfigMap for all the detected anomalies versus one per anomaly are:
1. It provides a single point of reference for all the anomalies
2. When dealing with multiple Kafka clusters, the number of ConfigMap(s) created will be equal to the number of clusters deployed
Member


I get why these two advantages matter. But I think the lifecycle of the config map as you describe it below is a mess and will not work well.

If nothing else, you should properly describe it and not have it spread through the text. Maybe it would work and is just badly described. But I wonder if it is really the case and whether a more event-driven approach is needed.

Contributor Author


Why? We are only maintaining one ConfigMap per cluster. The ConfigMap will always exist unless the imbalance mode for auto-rebalance is disabled.


@scholzj I am interested in what you mean by "a more event driven approach"? The core issue is the communication of detected anomalies between the Notifier running in CC and the operator. There are many ways to solve that, the CM being one of them. I am interested in what other options there might be, we already looked at a dedicated KafkaAnomaly CR.

Member


@tomncooper While I understand the advantage of the single Config Map, I do not think it would work the way it is designed here. The worry is not just race conditions in the sense of losing some information, as I assume unfixed anomalies might be raised again at some point. But just the handling of 409 conflicts will be a mess, which you would need to handle somehow, and that somehow will likely be a failed reconciliation.

So I think it needs either ...:

  • A clearly defined ownership where either the anomalies and their state are owned by CC and it is responsible for cleaning them (the question is, does that really need the CM? Can't the REST API do this?)
  • Alternatively, if we do not think CC can manage the whole lifecycle (which might be the case, I do not feel like I have the expertise to say that for sure), then we should utilize a pattern where CC produces the events and CO consumes them in a safe fashion.
    • That could be, for example, a stream of Config Maps being produced by CC and consumed (deleted) by CO. They could for example have names based on some timestamps (<cluster-name>-cv-2025-12-16-17-08-00). This might have its own challenges (for example, we need some clear back-off interval for how often CC produces these config maps). But the CO would simply be able to delete these as it processes them.
    • Obviously, with Kafka at hand and CC already using Kafka in a similar way, one could also think about utilizing the Kafka cluster for this. For example, by having some single-partition topic and tracking offsets in the Kafka CR status. But this has its own challenges as well, I guess: not confusing users with some internal consumer groups, size on disk, keeping the right replication factor, having the operator blocked when the topic is unavailable, etc.

I do not think the dedicated CR solves this. A new CRD can maybe have the advantage of the .status section managed by the CO and the .spec section managed by CC. But even that produces 409 errors during parallel updates. But overall, I think this is more about how you create and manage the resources rather than what type of resource you have.
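To make the ConfigMap-stream idea above concrete, a rough sketch of one event ConfigMap produced by the notifier and deleted by the operator once processed (the data fields are only illustrative; values are taken from the detector state example later in this thread):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # <cluster-name>-cv-<timestamp>, following the naming suggested above
  name: my-cluster-cv-2025-12-16-17-08-00
data:
  anomalyId: c2071b83-b011-4924-8fa9-8d3cb0b2ebb9
  fixableViolatedGoals: DiskUsageDistributionGoal
  unfixableViolatedGoals: ""
```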


Maybe we could separate the access paths by having the CC notifier append new anomalies to the data in the CM; these could have a timestamp attached and would form a log of recently detected anomalies.

The operator would only ever read the CM data and when it reacts to an anomaly it attaches an annotation to the CM with the timestamp (and ID maybe) of the anomaly. The operator would use the annotation as a way to see if new anomalies have been added. The operator only ever reads the CM data and edits the annotation.

The CC notifier just appends anomalies, it would also have a GC function to clear up anomalies that are over a certain age. This has the added advantage of making it easy for a human operator to see what anomalies have been detected.
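A rough sketch of that split, with the notifier owning `.data` and the operator owning a single tracking annotation (the annotation name and log format are only illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-cluster-goal-violations
  annotations:
    # Written only by the operator: the last anomaly it has acted on (hypothetical annotation)
    strimzi.io/last-processed-anomaly: "c2071b83-b011-4924-8fa9-8d3cb0b2ebb9"
data:
  # Appended only by the notifier: a timestamped log of recently detected anomalies
  anomalies: |
    2024-12-17T09:26:00Z c2071b83-b011-4924-8fa9-8d3cb0b2ebb9 fixable=[DiskUsageDistributionGoal] unfixable=[]
```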


Also, to Jakub's point, can the CO just poll the CC REST API directly and see what anomalies have been detected (with a timestamp)? We could store a "last seen anomaly" somewhere and compare those to the list of detected anomalies.

@ShubhamRwt does the API have that kind of detail or does it have to go through the Notifier? Would we lose info like fixable vs unfixable?

Member


> does the API have that kind of detail or does it have to go through the Notifier? Would we lose info like fixable vs unfixable?

I cannot answer that, but I guess in theory the notifier can have its own REST API for this?

Member


The conflict between the notifier and the operator to update the same CM is a good point. It would be of course the same with a dedicated CR (as already pointed out by Jakub).
We could think about a "pull" based mechanism (compared to the "push" one we have right now) for the operator to get information about the detected anomalies.
Cruise Control exposes the /kafkacruisecontrol/state endpoint in order to get the anomaly_detector substate, which is described by the following OpenAPI spec: https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/resources/yaml/responses/anomalyDetectorState.yaml

The AnomalyDetails object contains pretty much what we need (with fixable and unfixable violations) with a status which would always be IGNORED in our case (set by the notifier). But it's important to notice that what you get is a recentGoalViolations list ... and IIRC from tests I ran in the past, old goal violations stay there for some time (even if I can't see a FIXED status enum, so maybe the fixed ones go away). Anyway, the operator should understand which ones are now fixed or not by running its own rebalancing. Or which ones are currently being fixed because a rebalancing is running.
Other than that, we need a way for the notifier to notify the operator to take a look at that endpoint because some anomalies were detected. At this point maybe a dedicated Kafka topic could be a solution (?).
To avoid this, the operator could pull from that endpoint on each reconciliation, making the notifier totally useless (the NoOp one is good enough just to set IGNORE for any anomaly) and having a fully pull-based mechanism from the operator perspective. But again, how could the operator understand from the "recent" list of goal violations which ones still need to be addressed and which ones have already been addressed (if they stay there for longer, as part of the history)? Also, the operator reconciliation period would not match the anomaly detector period (which is when the anomaly is detected, and a new status could be reported when the API is hit by the operator).

Member


> Other than that, we need a way for the notifier to notify the operator to take a look at that endpoint because some anomalies were detected. At this point maybe a dedicated Kafka topic could be a solution (?).

I think checking the API every reconciliation is easier. If needed, you set some backoff interval to, for example, wait at least 15 minutes after the last auto-rebalance before checking it or before acting on it. Especially if you need to give CC time to figure out what was resolved.

Contributor Author


This is what the anomaly detector state looks like:

AnomalyDetectorState: {
  selfHealingEnabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY, TOPIC_ANOMALY, MAINTENANCE_EVENT],
  selfHealingDisabled:[],
  selfHealingEnabledRatio:{BROKER_FAILURE=1.0, DISK_FAILURE=1.0, GOAL_VIOLATION=1.0, METRIC_ANOMALY=1.0, TOPIC_ANOMALY=1.0, MAINTENANCE_EVENT=1.0},
  recentGoalViolations:[{anomalyId=c2071b83-b011-4924-8fa9-8d3cb0b2ebb9, detectionDate=2024-12-17T09:26:00Z, statusUpdateDate=2024-12-17T09:26:04Z, fixableViolatedGoals=[DiskUsageDistributionGoal], unfixableViolatedGoals=[], status=FIX_STARTED}],
  recentBrokerFailures:[],
  recentMetricAnomalies:[],
  recentDiskFailures:[],
  recentTopicAnomalies:[],
  recentMaintenanceEvents:[],
  metrics:{meanTimeBetweenAnomalies:{GOAL_VIOLATION:3.07 milliseconds, BROKER_FAILURE:0.00 milliseconds, METRIC_ANOMALY:0.00 milliseconds, DISK_FAILURE:0.00 milliseconds, TOPIC_ANOMALY:0.00 milliseconds}, meanTimeToStartFix:2.89 seconds, numSelfHealingStarted:1, numSelfHealingFailedToStart:0, ongoingAnomalyDuration=0.00 milliseconds},
  ongoingSelfHealingAnomaly:None,
  balancednessScore:88.450
}

The recentGoalViolations list will have the anomalies even if they are resolved. I am not sure how we will differentiate whether the violation is fixed or not. The only thing we could know is whether a fix has started for the anomaly or not. I will try to see when this list gets cleared by running some tests, but maybe the only way to clear the list would be when the CC pod gets rolled again.

The detected <anomaly> has unfixable `DiskDistributionGoal` goal.

Once the warning is issued, the operator will clear the `unfixableAnomalies` list.
Member


I'm not sure if this works given the nature of the conditions in the Kafka CR. They need to be treated as ephemeral.

Comment on lines +360 to +361
Therefore, we will add a new metric of counter type named `anomaly_detection_metrics` which will be updated whenever an anomaly is detected by the anomaly detector. We will also add labels to it to differentiate them based on fixability and type of anomaly.
These metrics will be exposed by the operator and will be only available when the user has configured auto-rebalance on `imbalance` mode.
Member


This should have more details because it does not seem like you can provide this metric in a reliable fashion from how you treat the anomalies. Or maybe you don't sufficiently explain what the metric will show ...

Comment on lines +404 to +406
2. The `KafkaAutoRebalancingReconciler.reconcile()` will then check if there was any ConfigMap created with name `<cluster-name>-goal-violations` and whether the `fixableAnomalies` and `unfixableAnomalies` list is empty or not, then the `full` rebalance (imbalance mode) would be performed if the `unfixableAnomalies` list is empty and the `fixableAnomalies` list is not empty.
3. If a rebalance is already ongoing and more anomalies are detected, then the operator will just ignore the new anomalies and delete all the anomalies from the `fixableAnomalies` list in the ConfigMap.
4. Any anomalies that are not resolved by the ongoing rebalance will be redetected by the anomaly detector once the FSM returns to the `Idle` state.
Member


This cycle of deleting and recreating the config maps by the CO and the notifier probably needs some time-based limits? How often will the notifier recreate the CM?

Contributor Author


But we are never deleting the ConfigMap unless we disable the imbalance mode for auto-rebalance.


@tomncooper tomncooper left a comment


I have had another pass. I think the main issue is that, as Jakub highlighted, the lifecycle of the Anomaly CM is quite complicated.

I think the proposal would benefit from a description of the states the ACM will move through and what happens under different scenarios. When you have two processes writing to and reading from the same source concurrently, you need to make sure there aren't any race conditions that could bite you. I suspect the relatively long anomaly detection period will help here but it is worth being methodical.

I also think you may need a more sophisticated approach to the rebalance "completed with error" warnings in the Kafka CR status. You need to prevent thrashing or multiple warnings piling up for the same issue.


metadata:
  name: my-cluster-goal-violations
data:
  fixableAnomalies: |


Nit: I wonder if it would be better to use a JSON format here, rather than this nested YAML structure? It would be less susceptible to parsing issues with indentation etc.
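For example, the same content could be carried as a single JSON document under one data key (the exact structure is only a sketch):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-cluster-goal-violations
data:
  anomalies.json: |
    {
      "fixableAnomalies": [
        {
          "anomalyId": "c2071b83-b011-4924-8fa9-8d3cb0b2ebb9",
          "fixableViolatedGoals": ["DiskUsageDistributionGoal"]
        }
      ],
      "unfixableAnomalies": []
    }
```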

Member


@tomncooper please take a look at this comment here: #161 (comment). I tagged you and others because I think we should change the structure to be more anomaly-ID driven.

The default anomaly detection goals set by the operator are `RACK_AWARENESS_GOAL`, `MIN_TOPIC_LEADERS_PER_BROKER_GOAL`, `REPLICA_CAPACITY_GOAL`, `DISK_CAPACITY_GOAL`.
These are similar to the default goals used for KafkaRebalance if the users don't mention the rebalance goals.
If the user specifies a rebalance template then the goals mentioned in the rebalance template will be validated against the anomaly detection goals being used by Cruise Control.
The goals in the rebalance template should be either the same or be a subset of the anomaly detection goals being used by Cruise Control.

@tomncooper tomncooper Dec 16, 2025


Not subset, a superset. The goals in the KR template need to include all the anomaly detection goals but they could have other goals in there too.

Suggested change
- The goals in the rebalance template should be either the same or be a subset of the anomaly detection goals being used by Cruise Control.
+ The goals in the rebalance template should be either the same or be a superset of the anomaly detection goals being used by Cruise Control.


The `fixableAnomalies` list will state the `anomalyId` of the anomaly as well as the fixable violated goals.
The `fixableAnomalies` list will be populated by the notifier every time it detects an anomaly.
Whenever a rebalance is about to be triggered by the operator the `fixableAnomalies` list will be cleared.
This will be the case for manual rebalances triggered by the user, as well as auto-rebalance rebalances that are happening in `add-broker`, `remove-broker`, or `imbalance` mode.


I agree there needs to be more discussion of how the anomaly CM is handled. Much like the various FSM states, you need to step through the lifecycle of the CM and make sure there aren't any race conditions that could occur between the notifier (writing) and the operator (reading and clearing).

