
Conversation

@ShubhamRwt
Contributor

This PR aims to introduce the self-healing feature in Strimzi. This proposal incorporates all the comments and suggestions left on the old proposal #145. It aims to utilize the auto-rebalancing feature of Strimzi to introduce self-healing.


@tomncooper tomncooper left a comment


I did a first pass. I think this is a better proposal which is more in line with how Strimzi currently uses CC.

I think you need more detail on the interaction with the current auto-rebalancing and also a clearer description of the FSM states and their transitions. I found it hard to follow the sequence you are proposing.

For the notifier, I actually think we should stop users from using custom notifiers (we could make it conditional on the full mode being set or not). As we are creating K8s resources in response to detected anomalies, users can create alerting based on those if they need it. If users do need that, then we could provide implementations of the various notifiers which extend our notifier rather than the CC one.

Member

@scholzj scholzj left a comment


I think this is going in the right direction. But I think it needs to go a bit deeper:

  • We need to establish our own terminology and not take over the Cruise Control one. There is not really any self-healing and most of the anomalies are not really anomalies.
  • If I read this proposal right, you want to focus on when the cluster is out-of-balance. That is a great start. But perhaps that should not be called mode: full? Calling it full seems confusing - does it mean that full includes scale-up / scale-down? Also, I guess in the future we would add some actual self-healing to handle broken disks or brokers. That would probably create additional modes. So maybe mode: rebalance or mode: skew or something like that would make more sense?

@ppatierno
Member

@scholzj good to know that you like the track we are on now :-)

Regarding the "full" related naming, we were just reusing the underneath mode naming for the KafkaRebalance custom resource that will be used for fixing the anomaly (a rebalance which includes the entire cluster).
This is kind of similar with the usage off add-brokers and remove-brokers we are using when auto-rebalancing on scaling.
Said that, we can fine a better mode name at higher level but still using the "full" mode at KafkaRebalance level.
Not sure about mode "rebalance" as suggested because it would be weird within a "autoRebalance" field. The "skew" suggestion could sound better. But also what about something around "goal-violation" or "fix-goal-violation" if we are focusing on such anomaly right now. Anyway, naming is difficult so let's see what others think as well.

@scholzj
Member

scholzj commented Jul 17, 2025

Regarding the "full" related naming, we were just reusing the underneath mode naming for the KafkaRebalance custom resource that will be used for fixing the anomaly (a rebalance which includes the entire cluster).
This is kind of similar with the usage off add-brokers and remove-brokers we are using when auto-rebalancing on scaling.
Said that, we can fine a better mode name at higher level but still using the "full" mode at KafkaRebalance level.
Not sure about mode "rebalance" as suggested because it would be weird within a "autoRebalance" field. The "skew" suggestion could sound better. But also what about something around "goal-violation" or "fix-goal-violation" if we are focusing on such anomaly right now. Anyway, naming is difficult so let's see what others think as well.

I do not think this works here. KafkaRebalance is essentially an imperative API (although implemented through a declarative resource). You are sending a command to the CO to do a full rebalance.

The autoRebalance section in the Kafka CR is a declarative API. You are declaring how the CO should automatically react to some situations. add-brokers and remove-brokers work well in both as they are a command as well as an event description. full IMHO does not work that well in the declarative mode because, as I said, it can be easily interpreted as full == all available options (i.e. including scale-up or scale-down). That is where the idea of skew comes from, as from my understanding in this proposal we are reacting to skew -> the skew can be a CPU imbalance, disk imbalance, etc.
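For reference, here is a minimal sketch of the existing declarative shape in the Kafka CR, with a hypothetical entry added for the new mode being discussed (the `skew` name and the template reference are only illustrative, not the agreed naming):

```yaml
spec:
  cruiseControl:
    autoRebalance:
      # Existing declarative modes reacting to scaling events
      - mode: add-brokers
      - mode: remove-brokers
      # Hypothetical entry for "the cluster is out of balance";
      # the actual mode name is exactly what this thread is debating
      - mode: skew
        template:
          name: my-rebalance-template
```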

goal-violation sounds reasonable ... but I wonder if it is too generic. I assume that the future modes ... e.g. CC's suggestion to scale-up, scale-down, bad distribution across racks, broken disks or brokers ... those are also goal violations, aren't they? But you cannot solve these by creating a KafkaRebalance. So they will need their own modes as well. That is kind of the context in which I'm trying to see the mode names.

@ShubhamRwt ShubhamRwt changed the title Added proposal for self-healing feature in operator Added proposal for auto-rebalance on imbalanced cluster feature in operator Jul 24, 2025

@tomncooper tomncooper left a comment


Ok I did another pass. I have a few questions:

  • How are you going to distinguish anomaly CMs from different Kafka clusters in the same namespace? I know it is not recommended, but users do deploy multiple Kafka clusters in the same NS.
  • You need to deal with GC'ing all these anomaly CMs in the case where a rebalance is ongoing. Do you delete them? Do you have some kind of timeout based on the detection interval?
  • It is not clear what you mean by scale up/down auto-rebalances being queued up. I assume you mean generated KafkaRebalance CRs, but it is not clear.

@nickgarvey

Chiming in as an end user - glad to see this proposal! We have been debating internally if we want to have a cronjob to issue rebalances; this is a lot better. In particular, the model of using CruiseControl's anomaly detection while issuing the rebalances through KafkaRebalance CRs seems like it will fit perfectly into our workflows.

I see discussion on how to represent the anomalies. Any solution here is fine for us, I envision we will mostly be interacting with the KafkaRebalance CR and not much with anything else.

An area that could be more explicit is the right way to stop all rebalances and not issue any more. Rebalance operations often saturate bandwidth, either disk or network, and cause major latency during producing. We often find ourselves needing to cancel them as we scale and learn our limits. It looks like we might be able to delete mode: skew on the CruiseControl CR to stop automatic rebalances, but it could be clearer.

Thanks for putting this together, excited to see this.

@ppatierno
Member

ppatierno commented Aug 19, 2025

@nickgarvey Thanks for the feedback! Usually you are able to stop the current rebalancing by applying the stop annotation on the KafkaRebalance (of course the current batch has to finish first). With auto-rebalancing, the KafkaRebalance is owned by the operator and not by the user. That's good feedback anyway, because there is no clear way for the user to stop an auto-rebalancing in progress. I think you could apply the stop annotation on the KafkaRebalance resource, but you can't delete it due to a finalizer. Then you should delete the corresponding mode within the spec.cruiseControl.autoRebalance.mode field to avoid re-triggering. It's something to think about.
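For reference, a minimal sketch of what stopping the operator-owned rebalance could look like using the existing `strimzi.io/rebalance` annotation (the resource name here is only illustrative):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  # Hypothetical name of the auto-created, operator-owned KafkaRebalance
  name: my-cluster-auto-rebalancing-imbalance
  annotations:
    # Asks the operator to stop the ongoing rebalance;
    # the current batch of partition movements still has to finish first
    strimzi.io/rebalance: stop
```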

Signed-off-by: ShubhamRwt <[email protected]>

@tomncooper tomncooper left a comment


I have had another pass, mostly just suggestions around language and grammar.

However, you need a much clearer description of what happens when anomalies are detected when a rebalance is ongoing (either auto or manually triggered). Why is that an issue, what scenarios could arise, how does having the timestamps help avoid that. Also, what exactly happens to the anomaly list, when is it updated, when is it cleared etc.


Signed-off-by: ShubhamRwt <[email protected]>
@ShubhamRwt
Contributor Author

Hello @scholzj @ppatierno @tinaselenge @kyguy @tomncooper @katheris @im-konge @see-quick @Frawless. I have addressed and resolved all the comments on the proposal. The proposal is ready for review again. It would be great if you could find some time to have another pass.


@tomncooper tomncooper left a comment


The section on the StrimziCruiseControlNotifier, particularly the CM format, needs review. It is not clear what the unfixable anomalies/goals list is actually going to contain. You also need to make clear what happens if all the detected anomalies are unfixable.

Contributor

@tinaselenge tinaselenge left a comment


Thank you for the update Shubham. It's been a while since the last time I read the proposal, and I can see quite a lot has changed. I like the proposed solution to use the auto-rebalancing and let the operator handle the healing.

Overall I'm happy with the proposal, but I think there are a few parts that need to be ironed out with more clarity.

Member

@katheris katheris left a comment


Thanks @ShubhamRwt. I've added some formatting suggestions and a couple of clarifying questions.

The only other area I wondered about was whether we should consider making use of the maintenance windows. So, for example, not triggering a rebalance unless we are within a maintenance window. What do you think @scholzj ?

Signed-off-by: ShubhamRwt <[email protected]>
Contributor

@tinaselenge tinaselenge left a comment


Overall, it looks really good to me. I left a couple of questions to clarify my understanding. Thanks!

Member

@katheris katheris left a comment


Thanks for addressing my previous comments @ShubhamRwt. I added one more related to how the user can clear the unfixable list in the ConfigMap.

I also saw a bunch of discussion about FSM and want to confirm my understanding. So from what I understand, the RebalanceOnImbalance state is basically only reachable from Idle. This is because when a new anomaly is detected, if there is a rebalance currently ongoing it might fix the anomaly, so we don't ever queue an imbalance-related rebalance, is that right?

Then if an imbalance related rebalance is ongoing and a scale up or scale down is needed we queue those in the same way we do today.

Am I understanding all of that right?

@ShubhamRwt
Contributor Author

ShubhamRwt commented Nov 26, 2025

Yes @katheris, you are right for the case where RebalanceOnImbalance is only reachable from the Idle state. If an imbalance-related rebalance is ongoing and a scale-up or scale-down is requested, we will stop the imbalance rebalance (just like we do today for a scale-up in case a scale-down is requested). So if a RebalanceOnImbalance is ongoing and any scale-up operation happens, we will stop the imbalance auto-rebalance and proceed first with the scaling operation. The order should be: scale-down operation, then scale-up, and then the imbalance rebalance. I will make this clearer in the FSM diagram explanation.

Contributor

@tinaselenge tinaselenge left a comment


Thanks @ShubhamRwt for addressing the comments. I am happy with the proposal, +1 non-binding

Member

@ppatierno ppatierno left a comment


@ShubhamRwt we are almost there ... keep up the nice work on it. I left a few comments, but once they are addressed I will approve the proposal.

Comment on lines +343 to +344
Once the warning is issued, the operator will clear the `unfixableAnomalies` list.
In case the user doesn't fix the anomaly, the `unfixableAnomalies` list will be populated again with the unfixable goal violation, and it will keep happening until the user fixes the anomaly manually.
Member


I still don't think this is right: on subsequent reconciliations, how will the operator know whether to keep the warning in the status? Shouldn't we clear the list of unfixable anomalies at the same time as clearing the fixable anomalies, so after a rebalance has happened? That way the operator knows on each reconciliation whether to leave the warning there or to remove it.

Member


I see your point @katheris.
First, let me highlight that when an anomaly has unfixable goals, there is no rebalance going to happen, so we can't clear the list "after a rebalance has happened".
At the same time, we still need to clear the anomaly from the list because, unfortunately, when the same anomaly is detected by Cruise Control later on, it will get assigned a different ID. So if we don't delete the previous one (which is actually the same anomaly but with a different ID), we'll get the ConfigMap growing indefinitely.
Right now I can't see a good solution other than not adding the warning to the Kafka CR and instead just logging in the Kafka operator that an anomaly was ignored because it is not fixable.

Member


@ppatierno I know that a rebalance won't be triggered for the unfixable anomalies, however for things like disk failures it's somewhat likely that a rebalance would be triggered by the user after the disk is replaced, so if we are going to tie clearing the list to anything then a rebalance seems an ok choice.

In terms of the ID, thanks, that's useful information I wasn't aware of. So I guess it will be the same type of violation but with a different ID? Could we just list the violation in the warning and in the ConfigMap, and when we detect a new violation remove the old one?

That still doesn't fix the problem of the warning being constantly added and removed.


Will the Warnings be removed from the Kafka CR status? I can see how we need to prevent warnings for the same violation piling up in the status.

Could we add some logic to dedup the warnings? So if a warning based on DiskDistributionGoal has been issued in the last N hours, don't post it again.
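For illustration, a deduplicated warning could end up as a single condition in the Kafka CR status keyed on the violated goal rather than the anomaly ID; this is only a rough sketch, and the reason string and values below are assumptions:

```yaml
status:
  conditions:
    - type: Warning
      status: "True"
      # Hypothetical reason: one condition per violated goal rather than per anomaly ID
      reason: UnfixableGoalViolation
      message: "The detected anomaly has unfixable DiskDistributionGoal goal."
      lastTransitionTime: "2024-12-17T09:26:04Z"
```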

Member


> so if we are going to tie clearing the list to anything then a rebalance seems an ok choice.

What if the user for any reason doesn't start a rebalancing? The operator won't clear the list, and CC will continue to report the same anomaly but with a different ID, adding it to the CM, which will grow indefinitely.

> Could we just list the violation in the warning and in the ConfigMap, and when we detect a new violation remove the old one?

I can imagine the following scenario. Anomaly ID = 1 brought goal violations A and B (not fixable). Then CC detects violation B again, but together with another goal violation C, and all are reported with anomaly ID = 2. I don't think the operator should delete anomaly ID = 1 from the ConfigMap just because the same violation B was reported by anomaly ID = 2 as well. We would lose violation A. Unless it was fixed? 🤔 It's really tricky.

## Motivation

Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the imbalances on your own.
Member


What is very time-consuming about this for larger clusters? Isn't the process exactly the same for small and large clusters when you use KafkaRebalance? It just takes longer between the steps. But that does not mean the process itself is more time-consuming.

* `CHECK` - Check the anomaly at a later time
* `IGNORE` - Ignore the anomaly

The default notifier enabled by Cruise Control is the `NoopNotifer`.
Member


Suggested change
- The default notifier enabled by Cruise Control is the `NoopNotifer`.
+ The default notifier enabled by Cruise Control is the `NoopNotifier`.

We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
We will not enable the other anomaly detection classes, related to goal violations, that would require manual interventions at the infrastructure level such as disk or broker failures.
For the latter case, it would be better to spin up new disks or to fix the issue with the broker(s) directly instead of just moving the partition replicas away from them through rebalancing.
Therefore, given the interventions required, it would be non-trivial for the Strimzi Operator to fix these failures automatically.
Member


I do not think it is a question of triviality. The main reason to spin it off into a separate proposal is that the solution for them is NOT running a rebalance.

If the user specifies a rebalance template then the goals mentioned in the rebalance template will be validated against the anomaly detection goals being used by Cruise Control.
The goals in the rebalance template should be either the same or be a subset of the anomaly detection goals being used by Cruise Control.
This is to ensure rebalances are only being run with goals that Cruise Control is checking for anomalies.
If the goals mentioned in the templates are not a subset of the configured anomaly detection goals then we will add a warning condition in the Kafka CR regarding the failed validation.
Member


So, what is the point of the template if the user is effectively unable to configure the goals?
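For context, the rebalance configuration template referenced above is itself a `KafkaRebalance` resource; a minimal sketch, assuming the existing template annotation, of how a user would pin the goals that would then be validated against the anomaly detection goals:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance-template
  annotations:
    # Marks this resource as a configuration template rather than an actual rebalance request
    strimzi.io/rebalance-template: "true"
spec:
  # Goals the generated rebalances would use; per the quoted proposal text,
  # these would have to line up with the configured anomaly detection goals
  goals:
    - DiskCapacityGoal
    - ReplicaCapacityGoal
```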

The `KafkaRebalance` has 4 modes: `full`, `add-broker`, `remove-broker` and `remove-disks` mode.
The `imbalance` mode will be mapped to the `full` mode in the generated KafkaRebalance resource, which means that the generated `KafkaRebalance` custom resource will have the mode set to `full`; within the Strimzi rebalancing operator this means calling the Cruise Control API to run a rebalancing taking all brokers into account.

The generated `KafkaRebalance` custom resource will be called `<my-cluster-name>-auto-rebalancing-imbalance-<anomalyId>`, where the `<my-cluster-name>` part comes from the `metadata.name` in the `Kafka` custom resource, `imbalance` refers to applying the rebalance to all the brokers, and the `<anomalyId>` would be retrieved from the notifier.
Member


What does the anomaly ID look like? It is important to be able to judge the maximal length of the names ...


This is a good point, there is a 253 character limit and a long cluster name (I have seen some very long ones in the wild) combined with the anomalyID could conceivably push you over that limit?

Also, I know it is unlikely but it is possible for there to be multiple anomalies in the ConfigMap. In that case which one is used for the naming?

Ultimately, is the anomalyID in the name actually helpful? Can the user use it for anything? As I understand it, we will only ever have one auto-generated KR CR of type imbalance at any one time (or could one stick around, perhaps after an error?), so why not a fixed suffix after the Kafka cluster name?
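For a rough sense of scale, assuming the anomaly ID is the UUID that Cruise Control reports (see the anomaly detector state pasted later in this thread), the generated name would carry roughly 64 characters of suffix on top of the cluster name:

```yaml
metadata:
  # "-auto-rebalancing-imbalance-" (28 characters) + UUID anomaly ID (36 characters)
  name: my-cluster-auto-rebalancing-imbalance-c2071b83-b011-4924-8fa9-8d3cb0b2ebb9
```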

In case we get an anomaly which has both fixable goals and unfixable goals, we will still `IGNORE` that anomaly.
This is because Cruise Control won't be able to generate a proposal that can fix all the configured goals for rebalance.

The created ConfigMap will persist when `imbalance` mode is enabled for auto-rebalance and will be deleted if the user decides to remove `imbalance` mode from `spec.kafka.cruiseControl.autoRebalance`.
Member


Wait, above you said it would be deleted before the rebalance starts 🤔

Contributor Author


Above I only said that the fixable anomalies list will be cleared. I don't see where I mentioned that the ConfigMap would be deleted.

Member


Well, clearly that was not really how I understood the text. That also does not solve the issue of conflicts.

Contributor Author


Can you point me to the text you are referring to? I can fix it if it sounds confusing.

Comment on lines +239 to +241
The advantages of using a single ConfigMap for all the detected anomalies versus one per anomaly are:
1. It provides a single point of reference for all the anomalies
2. When dealing with multiple Kafka clusters, the number of ConfigMap(s) created will be equal to the number of clusters deployed
Member


I get why these two advantages matter. But I think the lifecycle of the config map as you describe it below is a mess and will not work well.

If nothing else, you should properly describe it and not have it spread through the text. Maybe it would work and is just badly described. But I wonder if it is really the case and whether a more event-driven approach is needed.

Contributor Author


Why? We are only maintaining one ConfigMap per cluster. The ConfigMap will always exist unless the imbalance mode for auto-rebalance is disabled.


@scholzj I am interested in what you mean by "a more event driven approach"? The core issue is the communication of detected anomalies between the Notifier running in CC and the operator. There are many ways to solve that, the CM being one of them. I am interested in what other options there might be, we already looked at a dedicated KafkaAnomaly CR.

Member


@tomncooper While I understand the advantage of the single Config Map, I do not think it would work the way it is designed here. The worry is not just race conditions in the sense of losing some information, as I assume unfixed anomalies might be raised again at some point. But just the handling of 409 conflicts will be a mess, which you would need to handle somehow, and that somehow will likely be a failed reconciliation.

So I think it needs either ...:

  • A clearly defined ownership where either the anomalies and their state are owned by CC and it is responsible for cleaning them (the question is, does that really need the CM? Can't the REST API do this?)
  • Alternatively, if we do not think CC can manage the whole lifecycle (which might be the case, I do not feel like I have the expertise to say that for sure), then we should utilize a pattern where CC produces the events and CO consumes them in a safe fashion.
    • That could be, for example, a stream of Config Maps being produced by CC and consumed (deleted) by CO. They could for example have names based on some timestamps (<cluster-name>-cv-2025-12-16-17-08-00). This might have its own challenges (for example, we need some clear back-off interval for how often CC produces these config maps). But the CO would simply be able to delete these as it processes them.
    • Obviously, with Kafka at hand and CC already using Kafka in a similar way, one could also think about utilizing the Kafka cluster for this. For example, by having some single-partition topic and tracking offsets in the Kafka CR status. But this has its own challenges as well, I guess: not confusing users with some internal consumer groups, size on disk, keeping the right replication factor, having the operator blocked when the topic is unavailable, etc.

I do not think the dedicated CR solves this. A new CRD can maybe have the advantage of the .status section managed by the CO and the .spec section managed by CC. But even that produces 409 errors during parallel updates. But overall, I think this is more about how you create and manage the resources rather than what type of resource you have.
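To make the ConfigMap-stream idea above concrete, a rough sketch of one event ConfigMap produced by the notifier and deleted by the operator once processed (the data fields are only illustrative; values are taken from the detector state example later in this thread):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # <cluster-name>-cv-<timestamp>, following the naming suggested above
  name: my-cluster-cv-2025-12-16-17-08-00
data:
  anomalyId: c2071b83-b011-4924-8fa9-8d3cb0b2ebb9
  fixableViolatedGoals: DiskUsageDistributionGoal
  unfixableViolatedGoals: ""
```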


Maybe we could separate the access paths by having the CC notifier append new anomalies to the data in the CM; these could have a timestamp attached and would form a log of recently detected anomalies.

The operator would only ever read the CM data and when it reacts to an anomaly it attaches an annotation to the CM with the timestamp (and ID maybe) of the anomaly. The operator would use the annotation as a way to see if new anomalies have been added. The operator only ever reads the CM data and edits the annotation.

The CC notifier just appends anomalies, it would also have a GC function to clear up anomalies that are over a certain age. This has the added advantage of making it easy for a human operator to see what anomalies have been detected.
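A rough sketch of that split, with the notifier owning `.data` and the operator owning a single tracking annotation (the annotation name and log format are only illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-cluster-goal-violations
  annotations:
    # Written only by the operator: the last anomaly it has acted on (hypothetical annotation)
    strimzi.io/last-processed-anomaly: "c2071b83-b011-4924-8fa9-8d3cb0b2ebb9"
data:
  # Appended only by the notifier: a timestamped log of recently detected anomalies
  anomalies: |
    2024-12-17T09:26:00Z c2071b83-b011-4924-8fa9-8d3cb0b2ebb9 fixable=[DiskUsageDistributionGoal] unfixable=[]
```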


Also, to Jakub's point, can the CO just poll the CC REST API directly and see what anomalies have been detected (with a timestamp)? We could store a "last seen anomaly" somewhere and compare those to the list of detected anomalies.

@ShubhamRwt does the API have that kind of detail or does it have to go through the Notifier? Would we lose info like fixable vs unfixable?

Member


> does the API have that kind of detail or does it have to go through the Notifier? Would we lose info like fixable vs unfixable?

I cannot answer that, but I guess in theory the notifier can have its own REST API for this?

Member


The conflict between the notifier and the operator to update the same CM is a good point. It would be of course the same with a dedicated CR (as already pointed out by Jakub).
We could think about a "pull" based mechanism (compared to the "push" one we have right now) for the operator to get information about the detected anomalies.
Cruise Control exposes the /kafkacruisecontrol/state endpoint in order to get the anomaly_detector substate, which is described by the following OpenAPI spec: https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/resources/yaml/responses/anomalyDetectorState.yaml

The AnomalyDetails object contains pretty much what we need (with fixable and unfixable violations) with a status which would always be IGNORED in our case (set by the notifier). But it's important to notice that what you get is a recentGoalViolations list ... and IIRC from tests I ran in the past, old goal violations stay there for some time (even if I can't see a FIXED status enum, so maybe the fixed ones go away). Anyway, the operator should understand which ones are now fixed or not by running its own rebalancing. Or which ones are currently being fixed because a rebalancing is running.
Other than that, we need a way for the notifier to notify the operator to take a look at that endpoint because some anomalies were detected. At this point maybe a dedicated Kafka topic could be a solution (?).
To avoid this, the operator could pull from that endpoint on each reconciliation, making the notifier totally useless (the NoOp one is good enough just to set IGNORE for any anomaly) and having a fully pull-based mechanism from the operator perspective. But again, how could the operator understand from the "recent" list of goal violations which ones still need to be addressed and which ones have already been addressed (if they stay there for longer, as part of the history)? Also, the operator reconciliation period would not match the anomaly detector period (which is when the anomaly is detected, and a new status could be reported when the API is hit by the operator).

Member


> Other than that, we need a way for the notifier to notify the operator to take a look at that endpoint because some anomalies were detected. At this point maybe a dedicated Kafka topic could be a solution (?).

I think checking the API every reconciliation is easier. If needed, you set some backoff interval to, for example, wait at least 15 minutes after the last auto-rebalance before checking it or before acting on it. Especially if you need to give CC time to figure out what was resolved.

Contributor Author


This is what the anomaly detector state looks like:

AnomalyDetectorState: {
  selfHealingEnabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY, TOPIC_ANOMALY, MAINTENANCE_EVENT],
  selfHealingDisabled:[],
  selfHealingEnabledRatio:{BROKER_FAILURE=1.0, DISK_FAILURE=1.0, GOAL_VIOLATION=1.0, METRIC_ANOMALY=1.0, TOPIC_ANOMALY=1.0, MAINTENANCE_EVENT=1.0},
  recentGoalViolations:[{anomalyId=c2071b83-b011-4924-8fa9-8d3cb0b2ebb9, detectionDate=2024-12-17T09:26:00Z, statusUpdateDate=2024-12-17T09:26:04Z, fixableViolatedGoals=[DiskUsageDistributionGoal], unfixableViolatedGoals=[], status=FIX_STARTED}],
  recentBrokerFailures:[],
  recentMetricAnomalies:[],
  recentDiskFailures:[],
  recentTopicAnomalies:[],
  recentMaintenanceEvents:[],
  metrics:{meanTimeBetweenAnomalies:{GOAL_VIOLATION:3.07 milliseconds, BROKER_FAILURE:0.00 milliseconds, METRIC_ANOMALY:0.00 milliseconds, DISK_FAILURE:0.00 milliseconds, TOPIC_ANOMALY:0.00 milliseconds}, meanTimeToStartFix:2.89 seconds, numSelfHealingStarted:1, numSelfHealingFailedToStart:0, ongoingAnomalyDuration=0.00 milliseconds},
  ongoingSelfHealingAnomaly:None,
  balancednessScore:88.450
}

The recentGoalViolations list will have the anomalies even if they are resolved. I am not sure how we will differentiate whether the violation is fixed or not. The only thing we could know is whether a fix has started for the anomaly or not. I will try to see when this list gets cleared by running some tests, but maybe the only way to clear the list would be when the CC pod gets rolled again.

The detected <anomaly> has unfixable `DiskDistributionGoal` goal.

Once the warning is issued, the operator will clear the `unfixableAnomalies` list.
Member


I'm not sure if this works given the nature of the conditions in the Kafka CR. They need to be treated as ephemeral.

Comment on lines +360 to +361
Therefore, we will add a new metric of counter type named `anomaly_detection_metrics` which will be updated whenever an anomaly is detected by the anomaly detector. We will also add labels to it to differentiate them based on fixability and type of anomaly.
These metrics will be exposed by the operator and will be only available when the user has configured auto-rebalance on `imbalance` mode.
Member


This should have more details because it does not seem like you can provide this metric in a reliable fashion from how you treat the anomalies. Or maybe you don't sufficiently explain what the metric will show ...

Comment on lines +404 to +406
2. The `KafkaAutoRebalancingReconciler.reconcile()` will then check if there was any ConfigMap created with name `<cluster-name>-goal-violations` and whether the `fixableAnomalies` and `unfixableAnomalies` list is empty or not, then the `full` rebalance (imbalance mode) would be performed if the `unfixableAnomalies` list is empty and the `fixableAnomalies` list is not empty.
3. If a rebalance is already ongoing and more anomalies are detected, then the operator will just ignore the new anomalies and delete all the anomalies from the `fixableAnomalies` list in the ConfigMap.
4. Any anomalies that are not resolved by the ongoing rebalance will be redetected by the anomaly detector once the FSM returns to the `Idle` state.
Member


This cycle of deleting and recreating the config maps by the CO and the notifier probably needs some time-based limits? How often will the notifier recreate the CM?

Contributor Author


But we are never deleting the ConfigMap unless we disable the imbalance mode for auto-rebalance.


@tomncooper tomncooper left a comment


I have had another pass. I think the main issue is that, as Jakub highlighted, the lifecycle of the Anomaly CM is quite complicated.

I think the proposal would benefit from a description of the states the ACM will move through and what happens under different scenarios. When you have two processes writing to and reading from the same source concurrently, you need to make sure there aren't any race conditions that could bite you. I suspect the relatively long anomaly detection period will help here but it is worth being methodical.

I also think you may need a more sophisticated approach to the rebalance "completed with error" warnings in the Kafka CR status. You need to prevent thrashing or multiple warnings piling up for the same issue.


metadata:
  name: my-cluster-goal-violations
data:
  fixableAnomalies: |


Nit: I wonder if it would be better to use a JSON format here, rather than this nested YAML structure? It would be less susceptible to parsing issues with indentation etc.
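For example, the same content could be carried as a single JSON document under one data key (the exact structure is only a sketch):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-cluster-goal-violations
data:
  anomalies.json: |
    {
      "fixableAnomalies": [
        {
          "anomalyId": "c2071b83-b011-4924-8fa9-8d3cb0b2ebb9",
          "fixableViolatedGoals": ["DiskUsageDistributionGoal"]
        }
      ],
      "unfixableAnomalies": []
    }
```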

Member


@tomncooper please take a look at this comment here: #161 (comment). I tagged you and others because I think we should change the structure to be more anomaly-ID driven.

The default anomaly detection goals set by the operator are `RACK_AWARENESS_GOAL`, `MIN_TOPIC_LEADERS_PER_BROKER_GOAL`, `REPLICA_CAPACITY_GOAL`, `DISK_CAPACITY_GOAL`.
These are similar to the default goals used for KafkaRebalance if the users don't mention the rebalance goals.
If the user specifies a rebalance template then the goals mentioned in the rebalance template will be validated against the anomaly detection goals being used by Cruise Control.
The goals in the rebalance template should be either the same or be a subset of the anomaly detection goals being used by Cruise Control.

@tomncooper tomncooper Dec 16, 2025


Not subset, a superset. The goals in the KR template need to include all the anomaly detection goals but they could have other goals in there too.

Suggested change
- The goals in the rebalance template should be either the same or be a subset of the anomaly detection goals being used by Cruise Control.
+ The goals in the rebalance template should be either the same or be a superset of the anomaly detection goals being used by Cruise Control.


The `fixableAnomalies` list will state the `anomalyId` of the anomaly as well as the fixable violated goals.
The `fixableAnomalies` list will be populated by the notifier every time it detects an anomaly.
Whenever a rebalance is about to be triggered by the operator the `fixableAnomalies` list will be cleared.
This will be the case for manual rebalances triggered by the user, as well as auto-rebalance rebalances that are happening in `add-broker`, `remove-broker`, or `imbalance` mode.


I agree there needs to be more discussion of how the anomaly CM is handled. Much like the various FSM states, you need to step through the lifecycle of the CM and make sure there aren't any race conditions that could occur between the notifier (writing) and the operator (reading and clearing).

