Added proposal for auto-rebalance on imbalanced cluster feature in operator #161
Conversation
I did a first pass. I think this is a better proposal which is more in line with how Strimzi currently uses CC.
I think you need more detail on the interaction with the current auto-rebalancing and also a clearer description of the FSM states and their transitions. I found it hard to follow the sequence you are proposing.
For the notifier, I actually think we should stop users from using custom notifiers (we could make it conditional on the full mode being set or not). As we are creating K8s resources in response to detected anomalies, users can create alerting based on that if they need it. If users do need that, then we could provide implementations of the various notifiers which extend our notifier rather than the CC one.
scholzj
left a comment
I think this is going in the right direction. But I think it needs to go a bit deeper:
- We need to establish our own terminology and not take over the Cruise Control one. There is not really any self-healing and most of the anomalies are not really anomalies.
- If I read this proposal right, you want to focus on when the cluster is out-of-balance. That is a great start. But perhaps that should not be called `mode: full`? Calling it `full` seems confusing - does it mean that `full` includes scale-up / scale-down? Also, I guess in the future we would add some actual self-healing to handle the broken disks or brokers. That might create additional modes probably. So maybe `mode: rebalance` or `mode: skew` or something like that would make more sense?
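To make the naming question concrete, here is a rough sketch of how such an entry could sit next to the existing auto-rebalance modes in the Kafka CR; the new mode keyword (`imbalance` below) is exactly the open question, and the template name is illustrative:

```yaml
spec:
  cruiseControl:
    autoRebalance:
      # existing modes, triggered on scale-up / scale-down
      - mode: add-brokers
        template:
          name: my-rebalance-template
      - mode: remove-brokers
        template:
          name: my-rebalance-template
      # hypothetical new mode for an out-of-balance cluster; whether to call it
      # "full", "rebalance", "skew" or "imbalance" is what is being discussed here
      - mode: imbalance
        template:
          name: my-rebalance-template
```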
@scholzj good to know that you like the track we are on now :-) Regarding the "full" related naming, we were just reusing the underlying mode naming for the …
I do not think this works here. The …
tomncooper
left a comment
Ok I did another pass. I have a few questions:
- How are you going to distinguish anomaly CMs from different Kafka clusters in the same namespace? I know it is not recommended, but users do deploy multiple Kafka clusters in the same NS.
- You need to deal with GC'ing all these anomaly CMs in the case where a rebalance is ongoing. Do you delete them? Do you have some kind of timeout based on the detection interval?
- It is not clear what you mean by scale up/down auto-rebalances being queued up? I assume you mean generated `KafkaRebalance` CRs? But it is not clear.
Chiming in as an end user - glad to see this proposal! We have been debating internally if we want to have a cronjob to issue rebalances, and this is a lot better. In particular, the model of using Cruise Control's anomaly detection while issuing the rebalances through KafkaRebalance CRs seems like it will fit perfectly into our workflows. I see discussion on how to represent the anomalies. Any solution here is fine for us; I envision we will mostly be interacting with the KafkaRebalance CR and not much with anything else. An area that could be more explicit is the right way to stop all rebalances and not issue any more. Rebalance operations often saturate bandwidth, either disk or network, and cause major latency during producing. We often find ourselves needing to cancel them as we scale and learn our limits. It looks like we might be able to delete … Thanks for putting this together, excited to see this.
@nickgarvey Thanks for the feedback! Usually you are able to stop the current rebalancing by applying the …
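For context, an in-flight rebalance can already be stopped by applying the `strimzi.io/rebalance: stop` annotation to the `KafkaRebalance` resource, for example:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: my-cluster
  annotations:
    # asks the operator to stop the ongoing rebalance for this resource
    strimzi.io/rebalance: stop
```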
tomncooper
left a comment
I have had another pass, mostly just suggestions around language and grammar.
However, you need a much clearer description of what happens when anomalies are detected while a rebalance is ongoing (either auto or manually triggered). Why is that an issue, what scenarios could arise, and how does having the timestamps help avoid that? Also, what exactly happens to the anomaly list - when is it updated, when is it cleared, etc.?
Force-pushed from 9eff83e to 9453f7f.
Hello @scholzj @ppatierno @tinaselenge @kyguy @tomncooper @katheris @im-konge @see-quick @Frawless. I have addressed and resolved all the comments on the proposal. The proposal is ready for review again. It would be great if you could find some time to have another pass.
tomncooper
left a comment
The section on the `StrimziCruiseControlNotifier`, particularly the CM format, needs review. It is not clear what the unfixable anomalies/goals list is actually going to contain. You also need to make clear what happens if all the detected anomalies are unfixable.
tinaselenge
left a comment
Thank you for the update Shubham. It's been a while since the last time I read the proposal, and I can see quite a lot has changed. I like the proposed solution to use the auto-rebalancing and let the operator handle the healing.
Overall I'm happy with the proposal, but I think there are a few parts that need to be ironed out with more clarity.
katheris
left a comment
Thanks @ShubhamRwt. I've added some formatting suggestions and a couple of clarifying questions.
The only other area I wondered about was whether we should consider making use of the maintenance windows? So for example not triggering a rebalance unless we are within a maintenance window. What do you think @scholzj ?
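For reference, maintenance windows are already expressed in the Kafka CR as a list of Quartz cron expressions, so gating the auto-rebalance on them would reuse existing configuration; the expression below is only illustrative:

```yaml
spec:
  maintenanceTimeWindows:
    # e.g. only allow maintenance-type operations in an early-morning weekend window
    - "* * 0-1 ? * SUN,SAT *"
```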
tinaselenge
left a comment
Overall, it looks really good to me. I left a couple of questions to clarify my understanding. Thanks!
Force-pushed from 63a502a to de6c286.
katheris
left a comment
Thanks for addressing my previous comments @ShubhamRwt. I added one more related to how the user can clear the unfixable list in the ConfigMap.
I also saw a bunch of discussion about the FSM and want to confirm my understanding. So from what I understand, basically the `RebalanceOnImbalance` state is only reachable from `Idle`. This is because when a new anomaly is detected, if there is a rebalance currently ongoing it might fix the anomaly, so we don't ever queue an imbalance-related rebalance - is that right?
Then if an imbalance-related rebalance is ongoing and a scale up or scale down is needed, we queue those in the same way we do today.
Am I understanding all of that right?
Yes @katheris, you are right for the case where we are reachable to …
Thanks @ShubhamRwt for addressing the comments. I am happy with the proposal, +1 non-binding
ppatierno
left a comment
@ShubhamRwt we are almost there ... keep up the nice work on it. I left a few comments, but once they are addressed I will approve the proposal.
> Once the warning is issued, the operator will clear the `unfixableAnomalies` list.
> In case the user doesn't fix the anomaly, the `unfixableAnomalies` list will be populated again with the unfixable goal violation, and it will keep happening until the user fixes the anomaly manually.
I still don't think this is right: on subsequent reconciliations, how will the operator know whether to keep the warning in the status? Shouldn't we clear the list of unfixable anomalies at the same time as clearing the fixable anomalies, so after a rebalance has happened? That way the operator knows on each reconciliation whether to leave the warning there or to remove it.
I see your point @katheris.
First, let me highlight that when an anomaly has unfixable goals, no rebalance is going to happen, so we can't clear the list "after a rebalance has happened".
At the same time, we need to clear the anomaly from the list anyway because, unfortunately, when the same anomaly is detected by Cruise Control later on, it will get assigned a different ID. So if we don't delete the previous one (which is actually the same anomaly but with a different ID), we'll get the ConfigMap growing indefinitely.
Right now I can't see a good solution other than not adding the warning to the Kafka CR and instead just logging in the Kafka operator that an anomaly was ignored because it is not fixable.
@ppatierno I know that a rebalance won't be triggered for the unfixable anomalies, however for things like disk failures it's somewhat likely that a rebalance would be triggered by the user after the disk is replaced, so if we are going to tie clearing the list to anything then a rebalance seems an ok choice.
In terms of the ID, thanks, that's useful information I wasn't aware of. So I guess it will be the same type of violation but a different ID? Could we just list the violation in the warning, and in the ConfigMap remove the old one when we detect a new violation?
That still doesn't fix the problem of the warning being constantly added and removed.
Will the Warnings be removed from the Kafka CR status? I can see how we need to prevent warnings for the same violation piling up in the status.
Could we add some logic to dedup the warnings. So if a warning based on DiskDistributionGoal has been issued in the last N hours, don't post it again?
> so if we are going to tie clearing the list to anything then a rebalance seems an ok choice.

What if the user for any reason doesn't start a rebalance? The operator won't clear the list, and CC will continue to report the same anomaly but with a different ID, adding it to the CM, which will grow indefinitely.

> Could we just list the violation in the warning and in the ConfigMap when we detect a new violation remove the old one?

I can imagine the following scenario. Anomaly ID = 1 brought goal violations A and B (not fixable). Then CC detects violation B again, together with another goal violation C, and both are brought with anomaly ID = 2. I don't think that the operator should delete anomaly ID = 1 from the ConfigMap just because the same violation B was reported by anomaly ID = 2 as well. We would lose violation A. Unless it was fixed? 🤔 It's really tricky.
> ## Motivation
>
> Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
> With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the imbalances on your own.
What is very time-consuming about this for larger clusters? Isn't the process exactly the same for small and large clusters when you use KafkaRebalance? It just takes longer between the steps. But that does not mean the process itself is more time-consuming.
> * `CHECK` - Check the anomaly at a later time
> * `IGNORE` - Ignore the anomaly
>
> The default notifier enabled by Cruise Control is the `NoopNotifier`.
> We propose using only those anomaly detection classes, related to goal violations, that can be addressed by a partition rebalance.
> We will not enable the other anomaly detection classes, related to goal violations, that would require manual interventions at the infrastructure level such as disk or broker failures.
> For the latter case, it would be better to spin up new disks or to fix the issue with the broker(s) directly instead of just moving the partition replicas away from them through rebalancing.
> Therefore, given the interventions required, it would be non-trivial for the Strimzi Operator to fix these failures automatically.
I do not think it is a question of triviality. The main reason to spin it off into a separate proposal is that the solution for them is NOT running a rebalance.
> If the user specifies a rebalance template then the goals mentioned in the rebalance template will be validated against the anomaly detection goals being used by Cruise Control.
> The goals in the rebalance template should be either the same or be a subset of the anomaly detection goals being used by Cruise Control.
> This is to ensure rebalances are only being run with goals that Cruise Control is checking for anomalies.
> If the goals mentioned in the templates are not a subset of the configured anomaly detection goals then we will add a warning condition in the Kafka CR regarding the failed validation.
So, what is the point of the template if the user is effectively unable to configure the goals?
> The `KafkaRebalance` has 4 modes: `full`, `add-broker`, `remove-broker` and `remove-disks` mode.
> The `imbalance` mode will be mapped to the `full` mode in the generated KafkaRebalance resource which means that the generated `KafkaRebalance` custom resource will have the mode set as `full`, which within the Strimzi rebalancing operator means calling the Cruise Control API to run a rebalancing taking all brokers into account.
>
> The generated `KafkaRebalance` custom resource will be called `<my-cluster-name>-auto-rebalancing-imbalance-<anomalyId>`, where the `<my-cluster-name>` part comes from the `metadata.name` in the `Kafka` custom resource, `imbalance` refers to applying the rebalance to all the brokers, and the `<anomalyId>` would be retrieved from the notifier.
What does the anomaly ID look like? It is important to be able to judge the maximal length of the names ...
This is a good point, there is a 253-character limit, and a long cluster name (I have seen some very long ones in the wild) combined with the anomaly ID could conceivably push you over that limit.
Also, I know it is unlikely, but it is possible for there to be multiple anomalies in the ConfigMap. In that case, which one is used for the naming?
Ultimately, is the anomaly ID in the name actually helpful, can the user use it for anything? As I understand it, we will only ever have one auto-generated KR CR of type imbalance at any one time (or could one stick around, perhaps after an error?), so why not a fixed suffix after the Kafka cluster name?
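To make the length concern concrete, a sketch of a generated resource under the proposed naming, reusing the UUID-style anomaly ID that appears in the anomaly detector state later in this thread (other fields are illustrative):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  # fixed "-auto-rebalancing-imbalance-" suffix plus a 36-character UUID on top of the
  # cluster name, which is what could push a very long cluster name past the 253-character limit
  name: my-cluster-auto-rebalancing-imbalance-c2071b83-b011-4924-8fa9-8d3cb0b2ebb9
  labels:
    strimzi.io/cluster: my-cluster
spec:
  # the imbalance mode is mapped to a full rebalance across all brokers
  mode: full
```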
> In case we get an anomaly which has both fixable goals and unfixable goals, we will still `IGNORE` that anomaly.
> This is because Cruise Control won't be able to generate a proposal that can fix all the configured goals for rebalance.
>
> The created ConfigMap will persist when `imbalance` mode is enabled for auto-rebalance and will be deleted if the user decides to remove `imbalance` mode from `spec.kafka.cruiseControl.autoRebalance`.
Wait, above you said it would be deleted before the rebalance starts 🤔
Above I only said that the fixable anomalies list will be cleared. I don't see where I mentioned that the ConfigMap would be deleted.
Well, that was clearly not how I understood the text. That also does not solve the issue of conflicts.
Can you point me to the text you are referring to? I can fix it if it sounds confusing.
> The advantages of using a single ConfigMap for all the detected anomalies versus one per anomaly are:
> 1. It provides a single point of reference for all the anomalies
> 2. When dealing with multiple Kafka clusters, the number of ConfigMap(s) created will be equal to the number of clusters deployed
I get why these two advantages matter. But I think the lifecycle of the ConfigMap as you describe it below is a mess and will not work well.
If nothing else, you should properly describe it and not have it spread through the text. Maybe it would work and is just badly described. But I wonder if that is really the case and whether a more event-driven approach is needed.
Why? We are only maintaining one ConfigMap per cluster. The ConfigMap will always exist unless the imbalance mode for auto-rebalance is disabled.
@scholzj I am interested in what you mean by "a more event driven approach"? The core issue is the communication of detected anomalies between the Notifier running in CC and the operator. There are many ways to solve that, the CM being one of them. I am interested in what other options there might be; we already looked at a dedicated KafkaAnomaly CR.
@tomncooper While I understand the advantage of the single ConfigMap, I do not think it would work the way it is designed here. The worry is not just race conditions in the sense of losing some information, as I assume unfixed anomalies might be raised again at some point. But just the handling of 409 conflicts will be a mess which you would need to handle somehow, and that somehow will likely be failed reconciliations.
So I think it needs either ...:
- A clearly defined ownership, where the anomalies and their state are owned by CC and it is responsible for cleaning them up (the question is, does that really need the CM? Can't the REST API do this?)
- Alternatively, if we do not think CC can manage the whole lifecycle (which might be the case, I do not feel like I have the expertise to say that for sure), then we should utilize a pattern where CC produces the events and the CO consumes them in a safe fashion.
  - That could be, for example, a stream of ConfigMaps being produced by CC and consumed (deleted) by the CO. They could for example have names based on some timestamps (`<cluster-name>-cv-2025-12-16-17-08-00`). This might have its own challenges (for example we need some clear back-off interval for how often CC produces these ConfigMaps). But the CO would simply be able to delete these as it processes them.
  - Obviously, with Kafka at hand and CC already using Kafka in a similar way, one could also think about utilizing the Kafka cluster for this. For example by having some single-partition topic and tracking offsets in the Kafka CR status. But this has its own challenges as well, I guess: not confusing users with some internal consumer groups, size on disk, keeping the right replication factor, having the operator blocked when the topic is unavailable, etc.

I do not think the dedicated CR solves this. A new CRD can maybe have the advantage of the `.status` section managed by the CO and the `.spec` section managed by CC. But even that produces 409 errors during parallel updates. But overall, I think this is more about how you create and manage the resources rather than what type of resource you have.
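A minimal sketch of the "stream of ConfigMaps" option under those assumptions (names and payload fields are illustrative, not part of the proposal yet):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # one ConfigMap per detection event, created by the Cruise Control notifier
  # and deleted by the Cluster Operator once processed
  name: my-cluster-cv-2025-12-16-17-08-00
  labels:
    strimzi.io/cluster: my-cluster
data:
  anomalies: |
    - anomalyId: c2071b83-b011-4924-8fa9-8d3cb0b2ebb9
      fixableViolatedGoals: [DiskUsageDistributionGoal]
      unfixableViolatedGoals: []
```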
Maybe we could separate the access paths by having the CC notifier append new anomalies to the data in the CM; these could have a timestamp attached and would form a log of recently detected anomalies.
The operator would only ever read the CM data and when it reacts to an anomaly it attaches an annotation to the CM with the timestamp (and ID maybe) of the anomaly. The operator would use the annotation as a way to see if new anomalies have been added. The operator only ever reads the CM data and edits the annotation.
The CC notifier just appends anomalies, it would also have a GC function to clear up anomalies that are over a certain age. This has the added advantage of making it easy for a human operator to see what anomalies have been detected.
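A rough sketch of that split, assuming the notifier only ever appends to `data` and the operator only ever reads it and updates an annotation (all names are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-cluster-goal-violations
  annotations:
    # written only by the operator: the timestamp/ID of the last anomaly it acted on
    strimzi.io/last-handled-anomaly: "2024-12-17T09:26:00Z/c2071b83-b011-4924-8fa9-8d3cb0b2ebb9"
data:
  # appended only by the Cruise Control notifier; entries older than a certain age are GC'd by it
  fixableAnomalies: |
    - anomalyId: c2071b83-b011-4924-8fa9-8d3cb0b2ebb9
      detectionDate: 2024-12-17T09:26:00Z
      fixableViolatedGoals: [DiskUsageDistributionGoal]
```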
Also, to Jakub's point, can the CO just poll the CC REST API directly and see what anomalies have been detected (with a timestamp)? We could store a "last seen anomaly" somewhere and compare those to the list of detected anomalies.
@ShubhamRwt does the API have that kind of detail or does it have to go through the Notifier? Would we lose info like fixable vs unfixable?
> does the API have that kind of detail or does it have to go through the Notifier? Would we lose info like fixable vs unfixable?

I cannot answer that, but I guess in theory the notifier can have its own REST API for this?
The conflict between the notifier and the operator to update the same CM is a good point. It would be of course the same with a dedicated CR (as already pointed out by Jakub).
We could think about a "pull" based mechanism (compared to the "push" one we have right now) for the operator to get information about the detected anomalies.
Cruise Control exposes the `/kafkacruisecontrol/state` endpoint in order to get the `anomaly_detector` substate, which is described by the following OpenAPI spec: https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/resources/yaml/responses/anomalyDetectorState.yaml
The `AnomalyDetails` object contains pretty much what we need (with fixable and unfixable violations) together with a status, which would always be IGNORED in our case (set by the notifier). But it's important to notice that what you get is a `recentGoalViolations` list ... and IIRC from tests I ran in the past, old goal violations stay there for some time (even if I can't see a FIXED status enum, so maybe the fixed ones go away). Anyway, the operator should figure out which ones are now fixed or not by running its own rebalancing, or which ones are currently being fixed because a rebalancing is running.
Other than that, we need a way for the notifier to notify the operator to take a look at that endpoint because some anomalies were detected. At this point maybe a dedicated Kafka topic could be a solution (?).
To avoid this, the operator could instead pull from that endpoint on each reconciliation, making the notifier totally useless (the Noop one is good enough just to set IGNORE for any anomaly) and giving us a fully pull-based mechanism from the operator perspective. But again, how could the operator understand, from the "recent" list of goal violations, which ones still need to be addressed and which ones are already addressed (if they stay there for longer, as part of the history)? Also, the operator reconciliation period would not match the anomaly detector period (which is when the anomaly is detected, and a new status could be reported by the time the API is hit by the operator).
> Other than that, we need a way for the notifier to notify the operator to take a look at that endpoint because some anomalies were detected. At this point maybe a dedicated Kafka topic could be a solution (?).

I think checking the API every reconciliation is easier. If needed, you set some back-off interval, for example to wait at least 15 minutes after the last auto-rebalance before checking it or acting on it. Especially if you need to give CC time to figure out what was resolved.
This is what the anomaly detector state looks like:
AnomalyDetectorState: {selfHealingEnabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY, TOPIC_ANOMALY, MAINTENANCE_EVENT], selfHealingDisabled:[], selfHealingEnabledRatio:{BROKER_FAILURE=1.0, DISK_FAILURE=1.0, GOAL_VIOLATION=1.0, METRIC_ANOMALY=1.0, TOPIC_ANOMALY=1.0, MAINTENANCE_EVENT=1.0}, recentGoalViolations:[{anomalyId=c2071b83-b011-4924-8fa9-8d3cb0b2ebb9, detectionDate=2024-12-17T09:26:00Z, statusUpdateDate=2024-12-17T09:26:04Z, fixableViolatedGoals=[DiskUsageDistributionGoal], unfixableViolatedGoals=[], status=FIX_STARTED}], recentBrokerFailures:[], recentMetricAnomalies:[], recentDiskFailures:[], recentTopicAnomalies:[], recentMaintenanceEvents:[], metrics:{meanTimeBetweenAnomalies:{GOAL_VIOLATION:3.07 milliseconds, BROKER_FAILURE:0.00 milliseconds, METRIC_ANOMALY:0.00 milliseconds, DISK_FAILURE:0.00 milliseconds, TOPIC_ANOMALY:0.00 milliseconds}, meanTimeToStartFix:2.89 seconds, numSelfHealingStarted:1, numSelfHealingFailedToStart:0, ongoingAnomalyDuration=0.00 milliseconds}, ongoingSelfHealingAnomaly:None, balancednessScore:88.450}
The `recentGoalViolations` list will have the anomalies even if they are resolved. I am not sure how we will differentiate whether the violation is fixed or not. The only thing we could know is whether a fix has started for the anomaly or not. I will try to see when this list gets cleared by running some tests, but maybe the only way to clear the list would be when the CC pod gets rolled again.
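Pulling out just the `recentGoalViolations` entry from that dump, the fields the operator would presumably care about are:

```yaml
recentGoalViolations:
  - anomalyId: c2071b83-b011-4924-8fa9-8d3cb0b2ebb9
    detectionDate: 2024-12-17T09:26:00Z
    statusUpdateDate: 2024-12-17T09:26:04Z
    fixableViolatedGoals: [DiskUsageDistributionGoal]
    unfixableViolatedGoals: []
    status: FIX_STARTED   # would be IGNORED with the proposed notifier, per the discussion above
```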
> The detected <anomaly> has unfixable `DiskDistributionGoal` goal.
>
> Once the warning is issued, the operator will clear the `unfixableAnomalies` list.
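For illustration, such a warning would presumably surface as a condition in the Kafka CR status, roughly like the following (the `reason` string and timestamp are made up for the example):

```yaml
status:
  conditions:
    - type: Warning
      status: "True"
      reason: UnfixableGoalViolation   # illustrative reason
      message: The detected anomaly c2071b83-b011-4924-8fa9-8d3cb0b2ebb9 has unfixable DiskDistributionGoal goal.
      lastTransitionTime: "2024-12-17T09:26:04Z"
```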
I'm not sure this works given the nature of the conditions in the Kafka CR. They need to be treated as ephemeral.
> Therefore, we will add a new metric of counter type named `anomaly_detection_metrics` which will be updated whenever an anomaly is detected by the anomaly detector. We will also add labels to it to differentiate them based on fixability and type of anomaly.
> These metrics will be exposed by the operator and will be only available when the user has configured auto-rebalance on `imbalance` mode.
This should have more details because it does not seem like you can provide this metric in a reliable fashion from how you treat the anomalies. Or maybe you don't sufficiently explain what the metric will show ...
> 2. The `KafkaAutoRebalancingReconciler.reconcile()` will then check if there was any ConfigMap created with name `<cluster-name>-goal-violations` and whether the `fixableAnomalies` and `unfixableAnomalies` lists are empty or not, then the `full` rebalance (imbalance mode) would be performed if the `unfixableAnomalies` list is empty and the `fixableAnomalies` list is not empty.
> 3. If a rebalance is already ongoing and more anomalies are detected, then the operator will just ignore the new anomalies and delete all the anomalies from the `fixableAnomalies` list in the ConfigMap.
> 4. Any anomalies that are not resolved by the ongoing rebalance will be redetected by the anomaly detector once the FSM returns to the `Idle` state.
This cycle of deleting and recreating the ConfigMaps by the CO and the notifier probably needs some time-based limits? How often will the notifier recreate the CM?
But we are never deleting the ConfigMap unless we disable the imbalance mode for auto-rebalance.
I have had another pass. I think the main issue is that, as Jakub highlighted, the lifecycle of the Anomaly CM is quite complicated.
I think the proposal would benefit from a description of the states the ACM will move through and what happens under different scenarios. When you have two processes writing to and reading from the same source concurrently, you need to make sure there aren't any race conditions that could bite you. I suspect the relatively long anomaly detection period will help here, but it is worth being methodical.
I also think you may need a more sophisticated approach to the rebalance "completed with error" warnings in the Kafka CR status. You need to prevent thrashing or multiple warnings piling up for the same issue.
> metadata:
>   name: my-cluster-goal-violations
> data:
>   fixableAnomalies:
Nit: I wonder if it would be better to use a JSON format here, rather than this nested YAML structure? It would be less susceptible to parsing issues with indentation etc.
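For comparison, the same data carried as a JSON payload inside the ConfigMap, as the nit suggests (field names follow the structure quoted above and are still open to change):

```yaml
data:
  fixableAnomalies: |
    [
      {
        "anomalyId": "c2071b83-b011-4924-8fa9-8d3cb0b2ebb9",
        "fixableViolatedGoals": ["DiskUsageDistributionGoal"]
      }
    ]
```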
@tomncooper please take a look at this comment here: #161 (comment). I tagged you and others because I think we should change the structure to be more anomaly-ID driven.
> The default anomaly detection goals set by the operator are `RACK_AWARENESS_GOAL`, `MIN_TOPIC_LEADERS_PER_BROKER_GOAL`, `REPLICA_CAPACITY_GOAL`, `DISK_CAPACITY_GOAL`.
> These are similar to the default goals used for KafkaRebalance if the users don't specify the rebalance goals.
> If the user specifies a rebalance template then the goals mentioned in the rebalance template will be validated against the anomaly detection goals being used by Cruise Control.
> The goals in the rebalance template should be either the same or be a subset of the anomaly detection goals being used by Cruise Control.
Not subset, a superset. The goals in the KR template need to include all the anomaly detection goals but they could have other goals in there too.
Suggested change:
> The goals in the rebalance template should be either the same or be a superset of the anomaly detection goals being used by Cruise Control.
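To illustrate the superset requirement, a sketch of a rebalance template whose goals include all of the default anomaly detection goals listed above plus one extra; the template annotation is assumed to follow how auto-rebalance templates are marked today, and goal names are the Cruise Control goal class names:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance-template
  annotations:
    # marks this resource as a configuration template rather than an actual rebalance request
    strimzi.io/rebalance-template: "true"
spec:
  goals:
    - RackAwareGoal
    - MinTopicLeadersPerBrokerGoal
    - ReplicaCapacityGoal
    - DiskCapacityGoal
    - DiskUsageDistributionGoal   # extra goal on top of the anomaly detection set
```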
> The `fixableAnomalies` list will state the `anomalyId` of the anomaly as well as the fixable violated goals.
> The `fixableAnomalies` list will be populated by the notifier every time it detects an anomaly.
> Whenever a rebalance is about to be triggered by the operator the `fixableAnomalies` list will be cleared.
> This will be the case for manual rebalances triggered by the user, as well as auto-rebalance rebalances that are happening in `add-broker`, `remove-broker`, or `imbalance` mode.
I agree there needs to be more discussion of how the anomaly CM is handled. Much like with the various FSM states, you need to step through the lifecycle of the CM and make sure there aren't any race conditions that could occur between the notifier (writing) and the operator (reading and clearing).
This PR aims to introduce the self-healing feature in Strimzi. This proposal incorporates all the comments and suggestions left on the old proposal #145. This proposal aims to utilize the `auto-rebalancing` feature of Strimzi to introduce self-healing.