
Conversation


@griffindvs griffindvs commented Oct 28, 2025

This PR introduces a proposal for Kafka rack awareness where node pools are assigned to racks/availability zones.

I have created a prototype implementation here. I have used this prototype with the following configuration:

Kafka CR:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  annotations:
    strimzi.io/kraft: enabled
    strimzi.io/node-pools: enabled
  name: my-kafka
  namespace: strimzi
spec:
  kafka:
    rack:
      type: envvar
...

Three KafkaNodePool CRs, one for each zone, according to the following format:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  labels:
    strimzi.io/cluster: my-kafka
  name: zoneX
  namespace: strimzi
spec:
  replicas: Y
  roles:
  - broker
  - controller
  template:
    kafkaContainer:
      env:
        - name: STRIMZI_RACK
          value: zoneX
    pod:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    strimzi.io/cluster: my-kafka
                    strimzi.io/pool-name: zoneX
                topologyKey: topology.kubernetes.io/zone
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 90
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: strimzi.io/cluster
                  operator: In
                  values:
                  - my-kafka
                - key: strimzi.io/pool-name
                  operator: NotIn
                  values:
                  - zoneX
              topologyKey: topology.kubernetes.io/zone
          - weight: 80
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  strimzi.io/cluster: my-kafka
              topologyKey: kubernetes.io/hostname

An example using five brokers and three zones:

kubectl get kafkanodepool.kafka -n strimzi
NAME    DESIRED REPLICAS   ROLES                     NODEIDS
zone0   2                  ["controller","broker"]   [0,1]
zone1   2                  ["controller","broker"]   [2,3]
zone2   1                  ["controller","broker"]   [4]

Pod placement (zone and node per pod):
NAME                                       ZONE         NODE
my-kafka-entity-operator-fbbc6859-fpr8q    Raleigh      worker0.example.com
my-kafka-zone0-0                           ChapelHill   worker2.example.com
my-kafka-zone0-1                           ChapelHill   worker5.example.com
my-kafka-zone1-2                           Durham       worker1.example.com
my-kafka-zone1-3                           Durham       worker4.example.com
my-kafka-zone2-4                           Raleigh      worker3.example.com
strimzi-cluster-operator-558d7b695-th8mv   ChapelHill   worker5.example.com

Topic metadata:
Metadata for all topics (from broker -1: sasl_ssl://localhost:9094/bootstrap):
 5 brokers:
  broker 0 at my-kafka-zone0-0-strimzi.example.com:443
  broker 1 at my-kafka-zone0-1-strimzi.example.com:443 (controller)
  broker 2 at my-kafka-zone1-2-strimzi.example.com:443
  broker 3 at my-kafka-zone1-3-strimzi.example.com:443
  broker 4 at my-kafka-zone2-4-strimzi.example.com:443
 1 topics:
  topic "my-topic" with 5 partitions:
    partition 0, leader 4, replicas: 4,1,2, isrs: 2,4,1
    partition 1, leader 3, replicas: 1,3,4, isrs: 3,4,1
    partition 2, leader 2, replicas: 2,4,1, isrs: 2,4,1
    partition 3, leader 4, replicas: 4,1,3, isrs: 3,4,1
    partition 4, leader 3, replicas: 1,3,4, isrs: 3,4,1

Sample broker config:

  server.config: |-
    ##########
    # Node ID
    ##########
    node.id=0

    ##########
    # Rack ID
    ##########
    broker.rack=${strimzienv:STRIMZI_RACK}

...
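For context, the ${strimzienv:STRIMZI_RACK} placeholder is resolved at broker startup through Kafka's configuration provider mechanism. A minimal sketch of the kind of provider wiring involved (the provider alias and class shown here are assumptions, not copied from the generated config):

    config.providers=strimzienv
    config.providers.strimzienv.class=io.strimzi.kafka.EnvVarConfigProvider
    # With STRIMZI_RACK=zone0 in the container environment,
    # broker.rack=${strimzienv:STRIMZI_RACK} resolves to broker.rack=zone0.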

This proposal maintains CRD compatibility by introducing a new, optional field.
All existing configurations would continue to be valid and maintain their existing behavior.

## Rejected alternatives
Member:

Please keep in mind that there are also some new Kubernetes features coming to the downward API, as discussed in strimzi/strimzi-kafka-operator#11504. If nothing else, that should be mentioned here. But we will likely want to wait and see how it turns out.

Author:

Hi Jakub, I have this issue linked/mentioned as the second rejected alternative on line 126 https://github.com/strimzi/proposals/pull/184/files#diff-1028926b15ea23de0e061f3d424bf901a8af3d9ef217de0071b7831c0c0574d3R126

Member:

But we are not rejecting the feature. That is where we are heading, because it has the potential to actually reduce the need for the RBAC rights without crippling the UX. So your proposal should not reject it. It should instead explain/convince people why it is worth investing in and maintaining yet another way to do it that has inferior UX and that we will not be able to get rid of easily as the better solution matures. (Given that in some edge cases the Kubernetes solution might not be fully self-sufficient, it might be the third way to do the same thing -> that is a LOT of maintenance effort, testing effort, etc. I would argue that Strimzi cannot afford that ... so you need to have a really solid motivation and use-cases.)

Author:

I've moved this alternative to a new section "Available alternatives" (instead of rejected) and outlined a few benefits of this proposal over that possible future enhancement: d0dbdb1

@griffindvs changed the title from "Add Pool-based Rack Awareness proposal" to "Add Node Pool Rack IDs proposal" on Dec 15, 2025
@scholzj (Member) left a comment:

I think the APIs should look a bit different ...

  • The Kafka CR needs a clear definition of how the rack awareness is configured. You can probably use the typed APIs for it. E.g. for the current mode:
    rack:
      type: node-based
      topologyKey: topology.kubernetes.io/zone
    And for the new mode:
    rack:
      type: envvar
    With the default always being node-based if type is not specified.
    You would need to check in the code how exactly this can be implemented, as this missed the v1 API window where it could have had a clean implementation.
  • When type: envvar is used, the rack will be taken from the environment variable. I'm not sure if its name should be configurable or hardcoded - something to think about. But you will not need to think about all the various options the future might need for the `rackId` API. And environment variables can already be configured in the Node Pool API today.
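As a small illustration of the default behavior described above, a Kafka CR with no type would keep today's semantics (this is a sketch of the suggested API, not its final shape):

spec:
  kafka:
    rack:
      # type omitted -> behaves as type: node-based
      topologyKey: topology.kubernetes.io/zone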

implementation are prohibitive.
Many Kubernetes cluster administrators may restrict access to cluster-scoped Kubernetes
resources to ensure an application and the user managing it are contained within a limited set of namespaces.
Today, Strimzi only requires access to cluster-scoped Kubernetes resources for rack-awareness and NodePort
Member:

It is required for storage resizing as well, as we need to read storage classes for that.
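For reference, a rough sketch of the kind of cluster-scoped rules being discussed (simplified, with illustrative names, and not the exact Strimzi ClusterRoles):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-cluster-scoped-access   # illustrative name only
rules:
  # Reading node labels for node-label based rack awareness and NodePort addresses
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
  # Reading storage classes when deciding whether a volume can be resized
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get"]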

Comment on lines 28 to 29
Implementing the proposed method for pool-based rack awareness removes these potentially prohibitive
requirements while maintaining a simple deployment model.
Member:

It does not remove these requirements -> it just makes them optional in one of the mentioned cases.

  labels:
    strimzi.io/cluster: my-cluster
spec:
  rackId: zone0
Member:

What if you rely on an environment variable rather than a dedicated API? That is probably more future-proof and does not need an API change.
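For illustration, a minimal sketch of what that could look like in a node pool, using the environment variable support that already exists in the pod template (the variable name here is only an example at this point in the discussion):

spec:
  template:
    kafkaContainer:
      env:
        - name: STRIMZI_RACK   # example name; the final name is discussed later in the thread
          value: zone0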

Comment on lines 49 to 53
When using a pool rack ID, users must configure pod affinity and anti-affinity in the Kafka
pod template to ensure:

* Brokers within pools with the same rack ID are scheduled in the same availability zone
* Brokers in pools with different rack IDs are scheduled in different availability zones
Member:

I would keep this example. But I would reword it:

  • Strictly speaking, users should always configure affinity / topology spread constraints when using any rack awareness
  • One of the reasons why this feature is interesting is that users do not necessarily need to define these rules. They can just control the rack ID for testing purposes, etc.
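As one illustration of such a rule, a pool that carries a fixed rack ID could be pinned to the matching zone with node affinity in the pod template (the zone value is only an example):

spec:
  template:
    pod:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - zone0   # the zone this pool's rack ID refers to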

@griffindvs (Author) commented:

Hi @scholzj in 54da00c I've modified the API based on your suggestion.

I used node-label instead of node-based as the type name for the current behavior, but I'm happy to change that as needed. I also chose to use a hard-coded environment variable, STRIMZI_RACK. I thought it would be best to keep the implementation simple and API changes minimal unless there's a demonstrated need for custom environment variables.

I also made changes for your other comments.

Please let me know your thoughts. Thank you!

@scholzj (Member) left a comment:

I left some more nits ... but I think the core of the proposal is now pretty good:

  • It gives users a lot of flexibility
  • It is easy to customize through various tools (such as injecting the environment variable from webhooks etc.)
  • It does not put too much load on the API

Thanks for incorporating the comments!

* This rack type maintains the existing behavior where rack IDs are configured using a node label
* The node label is determined using the existing `topologyKey` field
* `envvar`
* This rack type uses the `STRIMZI_RACK` environment variable in the broker container to populate the rack ID
Member:

Some quick thoughts ... would STRIMZI_RACK_ID be better? Alternatively, should we make the environment variable name configurable?
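If the name were made configurable, the API might look something like this; the envVarName field below is purely hypothetical and only meant to illustrate the trade-off:

spec:
  kafka:
    rack:
      type: envvar
      envVarName: MY_RACK_ID   # hypothetical field, not part of the current proposal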

  labels:
    strimzi.io/cluster: my-cluster
spec:
  rackId: zone0
Member:

I guess we should replace this with the environment variable definition.

* `envvar`
* This rack type uses the `STRIMZI_RACK` environment variable in the broker container to populate the rack ID

When a `topologyKey` is defined, the default rack type will be `node-label` to maintain existing behavior.
Member:

We should ideally make it either/or ...

  • You either have topologyKey and optional type: node-label
  • Or you need to have type: envvar

I think the CEL validation rules can help to validate this. But we want to avoid something like this:

spec:
  kafka:
    rack: {}
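As a sketch of how a CEL rule in the CRD schema could enforce the either/or constraint (the exact rule wording is an assumption, not taken from the proposal):

# In the CRD schema for .spec.kafka.rack
x-kubernetes-validations:
  - rule: "(has(self.type) && self.type == 'envvar') ? !has(self.topologyKey) : has(self.topologyKey)"
    message: "topologyKey is required for node-label racks and must not be set when type is envvar"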

Comment on lines +21 to +26
For users interested in a heightened security posture, the requirements of the current rack-awareness
implementation are prohibitive.
Many Kubernetes cluster administrators may restrict access to cluster-scoped Kubernetes
resources to ensure an application and the user managing it are contained within a limited set of namespaces.
Today, Strimzi requires access to cluster-scoped Kubernetes resources for rack-awareness, NodePort
listener configuration, and reading StorageClasses for volume resizing.
Member:

TBH, I do not think it really improves security in any meaningful way. So I would leave out the first sentence. Also, I would mention some other situations where this might be useful:

Suggested change (replacing the quoted lines above):
Today, Strimzi requires access to cluster-scoped Kubernetes resources for rack-awareness, NodePort
listener configuration, and reading StorageClasses for volume resizing.
But some Kubernetes cluster administrators may restrict access to cluster-scoped Kubernetes
resources to ensure an application and the user managing it are contained within a limited set of namespaces.
In such situations, Strimzi users are not able to use the rack-awareness feature.
With this proposal, they will be able to configure and use rack-awareness even when Strimzi does not have the access rights to create `ClusterRoleBinding` resources.
This feature might also be useful in other situations such as:
* In testing, where it allows configuring rack-awareness independently of the underlying infrastructure (for example, it allows testing rack-awareness-related features on a single-node Kubernetes cluster by manually defining different racks for different node pools)
* In future stretch-cluster environments, where automatically configured rack-awareness might not be desired (for example, when moving Kafka nodes between two Kubernetes clusters)
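As an illustration of the testing use case, two node pools on a single-node cluster could declare different racks purely through the environment variable, with no affinity rules at all (names and values below are only an example):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: rack-a
  labels:
    strimzi.io/cluster: my-kafka
spec:
  replicas: 1
  roles:
    - broker
    - controller
  template:
    kafkaContainer:
      env:
        - name: STRIMZI_RACK
          value: rack-a
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: rack-b
  labels:
    strimzi.io/cluster: my-kafka
spec:
  replicas: 1
  roles:
    - broker
  template:
    kafkaContainer:
      env:
        - name: STRIMZI_RACK
          value: rack-b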
