
Conversation


@griffindvs griffindvs commented Oct 28, 2025

This PR introduces a proposal for Kafka rack awareness where node pools are assigned to racks/availability zones.

I have created a prototype implementation here. I have used this prototype with the following configuration:

Kafka CR:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  annotations:
    strimzi.io/kraft: enabled
    strimzi.io/node-pools: enabled
  name: my-kafka
  namespace: strimzi
spec:
  kafka:
    rack:
      type: envvar
...

Three KafkaNodePool CRs, one for each zone, according to the following format:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  labels:
    strimzi.io/cluster: my-kafka
  name: zoneX
  namespace: strimzi
spec:
  replicas: Y
  roles:
  - broker
  - controller
  template:
    kafkaContainer:
      env:
        - name: STRIMZI_RACK
          value: zoneX
    pod:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    strimzi.io/cluster: my-kafka
                    strimzi.io/pool-name: zoneX
                topologyKey: topology.kubernetes.io/zone
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 90
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: strimzi.io/cluster
                  operator: In
                  values:
                  - my-kafka
                - key: strimzi.io/pool-name
                  operator: NotIn
                  values:
                  - zoneX
              topologyKey: topology.kubernetes.io/zone
          - weight: 80
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  strimzi.io/cluster: my-kafka
              topologyKey: kubernetes.io/hostname

An example using five brokers and three zones:

kubectl get kafkanodepool.kafka -n strimzi
NAME    DESIRED REPLICAS   ROLES                     NODEIDS
zone0   2                  ["controller","broker"]   [0,1]
zone1   2                  ["controller","broker"]   [2,3]
zone2   1                  ["controller","broker"]   [4]

Pod placement (zone and node per pod):
NAME                                       ZONE         NODE
my-kafka-entity-operator-fbbc6859-fpr8q    Raleigh      worker0.example.com
my-kafka-zone0-0                           ChapelHill   worker2.example.com
my-kafka-zone0-1                           ChapelHill   worker5.example.com
my-kafka-zone1-2                           Durham       worker1.example.com
my-kafka-zone1-3                           Durham       worker4.example.com
my-kafka-zone2-4                           Raleigh      worker3.example.com
strimzi-cluster-operator-558d7b695-th8mv   ChapelHill   worker5.example.com

Topic metadata:
Metadata for all topics (from broker -1: sasl_ssl://localhost:9094/bootstrap):
 5 brokers:
  broker 0 at my-kafka-zone0-0-strimzi.example.com:443
  broker 1 at my-kafka-zone0-1-strimzi.example.com:443 (controller)
  broker 2 at my-kafka-zone1-2-strimzi.example.com:443
  broker 3 at my-kafka-zone1-3-strimzi.example.com:443
  broker 4 at my-kafka-zone2-4-strimzi.example.com:443
 1 topics:
  topic "my-topic" with 5 partitions:
    partition 0, leader 4, replicas: 4,1,2, isrs: 2,4,1
    partition 1, leader 3, replicas: 1,3,4, isrs: 3,4,1
    partition 2, leader 2, replicas: 2,4,1, isrs: 2,4,1
    partition 3, leader 4, replicas: 4,1,3, isrs: 3,4,1
    partition 4, leader 3, replicas: 1,3,4, isrs: 3,4,1

Sample broker config:

  server.config: |-
    ##########
    # Node ID
    ##########
    node.id=0

    ##########
    # Rack ID
    ##########
    broker.rack=${strimzienv:STRIMZI_RACK}

...
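For context, the ${strimzienv:STRIMZI_RACK} placeholder is resolved at broker startup through Kafka's configuration provider mechanism. A minimal sketch of the kind of provider wiring involved (the provider alias and class shown here are assumptions, not copied from the generated config):

    config.providers=strimzienv
    config.providers.strimzienv.class=io.strimzi.kafka.EnvVarConfigProvider
    # With STRIMZI_RACK=zone0 in the container environment,
    # broker.rack=${strimzienv:STRIMZI_RACK} resolves to broker.rack=zone0.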

This proposal maintains CRD compatibility by introducing a new, optional field.
All existing configurations would continue to be valid and maintain their existing behavior.

## Rejected alternatives
Member:

Please keep in mind that there are also some new Kubernetes features coming to the downward API, as discussed in strimzi/strimzi-kafka-operator#11504. If nothing else, that should be mentioned here. But we will likely want to wait and see how it turns out.

Author:

Hi Jakub, I have this issue linked/mentioned as the second rejected alternative on line 126 https://github.com/strimzi/proposals/pull/184/files#diff-1028926b15ea23de0e061f3d424bf901a8af3d9ef217de0071b7831c0c0574d3R126

Member:

But we are not rejecting the feature. That is where we are heading, because it has the potential to actually reduce the need for the RBAC rights without crippling the UX. So your proposal should not reject it. It should instead explain/convince people why it is worth investing in and maintaining yet another way to do it that has inferior UX and that we will not be able to get rid of easily as the better solution matures. (Given that in some edge cases the Kubernetes solution might not be fully self-sufficient, it might be the third way to do the same thing -> that is a LOT of maintenance effort, testing effort, etc. I would argue that Strimzi cannot afford that ... so you need to have a really solid motivation and use-cases.)

Author:

I've moved this alternative to a new section "Available alternatives" (instead of rejected) and outlined a few benefits of this proposal over that possible future enhancement: d0dbdb1

@griffindvs changed the title from "Add Pool-based Rack Awareness proposal" to "Add Node Pool Rack IDs proposal" on Dec 15, 2025
@scholzj (Member) left a comment:

I think the APIs should look a bit different ...

  • The Kafka CR needs a clear definition of how the rack awareness is configured. You can probably use the typed APIs for it. E.g. for the current mode:
    rack:
      type: node-based
      topologyKey: topology.kubernetes.io/zone
    And for the new mode:
    rack:
      type: envvar
    With the default always being node-based if type is not specified.
    You would need to check in the code how exactly this can be implemented, as this missed the v1 API window where it could have had a clean implementation.
  • When type: envvar is used, the rack will be taken from the environment variable. I'm not sure if its name should be configurable or hardcoded - something to think about. But you will not need to think about all the various options the future might need for the `rackId` API. And environment variables can already be configured in the Node Pool API today.
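As a small illustration of the default behavior described above, a Kafka CR with no type would keep today's semantics (this is a sketch of the suggested API, not its final shape):

spec:
  kafka:
    rack:
      # type omitted -> behaves as type: node-based
      topologyKey: topology.kubernetes.io/zone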

implementation are prohibitive.
Many Kubernetes cluster administrators may restrict access to cluster-scoped Kubernetes
resources to ensure an application and the user managing it are contained within a limited set of namespaces.
Today, Strimzi only requires access to cluster-scoped Kubernetes resources for rack-awareness and NodePort
Member:

It is required for storage resizing as well, as we need to read storage classes for that.
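For reference, a rough sketch of the kind of cluster-scoped rules being discussed (simplified, with illustrative names, and not the exact Strimzi ClusterRoles):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-cluster-scoped-access   # illustrative name only
rules:
  # Reading node labels for node-label based rack awareness and NodePort addresses
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
  # Reading storage classes when deciding whether a volume can be resized
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get"]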

Comment on lines 28 to 29
Implementing the proposed method for pool-based rack awareness removes these potentially prohibitive
requirements while maintaining a simple deployment model.
Member:

It does not remove these requirements -> it just makes them optional in one of the mentioned cases.

  labels:
    strimzi.io/cluster: my-cluster
spec:
  rackId: zone0
Member:

What if you rely on an environment variable rather than a dedicated API? That is probably more future-proof and does not need an API change.
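For illustration, a minimal sketch of what that could look like in a node pool, using the environment variable support that already exists in the pod template (the variable name here is only an example at this point in the discussion):

spec:
  template:
    kafkaContainer:
      env:
        - name: STRIMZI_RACK   # example name; the final name is discussed later in the thread
          value: zone0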

Comment on lines 49 to 53
When using a pool rack ID, users must configure pod affinity and anti-affinity in the Kafka
pod template to ensure:

* Brokers within pools with the same rack ID are scheduled in the same availability zone
* Brokers in pools with different rack IDs are scheduled in different availability zones
Member:

I would keep this example. But I would reword it:

  • Strictly speaking, users should always configure affinity / topology spread constraints when using any rack awareness
  • One of the reasons why this feature is interesting is that users do not necessarily need to define these rules. They can just control the rack ID for testing purposes, etc.
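As one illustration of such a rule, a pool that carries a fixed rack ID could be pinned to the matching zone with node affinity in the pod template (the zone value is only an example):

spec:
  template:
    pod:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - zone0   # the zone this pool's rack ID refers to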

@griffindvs (Author) commented:

Hi @scholzj in 54da00c I've modified the API based on your suggestion.

I used node-label instead of node-based as the type name for the current behavior, but I'm happy to change that as needed. I also chose to use a hard-coded environment variable, STRIMZI_RACK. I thought it would be best to keep the implementation simple and API changes minimal unless there's a demonstrated need for custom environment variables.

I also made changes for your other comments.

Please let me know your thoughts. Thank you!

@scholzj (Member) left a comment:

I left some more nits ... but I think the core of the proposal is now pretty good:

  • It gives users a lot of flexibility
  • It is easy to customize through various tools (such as injecting the environment variable from webhooks etc.)
  • It does not put too much load on the API

Thanks for incorporating the comments!

* This rack type maintains the existing behavior where rack IDs are configured using a node label
* The node label is determined using the existing `topologyKey` field
* `envvar`
* This rack type uses the `STRIMZI_RACK` environment variable in the broker container to populate the rack ID
Member:

Some quick thoughts ... would STRIMZI_RACK_ID be better? Alternatively, should we make the environment variable name configurable?
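If the name were made configurable, the API might look something like this; the envVarName field below is purely hypothetical and only meant to illustrate the trade-off:

spec:
  kafka:
    rack:
      type: envvar
      envVarName: MY_RACK_ID   # hypothetical field, not part of the current proposal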

  labels:
    strimzi.io/cluster: my-cluster
spec:
  rackId: zone0
Member:

I guess we should replace this with the environment variable definition.

* `envvar`
* This rack type uses the `STRIMZI_RACK` environment variable in the broker container to populate the rack ID

When a `topologyKey` is defined, the default rack type will be `node-label` to maintain existing behavior.
Member:

We should ideally make it either/or ...

  • You either have topologyKey and optional type: node-label
  • Or you need to have type: envvar

I think the CEL validation rules can help to validate this. But we want to avoid something like this:

spec:
  kafka:
    rack: {}
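As a sketch of how a CEL rule in the CRD schema could enforce the either/or constraint (the exact rule wording is an assumption, not taken from the proposal):

# In the CRD schema for .spec.kafka.rack
x-kubernetes-validations:
  - rule: "(has(self.type) && self.type == 'envvar') ? !has(self.topologyKey) : has(self.topologyKey)"
    message: "topologyKey is required for node-label racks and must not be set when type is envvar"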

Comment on lines +21 to +26
For users interested in a heightened security posture, the requirements of the current rack-awareness
implementation are prohibitive.
Many Kubernetes cluster administrators may restrict access to cluster-scoped Kubernetes
resources to ensure an application and the user managing it are contained within a limited set of namespaces.
Today, Strimzi requires access to cluster-scoped Kubernetes resources for rack-awareness, NodePort
listener configuration, and reading StorageClasses for volume resizing.
Member:

TBH, I do not think it really improves security in any meaningful way. So I would leave out the first sentence. Also, I would mention some other situations where this might be useful:

Suggested change (replacing the quoted lines above):
Today, Strimzi requires access to cluster-scoped Kubernetes resources for rack-awareness, NodePort
listener configuration, and reading StorageClasses for volume resizing.
But some Kubernetes cluster administrators may restrict access to cluster-scoped Kubernetes
resources to ensure an application and the user managing it are contained within a limited set of namespaces.
In such situations, Strimzi users are not able to use the rack-awareness feature.
With this proposal, they will be able to configure and use rack-awareness even when Strimzi does not have the access rights to create `ClusterRoleBinding` resources.
This feature might also be useful in other situations such as:
* In testing, where it allows configuring rack-awareness independently of the underlying infrastructure (for example, it allows testing rack-awareness-related features on a single-node Kubernetes cluster by manually defining different racks for different node pools)
* In future stretch-cluster environments, where automatically configured rack-awareness might not be desired (for example, when moving Kafka nodes between two Kubernetes clusters)
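As an illustration of the testing use case, two node pools on a single-node cluster could declare different racks purely through the environment variable, with no affinity rules at all (names and values below are only an example):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: rack-a
  labels:
    strimzi.io/cluster: my-kafka
spec:
  replicas: 1
  roles:
    - broker
    - controller
  template:
    kafkaContainer:
      env:
        - name: STRIMZI_RACK
          value: rack-a
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: rack-b
  labels:
    strimzi.io/cluster: my-kafka
spec:
  replicas: 1
  roles:
    - broker
  template:
    kafkaContainer:
      env:
        - name: STRIMZI_RACK
          value: rack-b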
