
Conversation

@olekzabl
Contributor

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

It proposes a new data format for TopologyAssignment, needed to go beyond the current effective limit of 20-30k nodes assigned to a single workload. We intend to use this new format in the v1beta2 API.

This is a part of #7220.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE
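
For orientation, the proposed format roughly takes the following shape, assembled from the diff excerpts quoted in the review threads below (a reconstruction for readers of this thread, not the verbatim API):

```go
// A rough sketch of the proposed shape, assembled from the diff excerpts
// quoted in the review threads below (a reconstruction, not the verbatim API).
type TopologyAssignment struct {
	// levels is an ordered list of keys denoting the levels of the assigned
	// topology (i.e. node label keys), from the highest to the lowest level.
	Levels []string

	// slices jointly describe the set of assigned topology domains.
	Slices []TopologyAssignmentSlice
}

type TopologyAssignmentSlice struct {
	// size is the number of topology domains in this slice.
	Size int32

	// valuesPerLevel describes the domain values, one entry per level.
	ValuesPerLevel []TopologyAssignmentSliceLevelValues

	// counts, if set, specifies the number of Pods per domain;
	// its length must equal size.
	Counts []int32

	// universalCount, if set, specifies the number of Pods allocated in
	// every domain in this slice. Exactly one of counts, universalCount
	// must be set.
	UniversalCount *int32
}

type TopologyAssignmentSliceLevelValues struct {
	// commonPrefix specifies a common prefix of all values at this level.
	CommonPrefix *string
	// commonSuffix specifies a common suffix of all values at this level.
	CommonSuffix *string
	// roots lists the per-domain values, stripped of the common
	// prefix/suffix; if set, its length must equal the slice's size.
	Roots []string
	// universalValue, if set, is the single value shared by every domain.
	// Mutually exclusive with roots, commonPrefix and commonSuffix.
	UniversalValue *string
}
```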

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note-none Denotes a PR that doesn't merit a release note. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Oct 28, 2025
@netlify

netlify bot commented Oct 28, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

| Name | Link |
|------|------|
| 🔨 Latest commit | b057892 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/690a8f22cec95a000956b26d |

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 28, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 28, 2025
@k8s-ci-robot
Contributor

Hi @olekzabl. Thanks for your PR.

I'm waiting for an org member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 28, 2025
@olekzabl
Contributor Author

/assign @PBundyra

@mimowo
Contributor

mimowo commented Oct 28, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 28, 2025
```go
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
ValuesPerLevel []TopologyAssignmentSliceLevelValues
```
Contributor

Should it be `TopologyAssignmentSliceLevelValues` or `TopologyLevelAssignment`?

Contributor

@mimowo mimowo Oct 29, 2025

IIUC you are asking about the name of the type? Both will work technically, so I think it is not a matter of "should" but of preference.

It is a common practice in k8s to name types so that they include the parent type as a prefix, to make sure there are no conflicts. Sometimes it leads to long struct names, but the struct names aren't user-facing anyway, and it guarantees no conflicts. TBH I like the convention because it saves us from thinking up the best possible name for each struct; we already put a lot of effort into coming up with names for the fields (which is more important due to the user-facing impact) :). Still, I'm not going to object to the alternative.

Contributor Author

Thanks for catching. I intended the new struct to have a new name. Fixed there.

Contributor

@mimowo mimowo Oct 29, 2025

Actually, this "convention" is just something you would find in multiple places, but I don't think it is universally followed even in core k8s, or written down anywhere, so "no hard guideline" here IIUC.

Contributor Author

@mimowo I saw your replies only now.

I think that @PBundyra mostly complained about the internal inconsistency of my PR - I initially had

```go
ValuesPerLevel []TopologyAssignmentSliceLevelValues
```

but then

```go
type TopologyLevelAssignment struct { ... }
```

The latter by mistake.

Contributor

Ah, thanks for clarifying - I didn't catch that. Sure, just make sure the naming is aligned so that it "compiles".

Contributor

> I think that @PBundyra mostly complained about the internal inconsistency of my PR - I initially had

That's what I meant, thanks

```yaml
valuesPerLevel:
- universalValue: block-1
- commonPrefix: rack-
  roots: [1, 2]
```
Contributor

@PBundyra PBundyra Oct 29, 2025

If there was another level (let's say sub-rack):

```yaml
slices:
- size: 3
  valuesPerLevel:
  - universalValue: block-1
  - commonPrefix: rack-
    roots: [1, 2]
  - commonPrefix: sub-rack-
    roots: [a, b, c]
  counts: [3,1,2]
```
What is the proper combination of racks/sub-racks for the particular Pods? Should it be e.g.:

1.

```
block-1+rack-1+sub-rack-a: 3
block-1+rack-1+sub-rack-b: 1
block-1+rack-2+sub-rack-c: 2
```

or

2.

```
block-1+rack-1+sub-rack-a: 3
block-1+rack-2+sub-rack-b: 1
block-1+rack-2+sub-rack-c: 2
```
Or maybe in this case, rack-1 and rack-2 would be part of different slices?
```yaml
slices:
- size: 2
  valuesPerLevel:
  - universalValue: block-1
  - universalValue: rack-1
  - commonPrefix: sub-rack-
    roots: [a, b]
  counts: [3,1]
- size: 1
  valuesPerLevel:
  - universalValue: block-1
  - universalValue: rack-2
  - universalValue: sub-rack-c
  universalCount: 1
```

for option 1.

Contributor Author

@olekzabl olekzabl Oct 29, 2025

Yes, different slices is one way to represent this.

Or, alternatively, a single slice but with repetitions in Roots. E.g. for Option 1, Roots would be [1,1,2]:

```yaml
- size: 3
  valuesPerLevel:
  - universalValue: block-1
  - commonPrefix: rack-
    roots: [1,1,2]
  - commonPrefix: sub-rack-
    roots: [a,b,c]
  counts: [3,1,2]
```

BTW Roots has the following comment on it:

> If set, its length must be equal to the "size" field of the TopologyAssignmentSlice.

And (as mentioned in KEP) I intend to enforce such rules with kubebuilder XValidation.

As a result, your example above ([1, 2] / [a, b, c]) would be rejected, because size is 3 while size([1, 2]) == 2. So I'm hoping this addresses the risk of confusion.
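
For illustration, such a size/roots consistency rule could be written as a CEL XValidation along these lines (a sketch, assuming the rule lives on the slice type; not the PR's actual marker):

```go
// Sketch only: each per-level roots list, when present, must have exactly
// `size` entries.
// +kubebuilder:validation:XValidation:rule="self.valuesPerLevel.all(v, !has(v.roots) || size(v.roots) == self.size)",message="each roots list must have exactly 'size' entries"
type TopologyAssignmentSlice struct {
	Size           int32
	ValuesPerLevel []TopologyAssignmentSliceLevelValues
	// ...
}
```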

Contributor

Are you still considering which one is better?

Contributor Author

Our short-term plan (say, for 0.16) is to always use just one slice - because that seems to suffice to fit 40k nodes (in fact, even ~60k).

Yet, the option of multiple slices is baked into the format right away, so that we can use it in the more distant future. And then, we'll have to decide on an encoding strategy.

BTW the general feeling now is that multiple slices add some mental complexity which is needed only for huge workloads - so for your particular example, I think we'd end up with 1 slice forever, just because it's small.

Contributor

> Are you still considering which one is better?

The plan for 0.15 is to use a single slice as a simpler solution; it still allows improving to 100k later with the better algo, which does not require API changes.

For now I would park the better-algo idea as a potential follow-up mentioned in the "Alternatives" section.

Contributor

SGTM

@PBundyra
Contributor

In the discussion in the original issue @mimowo commented: "Still maybe that could come in V1 as an incremental addition as extraTopologySlices and for now we could push boundaries to 40k nodes or so within the single Workload"

Should it also be in scope for this KEP change?

@mimowo
Contributor

mimowo commented Oct 29, 2025

> In the discussion in the original issue @mimowo commented: "Still maybe that could come in V1 as an incremental addition as extraTopologySlices and for now we could push boundaries to 40k nodes or so within the single Workload"

I think it still can be done as an incremental addition later, but I think it is out of scope for the first iteration. The API allows adding the field later while in v1beta2. Please let me know if I'm missing something.

}
```
The format of `TopologyAssignment` depends on the API version; see details [below]().
Contributor

Fix the link [below]().

Contributor Author

Done.

@olekzabl olekzabl requested review from PBundyra and mimowo October 29, 2025 16:42
```go
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
Slices []TopologyAssignmentSlice
```
Contributor

I think the new k8s api-linter, being worked on in #7283, will also require us to have MaxItems. I know it is hard to pick a cap here, but maybe we could at least say 100k.

Contributor Author

Each slice brings an overhead of (at least)

```json
{"size":1,"valuesPerLevel":[{"roots":["x"]}],"counts":[1]}
```

which is 58 chars. So, given the 1.5 MiB limit, it feels safe to bound it by 30k items.

Done that.

But then, Roots can be more numerous. I put 100k there.

Contributor

If Roots can be 100k, shouldn't MaxItems of Counts be roughly the same? I know there's also universalCount, but maybe we could bump that number anyway in case of some edge cases - or do you think those are not realistic values?

Contributor Author

Oh yes, I should have used the same for Counts and Roots. Thank you for catching.

Contributor Author

@olekzabl olekzabl Nov 4, 2025

FTR, I just made two changes to the limits for Slices (prompted by what I learned while implementing v1beta2); see the sketch after this list:

- Reduced MaxItems from 30k to 1k.
  30k turned out to be too much for the kubebuilder cross-validation rules: I got complaints about some CEL expressions being too heavy (by up to 3.9x); apparently the root cause was multiplying 30k (Slices) by 100k (Roots).
  A significant reduction on the Roots side would defeat the purpose of this PR - we want to be able to allocate at least 60k Roots within a single Slice.
  OTOH, 1k Slices feels more than enough in practice.
  (Given the "3.9x", reducing to 7500 would suffice, but I wanted to leave plenty of room for future developments. I suppose that, once rolled out, reducing that number would violate API compatibility rules.)

- Dropped "MinItems = 1", as I realized that a TopologyAssignment for 0 Pods can happen (at least in this test).
  It's not entirely obvious how to represent that in the new format - but, given the current convention (TopologyAssignment != nil, with Domains inside it not nil but empty), allowing empty Slices at least has the benefit of following that convention.
  - The main disadvantage is keeping Levels even though they become meaningless.
    However, either of the alternatives (nullifying Levels, or even the whole TopologyAssignment) could also be confusing in some way.
    So, in doubt, I'd stick to the existing convention, unless there are objections.
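
A minimal sketch of how the relevant markers might read after these two changes (limits per the comment above; not the verbatim diff):

```go
// slices jointly describe the assigned topology domains.
// An empty list represents an assignment of 0 Pods (MinItems dropped).
// +required
// +listType=atomic
// +kubebuilder:validation:MaxItems=1000
Slices []TopologyAssignmentSlice

// ...while roots (inside TopologyAssignmentSliceLevelValues) keeps the
// higher bound, independent of the Slices limit:
// +listType=atomic
// +kubebuilder:validation:MaxItems=100000
Roots []string
```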

```go
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
ValuesPerLevel []TopologyAssignmentSliceLevelValues
```
Contributor

I think the new api-linter will also want MaxItems here; 16 is enough, as this is the max number of items we support.

Contributor Author

Done.

```go
}

type TopologyAssignmentSliceLevelValues struct {
	// commonPrefix and commonSuffix specify a common prefix & suffix for all values in this assignment.
```
Contributor

Let's make it two comments; that is the standard convention.

Contributor Author

Done.
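
For illustration, the split presumably reads roughly like this (a sketch of the one-comment-per-field convention, not the verbatim diff):

```go
// commonPrefix, if set, specifies a prefix shared by all values at this level.
// +optional
CommonPrefix *string

// commonSuffix, if set, specifies a suffix shared by all values at this level.
// +optional
CommonSuffix *string
```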

```yaml
- size: 5
  valuesPerLevel:
  - commonPrefix: pool-1-node-
    roots: [1, 2, 3, 4, 5]
```
Contributor

roots are strings, so we probably need to quote each item, right? Similarly for the commonPrefix field.

Contributor Author

This is YAML, not JSON, so quoting strings is not required. (Nor does it happen in the pre-existing examples.)

Contributor

ah, ok

- In the short term, it increases the number of nodes which can fit into 1 etcd entry.
- By just using a single slice, with extracting common prefix and suffix of all node names, we reach around 60k.
Contributor

Please provide the short calculation and assumptions which lead to the 60k estimate. You may refer to the example from the original issue, but let's ground it in some assumptions.

Contributor Author

As discussed offline, I clarified it slightly - but I wouldn't like to go into the details of the simulations here.
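
For background, the common prefix & suffix extraction itself is straightforward; a minimal sketch along these lines (illustrative node names, not code from the PR):

```go
package main

import (
	"fmt"
	"strings"
)

// commonPrefixSuffix returns the longest prefix and suffix shared by all
// names. The per-domain "roots" are the middles left after stripping both.
// (Illustrative only; a real encoder must also ensure that the prefix and
// suffix do not overlap within the shortest name.)
func commonPrefixSuffix(names []string) (prefix, suffix string) {
	if len(names) == 0 {
		return "", ""
	}
	prefix, suffix = names[0], names[0]
	for _, n := range names[1:] {
		// Shrink the candidates until they match the current name too.
		for !strings.HasPrefix(n, prefix) {
			prefix = prefix[:len(prefix)-1]
		}
		for !strings.HasSuffix(n, suffix) {
			suffix = suffix[1:]
		}
	}
	return prefix, suffix
}

func main() {
	nodes := []string{
		"pool-1-node-17.example.com",
		"pool-1-node-42.example.com",
		"pool-1-node-108.example.com",
	}
	prefix, suffix := commonPrefixSuffix(nodes)
	for _, n := range nodes {
		root := strings.TrimSuffix(strings.TrimPrefix(n, prefix), suffix)
		fmt.Printf("root: %q\n", root) // "17", "42", "108"
	}
}
```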

Comment on lines 1062 to 1069
- Multiple slices allow optimizing even further, if desired. \
Our simulations of more complex algorithms (e.g. heuristic pruning of prefix tree) allowed fitting over 100k nodes. \
(However, at that point we reached a tradeoff between bytesize, encoding time, and conceptual simplicity. Resolving that tradeoff is out of scope of this design; the important thing is that the proposed data format supports various specific algorithms).
- In the long term, as the number of nodes grows, at some point we'll inevitably hit the 1.5MiB limit anyway. \
When this happens, we foresee a need to extract "chunks" of the whole assignment into separate instances of a dedicated CRD (analogously to how [EndpointSlice](https://github.com/kubernetes/kubernetes/blob/3b632270e9b866ee8bf62e89377ae95987671b49/pkg/apis/discovery/types.go#L24-L29) has been introduced in K8s core). \
While the v1beta2 format does not yet do that, by introducing `Slices` we come much closer to this. \
Once there is a need, we can promote (some of) `Slices` to instances of a standalone CRD - but the appropriate type system is already there.
Contributor

Let's move this section to a subsection under Alternatives.

Contributor Author

The first couple of lines discuss multiple slices, which I'm bringing in - so that part is not an "alternative".

The second part indeed is an "alternative", so I moved part of its discussion there - but it was still important for me to underline how the current design relates to that alternative (namely, that it brings us closer to it).

So I tried to organize it accordingly, PTAL.
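
For the record, the deferred alternative would amount to something like a standalone object carrying chunks of the assignment, analogous to EndpointSlice (a purely hypothetical sketch; the type name and fields are illustrative, not from the PR):

```go
package hypothetical // illustrative placement only

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// TopologyAssignmentChunk is a hypothetical standalone object holding a
// portion of a Workload's topology assignment, linked to the owning
// Workload (e.g. via an owner reference), much like EndpointSlice relates
// to a Service.
type TopologyAssignmentChunk struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	// slices reuses the same slice type as the in-Workload format.
	Slices []TopologyAssignmentSlice
}
```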

Contributor

lgtm

@mimowo
Contributor

mimowo commented Oct 29, 2025

@olekzabl feel free to also add yourself here: :)

@olekzabl olekzabl requested a review from mimowo October 30, 2025 08:33
@PBundyra
Contributor

Looks great, thanks!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 30, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 28724d37688f9e27c3f3fecaf433ca215d1a131c

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 30, 2025
**Reasons for discarding/deferring**
- Decreased readability of the API (some info delegated to other objects).
Contributor

nit: I would also add that even if we introduce TopologyAssignmentSlices in the future, it still makes sense to optimize the size of a single slice. This way we can manage fewer slices, and thus achieve better system performance.

Member

+1

Contributor Author

I added a mention below.
Verified offline with @mimowo that it matches his intent.
@tenzen-y PTAL

@mimowo
Contributor

mimowo commented Oct 31, 2025

/lgtm
/approve
In case you still want to add the remark #7419 (comment).
Also let me wait a little in case @tenzen-y has some review comments.
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 31, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 31, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 2252de258bbea0c024e28ef936c8727d4e59a80a

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, olekzabl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2025
@tenzen-y
Member

> /lgtm /approve In case you still want to add the remark #7419 (comment). Also let me wait a little in case @tenzen-y has some review comments. /hold

ACK

Member

@tenzen-y tenzen-y left a comment

That sounds like an interesting enhancement 🥇

```go
type PodSetAssignment struct {
	...

	// topologyAssignment indicates the topology assignment divided into
```
Member

Could we preserve the outdated design as the v1beta1 one?

Contributor Author

I'm not sure if I understand.

Please note that this red block is actually not removed.

- Part of it (applying to both v1beta{1,2}) is moved to a general description below.
  I re-formatted it as "regular Markdown prose", no longer a Golang comment - but the prose itself is the same.
- Another part (specific to v1beta1) is moved to a dedicated section, "Until v1beta1".

Does this address your comment?

Comment on lines +846 to +849
```go
// TASLabel is a label set on the Job's PodTemplate to indicate that the
// PodSet is admitted using TopologyAwareScheduling, and all Pods created
// from the Job's PodTemplate also have the label.
TASLabel = "kueue.x-k8s.io/tas"
```
Member

IIRC, we stopped using this label. Are you indicating that we need to reintroduce it?

Contributor Author

No. This was already here; I just moved it (exactly as part of "preserving the outdated design" ;)

From your comment, I understand this label has been removed, and so should also be removed from the KEP. However, I see it is still used, and only planned for removal:

```go
// TODO: remove after 0.16
if _, ok := pod.Labels[kueue.TASLabel]; ok {
	return true
}
```

So I'm not sure what is the best KEP treatment of such "planned deprecations".

For now, I took no action because this is unrelated to the topic of this PR.

Comment on lines 993 to 998
```go
Counts []int32 `json:"counts,omitempty"`

// universalCount, if set, specifies the number of pods allocated in every domain in this slice.
// Exactly one of counts, universalCount must be set.
// +optional
UniversalCount *int32 `json:"universalCount,omitempty"`
```
Member

Are these counts and universalCount exclusive?

Contributor Author

Yes. I left a comment saying this:

> // Exactly one of counts, universalCount must be set.

And, as mentioned in the KEP, in the actual code I intend to enforce such rules with XValidation.
(However, for the KEP comments I preferred plain prose, to make them easier to read.)
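
Such a mutual-exclusion rule can be expressed in CEL roughly as (a sketch, not the PR's actual marker):

```go
// Sketch only: exactly one of counts, universalCount is set.
// +kubebuilder:validation:XValidation:rule="has(self.counts) != has(self.universalCount)",message="exactly one of counts and universalCount must be set"
```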

```go
// that applies to every domain in the current slice.
// Mutually exclusive with roots, commonPrefix and commonSuffix.
// +optional
UniversalValue *string
```
Member

Are commonPrefix/commonSuffix and universalValue exclusive?

Contributor Author

Yes, they are. Same situation as in this thread.

**Reasons for discarding/deferring**
- Decreased readability of the API (some info delegated to other objects).
Member

+1

- In the short term, it increases the number of nodes which can fit into 1 etcd entry.
- By just using a single slice, with extracting common prefix and suffix of all node names, our simulations (for some real-life node naming schemes) suggested a limit of around 60k nodes.
Member

Do you already have limits for the new Workload w/ slice assignment - like 100k or something?
If yes, I think that would be a good threshold for when to start considering a separate slice CR.

Contributor Author

The limits depend on the chosen encoding strategy (and on the naming scheme used for nodes).

For a start, we're going to use the "single slice, single common prefix & suffix" approach, which seems to raise the limit to ~60k in practice. However, the boundary of "how far we can reach without CRD slices" has not yet been established.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 31, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mimowo October 31, 2025 11:23
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@olekzabl olekzabl requested a review from tenzen-y October 31, 2025 14:03