
Conversation

@olekzabl
Contributor

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

It proposes a new data format for TopologyAssignment, needed to go beyond the current effective limit of 20-30k nodes assigned to a single workload. We intend to use this new format in the v1beta2 API.

This is a part of #7220.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE
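
For orientation, the proposed format roughly takes the following shape, assembled from the diff excerpts quoted in the review threads below (a reconstruction for readers of this thread, not the verbatim API):

```go
// A rough sketch of the proposed shape, assembled from the diff excerpts
// quoted in the review threads below (a reconstruction, not the verbatim API).
type TopologyAssignment struct {
	// levels is an ordered list of keys denoting the levels of the assigned
	// topology (i.e. node label keys), from the highest to the lowest level.
	Levels []string

	// slices jointly describe the set of assigned topology domains.
	Slices []TopologyAssignmentSlice
}

type TopologyAssignmentSlice struct {
	// size is the number of topology domains in this slice.
	Size int32

	// valuesPerLevel describes the domain values, one entry per level.
	ValuesPerLevel []TopologyAssignmentSliceLevelValues

	// counts, if set, specifies the number of Pods per domain;
	// its length must equal size.
	Counts []int32

	// universalCount, if set, specifies the number of Pods allocated in
	// every domain in this slice. Exactly one of counts, universalCount
	// must be set.
	UniversalCount *int32
}

type TopologyAssignmentSliceLevelValues struct {
	// commonPrefix specifies a common prefix of all values at this level.
	CommonPrefix *string
	// commonSuffix specifies a common suffix of all values at this level.
	CommonSuffix *string
	// roots lists the per-domain values, stripped of the common
	// prefix/suffix; if set, its length must equal the slice's size.
	Roots []string
	// universalValue, if set, is the single value shared by every domain.
	// Mutually exclusive with roots, commonPrefix and commonSuffix.
	UniversalValue *string
}
```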

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note-none Denotes a PR that doesn't merit a release note. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Oct 28, 2025
@netlify

netlify bot commented Oct 28, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

| Name | Link |
|------|------|
| 🔨 Latest commit | b057892 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/690a8f22cec95a000956b26d |

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 28, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 28, 2025
@k8s-ci-robot
Contributor

Hi @olekzabl. Thanks for your PR.

I'm waiting for an org member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 28, 2025
@olekzabl
Contributor Author

/assign @PBundyra

@mimowo
Contributor

mimowo commented Oct 28, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 28, 2025
```go
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
ValuesPerLevel []TopologyAssignmentSliceLevelValues
```
Contributor

Should it be `TopologyAssignmentSliceLevelValues` or `TopologyLevelAssignment`?

Contributor

@mimowo mimowo Oct 29, 2025

IIUC you are asking about the name of the type? Both will work technically, so I think it is not a matter of "should" but of preference.

It is a common practice in k8s to name types so that they include the parent type as a prefix, to make sure there are no conflicts. Sometimes it leads to long struct names, but the struct names aren't user-facing anyway, and it guarantees no conflicts. TBH I like the convention because it saves us from thinking up the best possible name for each struct; we already put a lot of effort into coming up with names for the fields (which is more important due to the user-facing impact) :). Still, I'm not going to object to the alternative.

Contributor Author

Thanks for catching. I intended the new struct to have a new name. Fixed there.

Contributor

@mimowo mimowo Oct 29, 2025

Actually, this "convention" is just something you would find in multiple places, but I don't think it is universally followed even in core k8s, or written down anywhere, so "no hard guideline" here IIUC.

Contributor Author

@mimowo I saw your replies only now.

I think that @PBundyra mostly complained about the internal inconsistency of my PR - I initially had

```go
ValuesPerLevel []TopologyAssignmentSliceLevelValues
```

but then

```go
type TopologyLevelAssignment struct { ... }
```

The latter by mistake.

Contributor

Ah, thanks for clarifying - I didn't catch that. Sure, just make sure the naming is aligned so that it "compiles".

Contributor

> I think that @PBundyra mostly complained about the internal inconsistency of my PR - I initially had

That's what I meant, thanks

```yaml
valuesPerLevel:
- universalValue: block-1
- commonPrefix: rack-
  roots: [1, 2]
```
Contributor

@PBundyra PBundyra Oct 29, 2025

If there was another level (let's say sub-rack):

```yaml
slices:
- size: 3
  valuesPerLevel:
  - universalValue: block-1
  - commonPrefix: rack-
    roots: [1, 2]
  - commonPrefix: sub-rack-
    roots: [a, b, c]
  counts: [3,1,2]
```
What is the proper combination of racks/sub-racks for the particular Pods? Should it be e.g.:

1.

```
block-1+rack-1+sub-rack-a: 3
block-1+rack-1+sub-rack-b: 1
block-1+rack-2+sub-rack-c: 2
```

or

2.

```
block-1+rack-1+sub-rack-a: 3
block-1+rack-2+sub-rack-b: 1
block-1+rack-2+sub-rack-c: 2
```
Or maybe in this case, rack-1 and rack-2 would be part of different slices?
```yaml
slices:
- size: 2
  valuesPerLevel:
  - universalValue: block-1
  - universalValue: rack-1
  - commonPrefix: sub-rack-
    roots: [a, b]
  counts: [3,1]
- size: 1
  valuesPerLevel:
  - universalValue: block-1
  - universalValue: rack-2
  - universalValue: sub-rack-c
  universalCount: 1
```

for option 1.

Contributor Author

@olekzabl olekzabl Oct 29, 2025

Yes, different slices is one way to represent this.

Or, alternatively, a single slice but with repetitions in Roots. E.g. for Option 1, Roots would be [1,1,2]:

```yaml
- size: 3
  valuesPerLevel:
  - universalValue: block-1
  - commonPrefix: rack-
    roots: [1,1,2]
  - commonPrefix: sub-rack-
    roots: [a,b,c]
  counts: [3,1,2]
```

BTW Roots has the following comment on it:

> If set, its length must be equal to the "size" field of the TopologyAssignmentSlice.

And (as mentioned in KEP) I intend to enforce such rules with kubebuilder XValidation.

As a result, your example above ([1, 2] / [a, b, c]) would be rejected, because size is 3 while size([1, 2]) == 2. So I'm hoping this addresses the risk of confusion.
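
For illustration, such a size/roots consistency rule could be written as a CEL XValidation along these lines (a sketch, assuming the rule lives on the slice type; not the PR's actual marker):

```go
// Sketch only: each per-level roots list, when present, must have exactly
// `size` entries.
// +kubebuilder:validation:XValidation:rule="self.valuesPerLevel.all(v, !has(v.roots) || size(v.roots) == self.size)",message="each roots list must have exactly 'size' entries"
type TopologyAssignmentSlice struct {
	Size           int32
	ValuesPerLevel []TopologyAssignmentSliceLevelValues
	// ...
}
```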

Contributor

Are you still considering which one is better?

Contributor Author

Our short-term plan (say, for 0.16) is to always use just one slice - because that seems to suffice to fit 40k nodes (in fact, even ~60k).

Yet, the option of multiple slices is baked into the format right away, so that we can use it in the more distant future. And then, we'll have to decide on an encoding strategy.

BTW the general feeling now is that multiple slices add some mental complexity which is needed only for huge workloads - so for your particular example, I think we'd end up with 1 slice forever, just because it's small.

Contributor

> Are you still considering which one is better?

The plan for 0.15 is to use a single slice as a simpler solution; it still allows improving to 100k later with the better algo, which does not require API changes.

For now I would park the better-algo idea as a potential follow-up mentioned in the "Alternatives" section.

Contributor

SGTM

@PBundyra
Contributor

In the discussion in the original issue @mimowo commented: "Still maybe that could come in V1 as an incremental addition as extraTopologySlices and for now we could push boundaries to 40k nodes or so within the single Workload"

Should it also be in scope for this KEP change?

@mimowo
Contributor

mimowo commented Oct 29, 2025

> In the discussion in the original issue @mimowo commented: "Still maybe that could come in V1 as an incremental addition as extraTopologySlices and for now we could push boundaries to 40k nodes or so within the single Workload"

I think it still can be done as an incremental addition later, but I think it is out of scope for the first iteration. The API allows adding the field later while in v1beta2. Please let me know if I'm missing something.

}
```
The format of `TopologyAssignment` depends on the API version; see details [below]().
Contributor

Fix the link [below]().

Contributor Author

Done.

@olekzabl olekzabl requested review from PBundyra and mimowo October 29, 2025 16:42
```go
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
Slices []TopologyAssignmentSlice
```
Contributor

I think the new k8s api-linter, being worked on in #7283, will also require us to have MaxItems. I know it is hard to pick a cap here, but maybe we could at least say 100k.

Contributor Author

Each slice brings an overhead of (at least)

```json
{"size":1,"valuesPerLevel":[{"roots":["x"]}],"counts":[1]}
```

which is 58 chars. So, given the 1.5 MiB limit, it feels safe to bound it by 30k items.

Done that.

But then, Roots can be more numerous. I put 100k there.

Contributor

If Roots can be 100k, shouldn't MaxItems of Counts be roughly the same? I know there's also universalCount, but maybe we could bump that number anyway in case of some edge cases - or do you think those are not realistic values?

Contributor Author

Oh yes, I should have used the same for Counts and Roots. Thank you for catching.

Contributor Author

@olekzabl olekzabl Nov 4, 2025

FTR, I just made two changes to the limits for Slices (prompted by what I learned while implementing v1beta2); see the sketch after this list:

- Reduced MaxItems from 30k to 1k.
  30k turned out to be too much for the kubebuilder cross-validation rules: I got complaints about some CEL expressions being too heavy (by up to 3.9x); apparently the root cause was multiplying 30k (Slices) by 100k (Roots).
  A significant reduction on the Roots side would defeat the purpose of this PR - we want to be able to allocate at least 60k Roots within a single Slice.
  OTOH, 1k Slices feels more than enough in practice.
  (Given the "3.9x", reducing to 7500 would suffice, but I wanted to leave plenty of room for future developments. I suppose that, once rolled out, reducing that number would violate API compatibility rules.)

- Dropped "MinItems = 1", as I realized that a TopologyAssignment for 0 Pods can happen (at least in this test).
  It's not entirely obvious how to represent that in the new format - but, given the current convention (TopologyAssignment != nil, with Domains inside it not nil but empty), allowing empty Slices at least has the benefit of following that convention.
  - The main disadvantage is keeping Levels even though they become meaningless.
    However, either of the alternatives (nullifying Levels, or even the whole TopologyAssignment) could also be confusing in some way.
    So, in doubt, I'd stick to the existing convention, unless there are objections.
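
A minimal sketch of how the relevant markers might read after these two changes (limits per the comment above; not the verbatim diff):

```go
// slices jointly describe the assigned topology domains.
// An empty list represents an assignment of 0 Pods (MinItems dropped).
// +required
// +listType=atomic
// +kubebuilder:validation:MaxItems=1000
Slices []TopologyAssignmentSlice

// ...while roots (inside TopologyAssignmentSliceLevelValues) keeps the
// higher bound, independent of the Slices limit:
// +listType=atomic
// +kubebuilder:validation:MaxItems=100000
Roots []string
```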

```go
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
ValuesPerLevel []TopologyAssignmentSliceLevelValues
```
Contributor

I think the new api-linter will also want MaxItems here; 16 is enough, as this is the max number of items we support.

Contributor Author

Done.

```go
}

type TopologyAssignmentSliceLevelValues struct {
	// commonPrefix and commonSuffix specify a common prefix & suffix for all values in this assignment.
```
Contributor

Let's make it two comments; that is the standard convention.

Contributor Author

Done.
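
For illustration, the split presumably reads roughly like this (a sketch of the one-comment-per-field convention, not the verbatim diff):

```go
// commonPrefix, if set, specifies a prefix shared by all values at this level.
// +optional
CommonPrefix *string

// commonSuffix, if set, specifies a suffix shared by all values at this level.
// +optional
CommonSuffix *string
```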

```yaml
- size: 5
  valuesPerLevel:
  - commonPrefix: pool-1-node-
    roots: [1, 2, 3, 4, 5]
```
Contributor

roots are strings, so we probably need to quote each item, right? Similarly for the commonPrefix field.

Contributor Author

This is YAML, not JSON, so quoting strings is not required. (Nor does it happen in the pre-existing examples.)

Contributor

ah, ok

- In the short term, it increases the number of nodes which can fit into 1 etcd entry.
- By just using a single slice, with extracting common prefix and suffix of all node names, we reach around 60k.
Contributor

Please provide the short calculation and assumptions which lead to the 60k estimate. You may refer to the example from the original issue, but let's ground it in some assumptions.

Contributor Author

As discussed offline, I clarified it slightly - but I wouldn't like to go into the details of the simulations here.
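
For background, the common prefix & suffix extraction itself is straightforward; a minimal sketch along these lines (illustrative node names, not code from the PR):

```go
package main

import (
	"fmt"
	"strings"
)

// commonPrefixSuffix returns the longest prefix and suffix shared by all
// names. The per-domain "roots" are the middles left after stripping both.
// (Illustrative only; a real encoder must also ensure that the prefix and
// suffix do not overlap within the shortest name.)
func commonPrefixSuffix(names []string) (prefix, suffix string) {
	if len(names) == 0 {
		return "", ""
	}
	prefix, suffix = names[0], names[0]
	for _, n := range names[1:] {
		// Shrink the candidates until they match the current name too.
		for !strings.HasPrefix(n, prefix) {
			prefix = prefix[:len(prefix)-1]
		}
		for !strings.HasSuffix(n, suffix) {
			suffix = suffix[1:]
		}
	}
	return prefix, suffix
}

func main() {
	nodes := []string{
		"pool-1-node-17.example.com",
		"pool-1-node-42.example.com",
		"pool-1-node-108.example.com",
	}
	prefix, suffix := commonPrefixSuffix(nodes)
	for _, n := range nodes {
		root := strings.TrimSuffix(strings.TrimPrefix(n, prefix), suffix)
		fmt.Printf("root: %q\n", root) // "17", "42", "108"
	}
}
```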

Comment on lines 1062 to 1069
- Multiple slices allow optimizing even further, if desired. \
Our simulations of more complex algorithms (e.g. heuristic pruning of prefix tree) allowed fitting over 100k nodes. \
(However, at that point we reached a tradeoff between bytesize, encoding time, and conceptual simplicity. Resolving that tradeoff is out of scope of this design; the important thing is that the proposed data format supports various specific algorithms).
- In the long term, as the number of nodes grows, at some point we'll inevitably hit the 1.5MiB limit anyway. \
When this happens, we foresee a need to extract "chunks" of the whole assignment into separate instances of a dedicated CRD (analogously to how [EndpointSlice](https://github.com/kubernetes/kubernetes/blob/3b632270e9b866ee8bf62e89377ae95987671b49/pkg/apis/discovery/types.go#L24-L29) has been introduced in K8s core). \
While the v1beta2 format does not yet do that, by introducing `Slices` we come much closer to this. \
Once there is a need, we can promote (some of) `Slices` to instances of a standalone CRD - but the appropriate type system is already there.
Contributor

Let's move this section to a subsection under Alternatives.

Contributor Author

The first couple of lines discuss multiple slices, which I'm bringing in - so that part is not an "alternative".

The second part indeed is an "alternative", so I moved part of its discussion there - but it was still important for me to underline how the current design relates to that alternative (namely, that it brings us closer to it).

So I tried to organize it accordingly, PTAL.
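
For the record, the deferred alternative would amount to something like a standalone object carrying chunks of the assignment, analogous to EndpointSlice (a purely hypothetical sketch; the type name and fields are illustrative, not from the PR):

```go
package hypothetical // illustrative placement only

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// TopologyAssignmentChunk is a hypothetical standalone object holding a
// portion of a Workload's topology assignment, linked to the owning
// Workload (e.g. via an owner reference), much like EndpointSlice relates
// to a Service.
type TopologyAssignmentChunk struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	// slices reuses the same slice type as the in-Workload format.
	Slices []TopologyAssignmentSlice
}
```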

Contributor

lgtm

@mimowo
Contributor

mimowo commented Oct 29, 2025

@olekzabl feel free to also add yourself here: :)

@olekzabl olekzabl requested a review from mimowo October 30, 2025 08:33
@PBundyra
Contributor

Looks great, thanks!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 30, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 28724d37688f9e27c3f3fecaf433ca215d1a131c

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 30, 2025
**Reasons for discarding/deferring**
- Decreased readability of the API (some info delegated to other objects).
Contributor

nit: I would also add that even if we introduce TopologyAssignmentSlices in the future, it still makes sense to optimize the size of a single slice. This way we can manage fewer slices, and thus achieve better system performance.

Member

+1

Contributor Author

I added a mention below.
Verified offline with @mimowo that it matches his intent.
@tenzen-y PTAL

@mimowo
Contributor

mimowo commented Oct 31, 2025

/lgtm
/approve
In case you still want to add the remark #7419 (comment).
Also let me wait a little in case @tenzen-y has some review comments.
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 31, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 31, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 2252de258bbea0c024e28ef936c8727d4e59a80a

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, olekzabl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2025
@tenzen-y
Member

> /lgtm /approve In case you still want to add the remark #7419 (comment). Also let me wait a little in case @tenzen-y has some review comments. /hold

ACK

Member

@tenzen-y tenzen-y left a comment

That sounds like an interesting enhancement 🥇

```go
type PodSetAssignment struct {
	...

	// topologyAssignment indicates the topology assignment divided into
```
Member

Could we preserve the outdated design as the v1beta1 one?

Contributor Author

I'm not sure if I understand.

Please note that this red block is actually not removed.

- Part of it (applying to both v1beta{1,2}) is moved to a general description below.
  I re-formatted it as "regular Markdown prose", no longer a Golang comment - but the prose itself is the same.
- Another part (specific to v1beta1) is moved to a dedicated section, "Until v1beta1".

Does this address your comment?

Comment on lines +846 to +849
```go
// TASLabel is a label set on the Job's PodTemplate to indicate that the
// PodSet is admitted using TopologyAwareScheduling, and all Pods created
// from the Job's PodTemplate also have the label.
TASLabel = "kueue.x-k8s.io/tas"
```
Member

IIRC, we stopped using this label. Are you indicating that we need to reintroduce it?

Contributor Author

No. This was already here; I just moved it (exactly as part of "preserving the outdated design" ;)

From your comment, I understand this label has been removed, and so should also be removed from the KEP. However, I see it is still used, and only planned for removal:

```go
// TODO: remove after 0.16
if _, ok := pod.Labels[kueue.TASLabel]; ok {
	return true
}
```

So I'm not sure what is the best KEP treatment of such "planned deprecations".

For now, I took no action because this is unrelated to the topic of this PR.

Comment on lines 993 to 998
```go
Counts []int32 `json:"counts,omitempty"`

// universalCount, if set, specifies the number of pods allocated in every domain in this slice.
// Exactly one of counts, universalCount must be set.
// +optional
UniversalCount *int32 `json:"universalCount,omitempty"`
```
Member

Are these counts and universalCount exclusive?

Contributor Author

Yes. I left a comment saying this:

> // Exactly one of counts, universalCount must be set.

And, as mentioned in the KEP, in the actual code I intend to enforce such rules with XValidation.
(However, for the KEP comments I preferred plain prose, to make them easier to read.)
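
Such a mutual-exclusion rule can be expressed in CEL roughly as (a sketch, not the PR's actual marker):

```go
// Sketch only: exactly one of counts, universalCount is set.
// +kubebuilder:validation:XValidation:rule="has(self.counts) != has(self.universalCount)",message="exactly one of counts and universalCount must be set"
```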

```go
// that applies to every domain in the current slice.
// Mutually exclusive with roots, commonPrefix and commonSuffix.
// +optional
UniversalValue *string
```
Member

Are commonPrefix/commonSuffix and universalValue exclusive?

Contributor Author

Yes, they are. Same situation as in this thread.

**Reasons for discarding/deferring**
- Decreased readability of the API (some info delegated to other objects).
Member

+1

- In the short term, it increases the number of nodes which can fit into 1 etcd entry.
- By just using a single slice, with extracting common prefix and suffix of all node names, our simulations (for some real-life node naming schemes) suggested a limit of around 60k nodes.
Member

Do you already have limits for the new Workload w/ slice assignment - like 100k or something?
If yes, I think that would be a good threshold for when to start considering a separate slice CR.

Contributor Author

The limits depend on the chosen encoding strategy (and on the naming scheme used for nodes).

For a start, we're going to use the "single slice, single common prefix & suffix" approach, which seems to raise the limit to ~60k in practice. However, the boundary of "how far we can reach without CRD slices" has not yet been established.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 31, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mimowo October 31, 2025 11:23
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@olekzabl olekzabl requested a review from tenzen-y October 31, 2025 14:03