KEP changes for v1beta2 TopologyAssignment #7419
base: main
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Hi @olekzabl. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

/assign @PBundyra

/ok-to-test
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
ValuesPerLevel []TopologyAssignmentSliceLevelValues
Should it be TopologyAssignmentSliceLevelValues or TopologyLevelAssignment?
IIUC you are asking about the name of the type? Both will work technically, so I don't think it is a matter of "should", but of preference.
It is a common practice in k8s to name the types to include the parent type as a prefix, to make sure there are no conflicts. Sometimes it leads to long struct names, but the struct names aren't user-facing anyway, and it guarantees no conflict. TBH I like the convention because it avoids thinking about the best possible name for each struct; we already put a lot of effort into coming up with names for the fields (which is more important due to the user-facing impact) :). Still, I'm not going to object to the alternative.
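For illustration, here is a minimal sketch of the two naming options under discussion (both hypothetical; the field shapes are simplified and only the naming convention is the point):

```go
// Option A: parent type name used as a prefix - the common k8s convention,
// which avoids collisions with other types in the same API group.
type TopologyAssignmentSliceLevelValues struct {
	CommonPrefix *string  `json:"commonPrefix,omitempty"`
	Roots        []string `json:"roots,omitempty"`
}

// Option B: a shorter, free-standing name - also compiles, but is more likely
// to clash with an existing or future type name.
type TopologyLevelAssignment struct {
	CommonPrefix *string  `json:"commonPrefix,omitempty"`
	Roots        []string `json:"roots,omitempty"`
}
```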
Thanks for catching. I intended the new struct to have a new name. Fixed there.
Actually, this "convention" is just something you would find in multiple places, but I don't think it is universally followed even in core k8s and written down, so "no hard guideline" here IIUC.
Ah, thanks for clarifying, I didn't catch that. Sure, just make sure the naming is aligned to "compile".
I think that @PBundyra mostly complained about the internal inconsistency of my PR - I initially had
That's what I meant, thanks
valuesPerLevel:
- universalValue: block-1
- commonPrefix: rack-
  roots: [1, 2]
If there was another level (let's say sub-rack):

slices:
- size: 3
  valuesPerLevel:
  - universalValue: block-1
  - commonPrefix: rack-
    roots: [1, 2]
  - commonPrefix: sub-rack-
    roots: [a, b, c]
  counts: [3,1,2]
What is the proper combination of racks/sub-racks for particular Pods? Should it be e.g.:

1.
block-1+rack-1+sub-rack-a: 3
block-1+rack-1+sub-rack-b: 1
block-1+rack-2+sub-rack-c: 2

or

2.
block-1+rack-1+sub-rack-a: 3
block-1+rack-2+sub-rack-b: 1
block-1+rack-2+sub-rack-c: 2
Or maybe in this case, rack-1 and rack-2 would be part of different slices?
slices:
- size: 2
  valuesPerLevel:
  - universalValue: block-1
  - universalValue: rack-1
  - commonPrefix: sub-rack-
    roots: [a, b]
  counts: [3,1]
- size: 1
  valuesPerLevel:
  - universalValue: block-1
  - universalValue: rack-2
  - universalValue: sub-rack-c
  universalCount: 1

for option 1.
Yes, different slices is one way to represent this.
Or, alternatively, single slice but with repetitions in Roots. E.g. for Option 1, Roots would be [1,1,2]:
- size: 3
  valuesPerLevel:
  - universalValue: block-1
  - commonPrefix: rack-
    roots: [1,1,2]
  - commonPrefix: sub-rack-
    roots: [a,b,c]
  counts: [3,1,2]

BTW Roots has the following comment on it:
If set, its length must be equal to the "size" field of the TopologyAssignmentSlice.
And (as mentioned in KEP) I intend to enforce such rules with kubebuilder XValidation.
As a result, your example above ([1, 2] / [a, b, c]) would be rejected, because size is 3 while size([1, 2]) == 2. So I'm hoping this addresses the threat of confusion.
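To make the slice semantics concrete, here is a hedged decoding sketch (the Go shapes are simplified assumptions based on the KEP draft fields) that expands the single-slice encoding above into the Option 1 domains:

```go
package main

import "fmt"

// Simplified shapes of the draft types, for illustration only.
type levelValues struct {
	UniversalValue *string
	CommonPrefix   string
	CommonSuffix   string
	Roots          []string
}

type slice struct {
	Size           int
	ValuesPerLevel []levelValues
	Counts         []int
}

func ptr(s string) *string { return &s }

func main() {
	s := slice{
		Size: 3,
		ValuesPerLevel: []levelValues{
			{UniversalValue: ptr("block-1")},
			{CommonPrefix: "rack-", Roots: []string{"1", "1", "2"}},
			{CommonPrefix: "sub-rack-", Roots: []string{"a", "b", "c"}},
		},
		Counts: []int{3, 1, 2},
	}
	// Each index i in [0, Size) describes one topology domain.
	for i := 0; i < s.Size; i++ {
		domain := ""
		for _, lv := range s.ValuesPerLevel {
			if domain != "" {
				domain += "+"
			}
			if lv.UniversalValue != nil {
				domain += *lv.UniversalValue // same value for every domain in the slice
			} else {
				domain += lv.CommonPrefix + lv.Roots[i] + lv.CommonSuffix
			}
		}
		fmt.Printf("%s: %d\n", domain, s.Counts[i])
	}
	// Output:
	// block-1+rack-1+sub-rack-a: 3
	// block-1+rack-1+sub-rack-b: 1
	// block-1+rack-2+sub-rack-c: 2
}
```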
Are you still considering which one is better?
Our short-term plan (say, for 0.16) is to always use just one slice - because that seems to suffice to fit 40k nodes (in fact, even ~60k).
Yet, the option of multiple slices is baked into the format right away, so that we can use it in the more distant future.
And then, we'll have to decide on an encoding strategy.
BTW the general feeling now is that multiple slices add some mental complexity which is needed only for huge workloads - so for your particular example, I think we'd end up with 1 slice forever, just because it's small.
Are you still considering which one is better?
The plan for 0.15 is to use a single slice as a simpler solution, which still allows improving to 100k with the better algo later, as that does not require API changes.
For now I would park the better algo idea as a potential follow up mentioned in the "Alternatives" section.
SGTM
I think it still can be done as an incremental addition later, but I think it is out of scope for the first iteration. The API allows adding the field later while in v1beta2. Please let me know if I'm missing something.
}
```
The format of `TopologyAssignment` depends on the API version; see details [below]().
Fix the link [below]().
Done.
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
Slices []TopologyAssignmentSlice
I think the new k8s api-linter, being worked on in #7283, will require us to also have MaxItems. I know it is hard to say what the cap should be here, but maybe we could at least say 100k.
Each slice brings an overhead of (at least)
{"size":1,"valuesPerLevel":[{"roots":["x"]}],"counts":[1]}
which is 58 chars. So, given the 1.5 MiB limit, it feels safe to bound it by 30k items.
Done that.
But then, Roots can be more numerous. I put 100k there.
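For reference, a tiny back-of-the-envelope sketch of that arithmetic (the 1.5 MiB object budget and the 58-byte minimal slice come from the comment above; the rest is illustrative):

```go
package main

import "fmt"

func main() {
	const objectBudgetBytes = 1536 * 1024 // ~1.5 MiB object size budget
	const minimalSliceBytes = 58          // len(`{"size":1,"valuesPerLevel":[{"roots":["x"]}],"counts":[1]}`)
	// ~27k minimal slices fit into one object, so a cap in the tens of thousands
	// is the right order of magnitude.
	fmt.Println(objectBudgetBytes / minimalSliceBytes)
}
```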
If Roots can be 100k, should MaxItems of Counts be roughly the same? I know there's also universalCount, but maybe we could bump that number anyway in case of some edge cases - or do you think those are not realistic values?
Oh yes, I should have used the same for Counts and Roots. Thank you for catching.
FTR, I just did 2 changes in the limits for Slices (caused by what I learned while implementing v1beta2):

- Reduced `MaxItems` from 30k to 1k.
  30k turned out too much for Kubebuilder cross-validation rules. I got complaints about some CEL expressions being too heavy (by up to 3.9x); apparently the root cause was multiplying 30k (Slices) by 100k (Roots).
  A significant reduction on the Roots side would defeat the purpose of this PR. We want to be able to allocate at least 60k Roots within a single Slice.
  OTOH, 1k Slices feels more than enough in practice.
  (Given "3.9x", reducing to 7500 would suffice, but I wanted to leave much room for any future developments. I suppose that, once rolled out, reducing that number would violate API compatibility rules.)
- Dropped `MinItems=1`, as I realized that TopologyAssignment for 0 Pods can happen (at least in this test).
  It's not entirely obvious how to represent it in the new format - but, given the current convention (TopologyAssignment != nil, inside it Domains not nil but empty), allowing empty Slices has at least the benefit of following that convention.
  - The main disadvantage of that is keeping `Levels` even though they become meaningless.
    However, either of the alternatives (nullify `Levels`, or even the whole `TopologyAssignment`) can also be confusing in some way.
    So, in doubt, I'd stick to the existing convention. Unless there are objections.
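Putting the adjusted limits together, a hedged sketch of how the markers could now look (the numbers reflect this thread; field shapes and tags are assumptions, not the final v1beta2 API):

```go
type TopologyAssignment struct {
	// slices describe the assigned topology domains.
	// MaxItems is kept low (1k) so that CEL cross-validation rules stay within
	// the cost budget; MinItems is intentionally absent so that an assignment
	// of 0 Pods can still be represented by an empty list.
	// +required
	// +listType=atomic
	// +kubebuilder:validation:MaxItems=1000
	Slices []TopologyAssignmentSlice `json:"slices"`
}

type TopologyAssignmentSliceLevelValues struct {
	// roots keep the larger cap so that ~60k+ nodes still fit into one slice.
	// +optional
	// +listType=atomic
	// +kubebuilder:validation:MaxItems=100000
	Roots []string `json:"roots,omitempty"`
}
```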
// +required
// +listType=atomic
// +kubebuilder:validation:MinItems=1
ValuesPerLevel []TopologyAssignmentSliceLevelValues
I think the new api-linter will also want MaxItems; here 16 is enough, as this is the max number of items we support.
Done.
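For completeness, a minimal sketch of the marker being discussed (the surrounding struct is simplified; 16 corresponds to the maximum number of levels mentioned above):

```go
type TopologyAssignmentSlice struct {
	// valuesPerLevel has one entry per topology level, so its length is bounded
	// by the maximum number of levels supported.
	// +required
	// +listType=atomic
	// +kubebuilder:validation:MinItems=1
	// +kubebuilder:validation:MaxItems=16
	ValuesPerLevel []TopologyAssignmentSliceLevelValues `json:"valuesPerLevel"`
}
```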
}

type TopologyAssignmentSliceLevelValues struct {
	// commonPrefix and commonSuffix specify a common prefix & suffix for all values in this assignment.
Let's make two comments, that is the standard convention.
Done.
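A hedged sketch of the split into two comments (field shapes assumed; only the doc-comment convention is the point):

```go
type TopologyAssignmentSliceLevelValues struct {
	// commonPrefix specifies a common prefix for all values in this assignment.
	// +optional
	CommonPrefix *string `json:"commonPrefix,omitempty"`

	// commonSuffix specifies a common suffix for all values in this assignment.
	// +optional
	CommonSuffix *string `json:"commonSuffix,omitempty"`
}
```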
- size: 5
  valuesPerLevel:
  - commonPrefix: pool-1-node-
    roots: [1, 2, 3, 4, 5]
roots are strings, so we probably need to quote each item, right? Similarly for the commonPrefix field.
This is YAML, not JSON, so quoting strings is not required. (Neither does it happen in the pre-existing examples.)
ah, ok
- In the short term, it increases the number of nodes which can fit into 1 etcd entry.
- By just using a single slice, with extracting common prefix and suffix of all node names, we reach around 60k.
Please provide short calculations and the assumptions which lead to the 60k estimation. You may refer to the example from the original issue, but let's make it grounded in some assumptions.
As discussed offline, I clarified it slightly - but I wouldn't like to go into the details of the simulations here.
- Multiple slices allow optimizing even further, if desired. \
  Our simulations of more complex algorithms (e.g. heuristic pruning of prefix tree) allowed fitting over 100k nodes. \
  (However, at that point we reached a tradeoff between bytesize, encoding time, and conceptual simplicity. Resolving that tradeoff is out of scope of this design; the important thing is that the proposed data format supports various specific algorithms).
- In the long term, as the number of nodes grows, at some point we'll inevitably hit the 1.5MiB limit anyway. \
  When this happens, we foresee a need to extract "chunks" of the whole assignment into separate instances of a dedicated CRD (analogously to how [EndpointSlice](https://github.com/kubernetes/kubernetes/blob/3b632270e9b866ee8bf62e89377ae95987671b49/pkg/apis/discovery/types.go#L24-L29) has been introduced in K8s core). \
  While the v1beta2 format does not yet do that, by introducing `Slices` we come much closer to this. \
  Once there is a need, we can promote (some of) `Slices` to instances of a standalone CRD - but the appropriate type system is already there.
Let's move this section as a subsection in Alternatives.
The first couple of lines discuss multiple slices, which I'm bringing in. So it's not an "alternative".
The second part indeed is an "Alternative" - so I moved a part of discussing it there - but still, it was important for me to underline how the current design relates to that alternative. (Namely, that it brings us closer to it).
So I tried to organize it accordingly, PTAL.
lgtm
Looks great, thanks!

LGTM label has been added. Git tree hash: 28724d37688f9e27c3f3fecaf433ca215d1a131c
**Reasons for discarding/deferring**
- Decreased readability of the API (some info delegated to other objects).
nit: I would also add that even if we introduce TopologyAssignmentSlices in the future, it still makes sense to optimize the size of a single slice. This way we can manage fewer slices, and thus achieve better system performance.
+1
/lgtm

LGTM label has been added. Git tree hash: 2252de258bbea0c024e28ef936c8727d4e59a80a

[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mimowo, olekzabl. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.

ACK
That sounds like an interesting enhancement 🥇
type PodSetAssignment struct {
	...

	// topologyAssignment indicates the topology assignment divided into
Could we preserve the outdated design as the v1beta1 one?
I'm not sure if I understand.
Please note that this red block is actually not removed.
- Part of it (applying to both v1beta{1,2}) is moved to a general description below.
  I re-formatted it as "regular Markdown prose", no longer a Golang comment - but the prose itself is the same.
- Another part (specific to v1beta1) is moved to a dedicated section, "Until v1beta1".
Does this address your comment?
// TASLabel is a label set on the Job's PodTemplate to indicate that the
// PodSet is admitted using TopologyAwareScheduling, and all Pods created
// from the Job's PodTemplate also have the label.
TASLabel = "kueue.x-k8s.io/tas"
IIRC, we stopped the usage of this label. Are you indicating that we need to reintroduce this label?
No. This was already here; I just moved it (exactly as a part of "preserving the outdated design" ;)
From your comment, I understand this has been removed, and so should also be removed from the KEP.
However, I see it still used, and only planned for removal:
Lines 40 to 43 in 01466e7:

// TODO: remove after 0.16
if _, ok := pod.Labels[kueue.TASLabel]; ok {
	return true
}
So I'm not sure what is the best KEP treatment of such "planned deprecations".
For now, I took no action because this is unrelated to the topic of this PR.
Counts []int32 `json:"omitempty"`

// universalCount, if set, specifies the number of pods allocated in every domain in this slice.
// Exactly one of count, universalCount must be set.
// +optional
UniversalCount *int32 `json:"omitempty"`
Are these counts and universalCount exclusive?
Yes. I left a comment saying this:
// Exactly one of count, universalCount must be set.
And, as mentioned in the KEP, in the actual code I intend to enforce such rules with XValidation.
(However, for KEP comments, I preferred only prose to make these comments easier to read.)
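For illustration, a hedged sketch of how such a mutual-exclusivity rule could be expressed with an XValidation marker (the CEL expression is an assumption about the implementation, not a quote from it):

```go
// Illustrative cross-field rule: exactly one of counts and universalCount
// must be set on a slice.
//
// +kubebuilder:validation:XValidation:rule="has(self.counts) != has(self.universalCount)",message="exactly one of counts, universalCount must be set"
type TopologyAssignmentSlice struct {
	// counts specifies the number of pods allocated per domain in this slice.
	// +optional
	// +listType=atomic
	Counts []int32 `json:"counts,omitempty"`

	// universalCount, if set, specifies the number of pods allocated in every
	// domain in this slice.
	// +optional
	UniversalCount *int32 `json:"universalCount,omitempty"`
}
```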
// that applies to every domain in the current slice.
// Mutually exclusive with roots, commonPrefix and commonSuffix.
// +optional
UniversalValue *string
Are prefix and suffix and universalValue exclusive?
Yes, they are. Same situation as in this thread.
**Reasons for discarding/deferring**
- Decreased readability of the API (some info delegated to other objects).
+1
- In the short term, it increases the number of nodes which can fit into 1 etcd entry.
- By just using a single slice, with extracting common prefix and suffix of all node names, our simulations (for some real-life node naming schemes) suggested a limit of around 60k nodes.
Do you already have limits for the new Workload with slice assignment? Like 100k or something?
If yes, I think that is a good threshold at which to start considering a separate slice CR.
The limits depend on the chosen strategy of encoding (and on the naming scheme used for nodes).
For the start, we're going to use the "single slice, single common prefix & suffix" approach, which seems to raise the limit to ~60k in practice. However, the boundary of "how far we can reach without CRD slices" has not yet been established.
New changes are detected. LGTM label has been removed.
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
It proposes a new data format of TopologyAssignment, needed to go beyond the current limitation of (effectively) 20-30k nodes supporting a single workload. We intend to use this new format in the v1beta2 API.
This is a part of #7220.
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?