Add documentation for Kubeflow Trainer v2 TrainJob integration with Kueue #7533
base: main
Signed-off-by: narayanasabari <[email protected]>
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@NarayanaSabari: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: NarayanaSabari The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
✅ Deploy Preview for kubernetes-sigs-kueue ready!
Welcome @NarayanaSabari! |
Hi @NarayanaSabari. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Looks good. I added comments, mostly to trim the parts that are already covered elsewhere and should be familiar to Kueue users.
| - [Configure ClusterQueue](/docs/tasks/manage/setup_cluster_queue/) | ||
| - [Kubeflow Python SDK](https://github.com/kubeflow/sdk/) | ||
|
|
||
| ## Troubleshooting |
We already have a troubleshooting page, and the points here seem redundant with it, so I would prefer to avoid copy-pasting them for every job type. Feel free to refer to the generic guide. We can revisit this if there is something specific to TrainJobs.
| metadata: | ||
| name: high-priority-training | ||
| namespace: default | ||
| labels: |
We should rather promote using a workload priority class. We also already have a page for that, so I would prefer to refer there instead.
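For reference, a minimal sketch of what the workload priority class approach might look like for a TrainJob. The class name `training-high`, its value, and the queue name `user-queue` are hypothetical; the `kueue.x-k8s.io/v1beta1` API version is an assumption and may differ depending on the Kueue release in use.

```yaml
# Hypothetical WorkloadPriorityClass; name and value are illustrative only.
apiVersion: kueue.x-k8s.io/v1beta1   # assumed API version
kind: WorkloadPriorityClass
metadata:
  name: training-high
value: 10000
description: "Priority for important training workloads"
---
# Only the metadata relevant to Kueue is shown; the TrainJob spec is omitted.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: high-priority-training
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue        # hypothetical LocalQueue
    kueue.x-k8s.io/priority-class: training-high # refers to the class above
```

The workload priority class affects only Kueue's queueing and preemption order, not the Pod priority.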
| memory: "32Gi" | ||
| nvidia.com/gpu: "2" | ||
| ``` | ||
|
|
This section seems generic to all Job types supported in Kueue. Drop it unless we have something specific to TrainJobs.
| nvidia.com/gpu: "1" | ||
| ``` | ||
|
|
||
| ## LLM Fine-Tuning with Kueue |
Is there such an example in the Kubeflow docs? If so, we could refer there and mention that the only difference is the added queue name label.
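To illustrate the point about the queue name label, here is a minimal sketch of a TrainJob that differs from a plain Trainer v2 manifest only by the `kueue.x-k8s.io/queue-name` label. The job name, the LocalQueue `user-queue`, and the runtime `torch-distributed` are assumptions for illustration, not values taken from this PR.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-trainjob                    # hypothetical name
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # the only Kueue-specific addition (hypothetical LocalQueue)
spec:
  runtimeRef:
    name: torch-distributed                 # assumed ClusterTrainingRuntime installed in the cluster
  trainer:
    numNodes: 1
    resourcesPerNode:
      requests:
        nvidia.com/gpu: "1"
```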
| nominalQuota: 8 | ||
| ``` | ||
|
|
||
| ## Cleanup |
This seems generic too.
|
/ok-to-test |
What type of PR is this?
/kind documentation
/sig scheduling
/area batch
What this PR does / why we need it:
This PR adds comprehensive documentation for using Kubeflow Trainer v2 TrainJob with Kueue for workload management and scheduling. The documentation covers:
The documentation is structured to guide users from basic setup to advanced scenarios, with verified examples that match the actual Trainer v2 API.
Which issue(s) this PR fixes:
Fixes #7345 kubeflow/trainer#2919
Special notes for your reviewer:
kueue.x-k8s.io/queue-name
pkg/apis/trainer/v1alpha1/trainjob_types.go
manifests/base/runtimes/
Does this PR introduce a user-facing change?