@NarayanaSabari

What type of PR is this?

/kind documentation
/sig scheduling
/area batch

What this PR does / why we need it:

This PR adds comprehensive documentation for using Kubeflow Trainer v2 TrainJob with Kueue for workload management and scheduling. The documentation covers:

  • Overview of TrainJob integration with Kueue
  • Queue selection and suspend field configuration
  • Using ClusterTrainingRuntime (cluster-scoped) with complete PyTorch DDP example
  • Using TrainingRuntime (namespace-scoped) for team-specific custom configurations
  • Priority classes with Kueue PriorityClass
  • LLM fine-tuning use cases with TorchTune
  • Gang scheduling configuration with ClusterQueue
  • Monitoring TrainJob status with kubectl
  • Troubleshooting common issues (admission, startup, pod creation)
  • Best practices for production deployments
  • Migration notes highlighting differences from Training Operator v1

The documentation is structured to guide users from basic setup to advanced scenarios, with verified examples that match the actual Trainer v2 API.

Which issue(s) this PR fixes:

Fixes #7345 kubeflow/trainer#2919

Special notes for your reviewer:

  • All API versions, field names, and label formats have been verified against the actual Trainer v2 codebase
  • Examples use the correct queue label: kueue.x-k8s.io/queue-name
  • RuntimeRef structure matches the Go types defined in pkg/apis/trainer/v1alpha1/trainjob_types.go
  • ClusterTrainingRuntime examples align with manifests in manifests/base/runtimes/
  • Documentation follows the existing style and structure of other Kubeflow integration docs (TFJob, PyTorchJob, etc.)
  • Updated the index page to prominently feature Trainer v2 as the recommended option
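For reviewers who want a concrete picture of the points above, here is a minimal sketch of a TrainJob submitted to Kueue. It is not copied from the PR's docs page. The job name, the LocalQueue name `user-queue`, the runtime name `torch-distributed`, and the resource values are illustrative assumptions and need to be adapted to your cluster.

```yaml
# Sketch only: names and resource values below are assumptions, not taken from the PR.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-ddp-example          # hypothetical job name
  namespace: default
  labels:
    # Queue label referenced in the notes above; the value must name an
    # existing LocalQueue in the same namespace (assumed: user-queue).
    kueue.x-k8s.io/queue-name: user-queue
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: ClusterTrainingRuntime     # use TrainingRuntime for a namespace-scoped runtime
    name: torch-distributed          # assumed to be installed in the cluster
  trainer:
    numNodes: 2
    resourcesPerNode:
      requests:
        cpu: "4"
        memory: "16Gi"
        nvidia.com/gpu: "1"
```

Switching `runtimeRef.kind` to `TrainingRuntime` would point the same job at a namespace-scoped runtime instead, which is the second configuration the documentation covers.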

Does this PR introduce a user-facing change?

Added documentation for using Kubeflow Trainer v2 TrainJob with Kueue, covering both ClusterTrainingRuntime and namespace-scoped TrainingRuntime configurations.

Signed-off-by: narayanasabari <[email protected]>
@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 5, 2025
@k8s-ci-robot
Contributor

@NarayanaSabari: The label(s) area/batch cannot be applied, because the repository doesn't have them.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: NarayanaSabari
Once this PR has been reviewed and has the lgtm label, please assign moficodes for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify

netlify bot commented Nov 5, 2025

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 42c30c9
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/690addc3cc69970008409711
😎 Deploy Preview https://deploy-preview-7533--kubernetes-sigs-kueue.netlify.app
To edit notification comments on pull requests, go to your Netlify project configuration.

@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 5, 2025

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: NarayanaSabari / name: Sabari Narayana (42c30c9)

@k8s-ci-robot
Contributor

Welcome @NarayanaSabari!

It looks like this is your first PR to kubernetes-sigs/kueue 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kueue has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 5, 2025
@k8s-ci-robot
Contributor

Hi @NarayanaSabari. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 5, 2025
@mimowo
Contributor

mimowo left a comment

Looks good. I added comments, mostly to trim the parts that are already covered elsewhere and should be familiar to Kueue users.

- [Configure ClusterQueue](/docs/tasks/manage/setup_cluster_queue/)
- [Kubeflow Python SDK](https://github.com/kubeflow/sdk/)

## Troubleshooting
Contributor

We already have a troubleshooting page, and the points here seem redundant with it, so I would prefer to avoid copy-pasting them for every job type. Feel free to refer to the generic guide. We can revisit this remark if there is something specific to TrainJobs.

metadata:
  name: high-priority-training
  namespace: default
  labels:
Contributor

We should rather promote using WorkloadPriorityClass. We also have a page for that, so I would prefer to refer there instead.
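
For context, a sketch of what this suggestion could look like: a Kueue WorkloadPriorityClass plus a TrainJob that references it through a label. The class name `high-priority`, its value, the queue name, and the runtime name are assumptions for illustration. The API version shown may differ depending on the Kueue release, so check the WorkloadPriorityClass page before relying on it.

```yaml
# Sketch only: class name, value, queue name and runtime name are assumptions.
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority                # hypothetical class name
value: 10000                         # higher values are admitted first
description: "Priority for urgent training jobs"
---
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: high-priority-training
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue          # assumed LocalQueue
    kueue.x-k8s.io/priority-class: high-priority   # references the class above
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: ClusterTrainingRuntime
    name: torch-distributed          # assumed to be installed in the cluster
```

Unlike a Kubernetes PriorityClass, a WorkloadPriorityClass only influences Kueue's admission ordering and does not change pod priority.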

memory: "32Gi"
nvidia.com/gpu: "2"
```

Contributor

This section seems very generic for all Job types supported in Kueue. Drop it unless we have something specific.

nvidia.com/gpu: "1"
```

## LLM Fine-Tuning with Kueue
Contributor

Is there such an example in the Kubeflow docs? If so, we could refer there and mention that the only difference is the added queue-name label.

nominalQuota: 8
```

## Cleanup
Contributor

This seems generic too.

@mimowo
Contributor

mimowo commented Nov 5, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 5, 2025
Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/documentation Categorizes issue or PR as related to documentation. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Documentation page for using Kubeflow Trainer v2 with Kueue

3 participants