[Roadmap] KubeRay (or anything for Ray on K8s) v1.4.0 Wishlist #2999
What feature do you want to have in KubeRay or Ray? Please add an emoji to the following comments if you find them useful. Please briefly explain the feature you want in a single comment. This issue is not for discussion, only for voting and proposing. For discussion, send a message to #kuberay-discuss.

Comments
Make Autoscaler V2 the default autoscaler option
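For context on what would change for users: autoscaler v2 is opt-in today. A minimal sketch of the current opt-in, assuming the documented `RAY_enable_autoscaler_v2` environment variable (image tag and group sizes are placeholders):

```yaml
# Sketch: opting a RayCluster into autoscaler v2 today, before it becomes the default.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-autoscaler-v2
spec:
  enableInTreeAutoscaling: true
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.41.0   # placeholder; any recent Ray image
            env:
              # Switch the in-tree autoscaler to the v2 implementation.
              - name: RAY_enable_autoscaler_v2
                value: "1"
        restartPolicy: Never   # recommended with autoscaler v2
  workerGroupSpecs:
    - groupName: workers
      replicas: 0
      minReplicas: 0
      maxReplicas: 10
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.41.0
```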
RayService incremental upgrade
Standardize the KubeRay API server
Idle cluster termination: #2998
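To make the request concrete, a purely hypothetical sketch of what such a knob could look like; `idleTimeoutSeconds` is an invented field, not part of the RayCluster CRD, and #2998 tracks the actual design:

```yaml
# Hypothetical sketch only: "idleTimeoutSeconds" is an invented field used for
# illustration; see #2998 for the real proposal.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-idle
spec:
  # Hypothetical: tear the cluster down after 30 minutes with no running jobs or actors.
  idleTimeoutSeconds: 1800
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.41.0
```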
Documentation and Terraform for the reference architecture
Lightweight job submitter
Integrate Volcano with RayJob: currently, Volcano only integrates with RayCluster.
Integrate YuniKorn with RayJob: currently, YuniKorn only integrates with RayCluster.
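For reference, the existing RayCluster integration is label-driven (per the KubeRay batch scheduler docs); the two items above ask for the same wiring on RayJob. A minimal sketch, assuming the operator was started with Volcano batch scheduling enabled:

```yaml
# Existing RayCluster + Volcano integration. Requires the KubeRay operator to run
# with the Volcano batch scheduler enabled; queue name and images are placeholders.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-volcano
  labels:
    ray.io/scheduler-name: volcano           # hand pod scheduling to Volcano
    volcano.sh/queue-name: kuberay-test-queue  # Volcano queue to gang-schedule into
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.41.0
  workerGroupSpecs:
    - groupName: workers
      replicas: 2
      minReplicas: 2
      maxReplicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.41.0
```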
Support cron scheduling in RayJob
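Until native support lands, one workaround is a plain Kubernetes CronJob that creates a fresh RayJob on each tick. A sketch, assuming a ServiceAccount with RBAC to create RayJobs (omitted here) and a ConfigMap holding a RayJob manifest under the key `rayjob.yaml` that uses `metadata.generateName`, so each run creates a new RayJob; all names and images are placeholders:

```yaml
# Workaround sketch: schedule RayJob creation with a native CronJob.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-rayjob
spec:
  schedule: "0 2 * * *"   # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rayjob-submitter   # needs RBAC to create RayJobs
          restartPolicy: Never
          containers:
            - name: submit
              image: bitnami/kubectl:1.30
              # "create" (not "apply") so metadata.generateName yields a fresh RayJob per run.
              command: ["kubectl", "create", "-f", "/manifests/rayjob.yaml"]
              volumeMounts:
                - name: manifest
                  mountPath: /manifests
          volumes:
            - name: manifest
              configMap:
                name: nightly-rayjob-manifest
```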
KubeRay operator emits metrics about cluster startup time and other events
Multi-k8s support
Multi-k8s / Multi-cloud support
Better support for post-training libraries such as veRL and OpenRLHF
Ray IPv6 support: currently it is not possible to use Ray on an IPv6-only Kubernetes cluster
@kevin85421, @andrewsykim and I wrote some ideas in this Google doc: Ray Kubectl Plugin 1.4.0 Wishlist. Let us know if you'd like the ideas as individual comments here.
Ability to limit the total size of a Ray cluster (across all worker groups, or ideally for selected subsets of groups) in terms of amounts of resources (CPUs, GPUs) rather than number of workers. That is what Kubernetes node pools support, for example, but it is not usable in KubeRay because the autoscaler only thinks in terms of Ray worker groups, not the underlying node pools, and will happily provision pods beyond, say, the CPU limits of the available node pools.
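To illustrate the shape of the request, a purely hypothetical sketch; `resourceLimits` is an invented field that does not exist in any KubeRay CRD:

```yaml
# Hypothetical sketch only: "resourceLimits" is invented for illustration.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-capped
spec:
  # Hypothetical: bound the autoscaler by total resources rather than worker counts,
  # so it cannot provision pods beyond what the underlying node pools can supply.
  resourceLimits:
    cpu: "256"
    nvidia.com/gpu: "16"
  workerGroupSpecs:
    - groupName: gpu-workers
      minReplicas: 0
      maxReplicas: 100   # today, only per-group worker counts can be bounded
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.41.0
```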
Interested in how notification should work. We are currently using a very janky solution: Kyverno injects a command into the job submitter pod that deposits a notification event on our Kafka queue, so the job submitter pod command turns into something like the sketch below.
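The original comment trails off where the command would appear; a hypothetical sketch of the wrapped-command pattern it describes (broker address, topic, script, and the availability of a Kafka producer CLI in the image are all invented placeholders):

```yaml
# Hypothetical sketch of the pattern described above: the submitter's command is
# wrapped so a completion event is deposited on Kafka. All names are placeholders,
# and a Kafka producer CLI is assumed to be present in the image.
containers:
  - name: ray-job-submitter
    image: rayproject/ray:2.41.0
    command: ["/bin/sh", "-c"]
    args:
      - |
        ray job submit --address http://raycluster-head-svc:8265 -- python train.py
        status=$?
        # Deposit a notification event on the Kafka queue regardless of outcome.
        echo "{\"job\":\"$HOSTNAME\",\"exit_code\":$status}" | \
          kafka-console-producer --bootstrap-server kafka:9092 --topic ray-job-events
        exit $status
```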
This architecture seems like a great reference for considering KubeRay's multi-cloud support and its integration with SkyPilot: SkyRay: Seamlessly Extending KubeRay to Multi-Cluster Multi-Cloud Operation
Very small, low-priority request: make RayCluster creation fail for the worker group replica user errors that are currently silently handled by the controller (kuberay/ray-operator/controllers/ray/utils/util.go, lines 336 to 345 at 35bbd62). The idea is to leave the controller behavior as is, but add more validation to the webhook for users who have chosen to enable it.
Can you help support huge LLM inference in the cross-node case?
Add #3271