[Roadmap] KubeRay (or anything for Ray on K8s) v1.4.0 Wishlist #2999

Open
kevin85421 opened this issue Feb 10, 2025 · 22 comments

@kevin85421
Member

kevin85421 commented Feb 10, 2025

What features do you want in KubeRay or Ray? Add an emoji reaction to any of the comments below that you find useful, and briefly describe the feature you want in a single comment. This issue is only for voting and proposing, not for discussion; for discussion, send a message to #kuberay-discuss.

@kevin85421
Member Author

Make Autoscaler V2 the default autoscaler option:

  • [Umbrella] Autoscaler improvements #2600
  • Once the outstanding issues are fixed, it should offer better stability and observability than V1.
  • Run Autoscaler V2 in a separate Pod instead of as a container in the head Pod (a sketch of how V2 is opted into today follows this list).
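For context, here is a minimal sketch (in Go, using the KubeRay CRD types) of how Autoscaler V2 is opted into today, assuming the current mechanism of enabling in-tree autoscaling plus setting the RAY_enable_autoscaler_v2 environment variable on the head container; the container name and image tag are placeholders. Once V2 is the default, the env var should no longer be needed.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// autoscalerV2ClusterSpec returns a RayClusterSpec with today's Autoscaler V2
// opt-in: in-tree autoscaling enabled and RAY_enable_autoscaler_v2 set on the
// head container.
func autoscalerV2ClusterSpec() rayv1.RayClusterSpec {
	return rayv1.RayClusterSpec{
		EnableInTreeAutoscaling: ptr.To(true),
		HeadGroupSpec: rayv1.HeadGroupSpec{
			RayStartParams: map[string]string{},
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "ray-head",
						Image: "rayproject/ray:2.41.0", // placeholder tag
						Env: []corev1.EnvVar{{
							Name:  "RAY_enable_autoscaler_v2",
							Value: "1",
						}},
					}},
				},
			},
		},
	}
}
```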

@kevin85421
Member Author

RayService incremental upgrade:

@kevin85421 kevin85421 pinned this issue Feb 10, 2025
@kevin85421
Member Author

kevin85421 commented Feb 10, 2025

Standardize KubeRay API server

  • We found that some users have built their own KubeRay API servers. Standardizing the KubeRay API server speeds up future adoption because users won't need to build their own again.
  • Make the KubeRay API server flexible: currently its interface is rigid, and users need to open a PR whenever a CRD field is not exposed (one possible pass-through approach is sketched below).
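One hypothetical way to address the flexibility problem, sketched below and not an agreed design: instead of mirroring every CRD field in the API server's own request types, accept a full RayCluster manifest and server-side apply it with the dynamic client, so new CRD fields work without API server changes.

```go
package sketch

import (
	"context"
	"encoding/json"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var rayClusterGVR = schema.GroupVersionResource{
	Group: "ray.io", Version: "v1", Resource: "rayclusters",
}

// applyRayCluster is a hypothetical pass-through handler: the request body is
// an arbitrary RayCluster manifest, which is server-side applied as-is, so any
// field supported by the CRD is automatically supported by the API server.
func applyRayCluster(client dynamic.Interface) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var obj map[string]interface{}
		if err := json.NewDecoder(r.Body).Decode(&obj); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		u := &unstructured.Unstructured{Object: obj}
		_, err := client.Resource(rayClusterGVR).Namespace(u.GetNamespace()).Apply(
			context.Background(), u.GetName(), u,
			metav1.ApplyOptions{FieldManager: "kuberay-apiserver", Force: true})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusCreated)
	}
}
```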

@kevin85421
Member Author

kevin85421 commented Feb 10, 2025

Idle cluster termination: #2998

  • Terminate a RayCluster if there is no running Ray job, to save $$$$ (one possible idle-detection approach is sketched below).
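A sketch of one way the operator could detect idleness (an assumption, not the agreed design): poll the Ray dashboard's Jobs API on the head service and treat the cluster as idle when no job is pending or running; the controller could then delete the RayCluster after a configurable grace period.

```go
package sketch

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// rayJob is the subset of the Ray Jobs REST API response that matters here.
type rayJob struct {
	Status string `json:"status"`
}

// clusterIsIdle reports whether the cluster behind dashboardURL has no
// pending or running Ray jobs.
func clusterIsIdle(dashboardURL string) (bool, error) {
	resp, err := http.Get(fmt.Sprintf("%s/api/jobs/", dashboardURL))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var jobs []rayJob
	if err := json.NewDecoder(resp.Body).Decode(&jobs); err != nil {
		return false, err
	}
	for _, j := range jobs {
		if j.Status == "PENDING" || j.Status == "RUNNING" {
			return false, nil
		}
	}
	return true, nil
}
```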

@kevin85421
Member Author

Documentation and Terraform for the reference architecture

  • For example: GPU scheduling, reducing image-pulling overhead, logging, notifications, etc.

@kevin85421
Member Author

Light-weight job submitter:

  • [Feature] Light-weight job submitter #2537
  • This allows the K8s job submitter to avoid pulling the Ray image, which is typically over 1 GB even in its slimmest variant without ML libraries. The light-weight job submitter is expected to be less than 20 MB, which will improve RayJob startup time (a minimal sketch of such a submitter follows this list).
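A minimal sketch of what such a submitter could look like: a small static Go binary that only speaks the Ray Jobs HTTP API on the head Pod's dashboard port, so its image never needs Ray installed. The env var name and the lack of runtime_env/retry handling are illustrative assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: submitter <entrypoint>")
	}
	// e.g. http://raycluster-head-svc:8265 (hypothetical env var name)
	dashboard := os.Getenv("RAY_DASHBOARD_ADDRESS")

	body, err := json.Marshal(map[string]string{"entrypoint": os.Args[1]})
	if err != nil {
		log.Fatal(err)
	}
	// Submit through the Ray Jobs REST API instead of the `ray job submit` CLI.
	resp, err := http.Post(dashboard+"/api/jobs/", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var out struct {
		SubmissionID string `json:"submission_id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	fmt.Println("submitted:", out.SubmissionID)
}
```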

@kevin85421
Member Author

Integrate Volcano with RayJob: currently, Volcano only integrates with RayCluster.

@kevin85421
Member Author

Integrate YuniKorn with RayJob: currently, YuniKorn only integrates with RayCluster.

@kevin85421
Member Author

Support cron scheduling in RayJob
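A sketch of how the controller could compute the next submission time from a hypothetical spec.schedule field (standard cron syntax), using the same cron library the Kubernetes CronJob controller relies on; the field name and behavior are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

// nextRun returns the next time a scheduled RayJob would be submitted,
// given a standard cron expression from the (hypothetical) spec.schedule field.
func nextRun(schedule string, now time.Time) (time.Time, error) {
	sched, err := cron.ParseStandard(schedule)
	if err != nil {
		return time.Time{}, err
	}
	return sched.Next(now), nil
}

func main() {
	next, err := nextRun("0 2 * * *", time.Now()) // every day at 02:00
	if err != nil {
		panic(err)
	}
	fmt.Println("next RayJob submission at:", next)
}
```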

@kevin85421
Member Author

KubeRay dashboard:

  • A frontend to visualize and manage (e.g. create / delete) KubeRay custom resources.
  • Something similar to the frontend from Roblox's Ray Summit talk

[Image: screenshot of the dashboard frontend from Roblox's Ray Summit talk]

@kevin85421
Member Author

Have the KubeRay operator emit metrics, e.g. cluster startup time and others (one possible metric is sketched below).
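The sketch below shows one such metric; the metric name, labels, and buckets are illustrative assumptions, not an agreed interface.

```go
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// clusterStartupSeconds measures the time from RayCluster creation until the
// cluster first reports ready. Name, labels, and buckets are placeholders.
var clusterStartupSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "kuberay_cluster_startup_duration_seconds",
		Help:    "Time from RayCluster creation to the cluster becoming ready.",
		Buckets: prometheus.ExponentialBuckets(5, 2, 10), // 5s up to ~43min
	},
	[]string{"namespace", "cluster"},
)

func init() {
	// In the operator this would be registered with the controller-runtime
	// metrics registry instead of the default registry.
	prometheus.MustRegister(clusterStartupSeconds)
}

// observeStartup would be called by the controller when a cluster turns ready.
func observeStartup(namespace, cluster string, createdAt time.Time) {
	clusterStartupSeconds.WithLabelValues(namespace, cluster).
		Observe(time.Since(createdAt).Seconds())
}
```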

@kevin85421 kevin85421 changed the title [Roadmap] KubeRay v1.4.0 Wishlist [Roadmap] KubeRay (or anything for Ray on K8s) v1.4.0 Wishlist Feb 10, 2025
@kevin85421
Member Author

kevin85421 commented Feb 11, 2025

Multi-k8s support:

@kevin85421
Member Author

Multi-k8s / Multi-cloud support:

  • Better integration with SkyPilot

@kevin85421
Member Author

Better support for post-training libraries such as veRL and OpenRLHF

@aqemia-aymeric-alixe

Ray IPv6 support: currently it is not possible to use Ray on an IPv6-only Kubernetes cluster.

@davidxia
Contributor

davidxia commented Feb 11, 2025

@kevin85421, @andrewsykim and I wrote some ideas in this Google doc Ray Kubectl Plugin 1.4.0 Wishlist. Let us know if you'd like the ideas as individual comments here.

@jleben

jleben commented Feb 22, 2025

Ability to limit the total size of a Ray cluster (across all worker groups, or ideally for selected subsets of groups) in terms of the amount of resources (CPUs, GPUs) rather than the number of workers. That's what Kubernetes node pools support, for example, but it isn't usable with KubeRay because the autoscaler only thinks in terms of Ray worker groups, not the underlying node pools, and will happily provision Pods beyond the CPU limits of the available node pools.
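A hypothetical shape for such a limit, expressed as a Go type for illustration only; nothing like this exists in the RayCluster CRD today, and the field names are invented.

```go
package sketch

import "k8s.io/apimachinery/pkg/api/resource"

// ResourceCap caps the total resources the autoscaler may provision across a
// set of worker groups, instead of (or in addition to) per-group maxReplicas.
type ResourceCap struct {
	// GroupSelector names the worker groups the cap applies to; empty = all.
	GroupSelector []string `json:"groupSelector,omitempty"`
	// MaxTotal bounds the summed resource requests of all selected workers,
	// e.g. {"cpu": "512", "nvidia.com/gpu": "64"}.
	MaxTotal map[string]resource.Quantity `json:"maxTotal,omitempty"`
}
```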

@han-steve
Contributor

Documentation and Terraform for the reference architecture
For example, GPU scheduling, reduce image pulling overhead, logging, notifications, ... etc

Interested in how notifications should work. We are currently using a very janky solution: Kyverno injects a command into the job submitter Pod that deposits a notification event on our Kafka queue, so the job submitter Pod's command becomes something like bash -c "ray submit ... && send notification". But this solution runs into all sorts of Kyverno bugs, and we are working on migrating the logic to the controller. What is a good way to open-source notification sending in the controller?
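One hypothetical direction for controller-side notifications (an assumption, not a design the maintainers have committed to): the RayJob controller POSTs a small event to a user-configured endpoint when a job reaches a terminal state, and that endpoint can forward to Kafka, Slack, or anything else.

```go
package sketch

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// notifyJobFinished sends a terminal-state event for a RayJob to a
// user-configured webhook endpoint. The event shape and the idea of
// configuring the endpoint per RayJob are illustrative assumptions.
func notifyJobFinished(ctx context.Context, endpoint, namespace, name, status string) error {
	payload, err := json.Marshal(map[string]string{
		"namespace": namespace,
		"rayJob":    name,
		"status":    status, // e.g. "SUCCEEDED" or "FAILED"
	})
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("notification endpoint returned %s", resp.Status)
	}
	return nil
}
```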

@nadongjun
Contributor

Multi-k8s / Multi-cloud support:

  • Better integration with SkyPilot

This architecture seems like a great reference for considering KubeRay’s multi-cloud support and its integration with SkyPilot.

SkyRay: Seamlessly Extending KubeRay to Multi-Cluster Multi-Cloud Operation

@davidxia
Contributor

davidxia commented Feb 26, 2025

Very small, low-priority request: make raycluster_webhook.go fail hard here

```go
func (w *RayClusterWebhook) validateWorkerGroups(rayCluster *rayv1.RayCluster) *field.Error {
```

with an actionable error message during RayCluster creation, for the worker group replica user errors that are currently handled silently by the controller here:
```go
if *workerGroupSpec.MinReplicas > *workerGroupSpec.MaxReplicas {
	log.Info("minReplicas is greater than maxReplicas, using maxReplicas as desired replicas. "+
		"Please fix this to avoid any unexpected behaviors.",
		"minReplicas", *workerGroupSpec.MinReplicas, "maxReplicas", *workerGroupSpec.MaxReplicas)
	workerReplicas = *workerGroupSpec.MaxReplicas
} else if workerGroupSpec.Replicas == nil || *workerGroupSpec.Replicas < *workerGroupSpec.MinReplicas {
	// Replicas is impossible to be nil as it has a default value assigned in the CRD.
	// Add this check to make testing easier.
	workerReplicas = *workerGroupSpec.MinReplicas
} else if *workerGroupSpec.Replicas > *workerGroupSpec.MaxReplicas {
	workerReplicas = *workerGroupSpec.MaxReplicas
}
```

The idea is to leave the controller behavior as is, but add more validation to the webhook for users who have chosen to enable it (a sketch of what the check could look like follows).
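A sketch of the proposed hard-failing check, roughly what could be added to validateWorkerGroups in raycluster_webhook.go; the error wording and field paths are illustrative, not final.

```go
package sketch

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation/field"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// validateWorkerGroupReplicas rejects RayClusters whose worker group replica
// settings are inconsistent, instead of letting the controller clamp them.
func validateWorkerGroupReplicas(rayCluster *rayv1.RayCluster) *field.Error {
	for i, wg := range rayCluster.Spec.WorkerGroupSpecs {
		path := field.NewPath("spec").Child("workerGroupSpecs").Index(i)
		if wg.MinReplicas != nil && wg.MaxReplicas != nil && *wg.MinReplicas > *wg.MaxReplicas {
			return field.Invalid(path.Child("minReplicas"), *wg.MinReplicas,
				fmt.Sprintf("minReplicas (%d) must not exceed maxReplicas (%d)",
					*wg.MinReplicas, *wg.MaxReplicas))
		}
		if wg.Replicas != nil && wg.MaxReplicas != nil && *wg.Replicas > *wg.MaxReplicas {
			return field.Invalid(path.Child("replicas"), *wg.Replicas,
				fmt.Sprintf("replicas (%d) must not exceed maxReplicas (%d)",
					*wg.Replicas, *wg.MaxReplicas))
		}
	}
	return nil
}
```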

@ganisback

Can you help support huge LLM inference in the cross-node case?
#2323

@rueian
Contributor

rueian commented Apr 3, 2025

Add #3271
