[Core] Worker nodes dying with 1000 tasks #52585
Labels
- bug: Something that is supposed to be working; but isn't
- clusters
- community-backlog
- core: Issues that should be addressed in Ray Core
- core-autoscaler: autoscaler related issues
- core-clusters: For launching and managing Ray clusters/jobs/kubernetes
- P2: Important issue, but not time-critical
- stability
What happened + What you expected to happen
I am unable to run 1000 tasks in 1 job with Ray: worker nodes start dying with `Expected termination: received SIGTERM`. My repro program is a simplified version of the example from the docs. When I submit this job with `ray job submit`, the autoscaler starts spinning up worker nodes, which then proceed to die (at a slower rate than they come up, but steadily). For example, 12 minutes in I see 18 alive nodes and 5 dead; with 4 tasks per node, that means 20 tasks have already failed.
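Roughly, the driver has this shape (a minimal sketch of the pattern, not my exact script; the task body, sleep time, and resource request are placeholders):

```python
import time
import ray

ray.init()  # inside a submitted job this connects to the running cluster

@ray.remote(num_cpus=1)
def work(i: int) -> int:
    # Placeholder payload: the real task just does some bounded work.
    time.sleep(60)
    return i

# Launch 1000 tasks in a single job and wait for all of them to finish.
refs = [work.remote(i) for i in range(1000)]
results = ray.get(refs)
print(f"finished {len(results)} tasks")
```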
Versions / Dependencies
Ray 2.44.1, Python 3.12, Google Cloud instances
Reproduction script
These are the commands that I'm running:
driver-env-repro.yaml:

ray-cluster-repro.yaml:
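Since the exact commands did not survive formatting here, the sketch below shows a Python Job Submission SDK equivalent of the `ray job submit` step described above; the dashboard address, the entrypoint file name (`repro.py`), and the inline runtime env (standing in for driver-env-repro.yaml) are assumptions:

```python
from ray.job_submission import JobSubmissionClient

# Hedged equivalent of `ray job submit` against the cluster started from
# ray-cluster-repro.yaml; 127.0.0.1:8265 stands in for the head node's dashboard.
client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    entrypoint="python repro.py",       # hypothetical driver file name
    runtime_env={"working_dir": "."},   # stand-in for driver-env-repro.yaml
)
print(f"submitted job {job_id}")
print(client.get_job_status(job_id))
```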
Issue Severity
None