
[Core] Worker nodes dying with 1000 tasks #52585


Closed
psarka opened this issue Apr 24, 2025 · 3 comments
Labels
bug (Something that is supposed to be working; but isn't), clusters, community-backlog, core (Issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues), core-clusters (For launching and managing Ray clusters/jobs/kubernetes), P2 (Important issue, but not time-critical), stability

Comments


psarka commented Apr 24, 2025

What happened + What you expected to happen

I am unable to run 1000 tasks in a single job with Ray: worker nodes start dying with "Expected termination: received SIGTERM". My repro program is a simplified version of the example from the docs:

import time

import ray


@ray.remote(num_cpus=1, max_retries=0)
def process(task):
    print(f'Starting {task}')
    time.sleep(100)


if __name__ == '__main__':
    ray.init(log_to_driver=False)
    unfinished = [process.remote(i) for i in range(1000)]

    # Block until tasks complete, one at a time; the results themselves aren't needed.
    while unfinished:
        finished, unfinished = ray.wait(unfinished, num_returns=1, fetch_local=False)

When I submit this job with ray job submit, the autoscaler starts spinning up worker nodes, which then proceed to die (more slowly than they come up, but steadily). For example, 12 minutes in, I see 18 alive nodes and 5 dead. With 4 tasks per node, that means 20 tasks have already failed.
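
To see which tasks actually fail, rather than just waiting, here is a minimal sketch (not my exact monitoring code) that retrieves each finished result so that worker crashes surface as exceptions, and counts alive/dead nodes via ray.nodes():

import time

import ray
from ray.exceptions import RayError


@ray.remote(num_cpus=1, max_retries=0)
def process(task):
    print(f'Starting {task}')
    time.sleep(100)


if __name__ == '__main__':
    ray.init(log_to_driver=False)
    unfinished = [process.remote(i) for i in range(1000)]

    failed = 0
    while unfinished:
        finished, unfinished = ray.wait(unfinished, num_returns=1)
        try:
            # ray.get raises (e.g. WorkerCrashedError) if the worker was SIGTERMed
            ray.get(finished)
        except RayError:
            failed += 1
        nodes = ray.nodes()
        alive = sum(1 for n in nodes if n['Alive'])
        print(f'failed tasks: {failed}, alive nodes: {alive}, dead nodes: {len(nodes) - alive}')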

Versions / Dependencies

Ray 2.44.1, Python 3.12, Google Cloud (GCE) instances

Reproduction script

These are the commands that I'm running:

uv run ray up ray-cluster-repro.yaml -y
uv run ray attach ray-cluster-repro.yaml -p 10001
uv run ray dashboard ray-cluster-repro.yaml
RAY_ADDRESS="ray://localhost:10001" uv run ray job submit --no-wait --working-dir=./ --runtime-env=driver-env-repro.yaml -- python cluster_test.py
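
To watch the autoscaler while the job runs, I also tail the monitor output and poll cluster status (standard ray CLI commands, shown here as a sketch; the output format may differ between versions):

uv run ray monitor ray-cluster-repro.yaml
uv run ray exec ray-cluster-repro.yaml "ray status"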

driver-env-repro.yaml:

env_vars:
  RAY_RUNTIME_ENV_HOOK: "ray._private.runtime_env.uv_runtime_env_hook.hook"

py_executable: "uv run"

ray-cluster-repro.yaml:

cluster_name: repro
max_workers: 1024
upscaling_speed: 1.0
docker:
  image: rayproject/ray:2.44.1-py312-cpu
  container_name: "ray_container"
  pull_before_run: True
  run_options:  # Extra options to pass into "docker run"
    - --ulimit nofile=65536:65536
  worker_image: "rayproject/ray:2.44.1-py312-cpu"

idle_timeout_minutes: 3

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-a
    project_id: axial-matter-417704

auth:
    ssh_user: ubuntu

available_node_types:
    ray_head_default:
        resources: {"CPU": 0}
        node_config:
            machineType: n4-highmem-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 1000
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922

    ray_worker_n4_standard_4:
      min_workers: 1
      max_workers: 100
      resources: {"CPU": 4}
      node_config:
        machineType: n4-standard-4
        disks:
          - boot: true
            autoDelete: true
            type: PERSISTENT
            initializeParams:
              diskSizeGb: 50
              sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
        scheduling:
          - preemptible: false
        serviceAccounts:
          - email: [email protected]
            scopes:
              - https://www.googleapis.com/auth/cloud-platform


head_node_type: ray_head_default

file_mounts: {}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

setup_commands: [
  pip install uv
]

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Issue Severity

None

@psarka psarka added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 24, 2025
@masoudcharkhabi masoudcharkhabi added clusters core Issues that should be addressed in Ray Core stability labels Apr 25, 2025

psarka commented Apr 29, 2025

Is there anything I can do to help you to investigate / narrow this down?

@kevin85421 kevin85421 added core-clusters For launching and managing Ray clusters/jobs/kubernetes P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 29, 2025
kevin85421 (Member) commented

Would you mind trying KubeRay instead? We are currently focusing on improving the autoscaler in KubeRay (ray-project/kuberay#2600).
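
For reference, a minimal RayCluster manifest sketch with the in-tree autoscaler enabled (field names follow the ray.io/v1 CRD; the group name and resource sizes are placeholders, not a tested config):

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: repro
spec:
  rayVersion: "2.44.1"
  enableInTreeAutoscaling: true   # KubeRay runs the autoscaler as a sidecar on the head pod
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"               # keep tasks off the head node, as in the VM config above
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.44.1-py312-cpu
  workerGroupSpecs:
    - groupName: cpu-workers      # placeholder group name
      minReplicas: 1
      maxReplicas: 100
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.44.1-py312-cpu
              resources:
                requests:
                  cpu: "4"
                limits:
                  cpu: "4"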

@kevin85421 kevin85421 added the core-autoscaler autoscaler related issues label Apr 29, 2025
@psarka
Copy link
Author

psarka commented May 27, 2025

I tried KubeRay and it is a total blast! 🎉

Nodes are not dying, and they are also spinning up very fast. As far as I'm concerned, this issue can be closed. Thanks, guys!
