
[Core] Worker nodes dying with 1000 tasks #52585


Closed
psarka opened this issue Apr 24, 2025 · 3 comments
Labels
bug (Something that is supposed to be working; but isn't), clusters, community-backlog, core (Issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues), core-clusters (For launching and managing Ray clusters/jobs/kubernetes), P2 (Important issue, but not time-critical), stability

Comments


psarka commented Apr 24, 2025

What happened + What you expected to happen

I am unable to run 1000 tasks in a single job with Ray: worker nodes start dying with "Expected termination: received SIGTERM". My repro program is a simplified version of the example from the docs:

import time

import ray


@ray.remote(num_cpus=1, max_retries=0)
def process(task):
    print(f'Starting {task}')
    time.sleep(100)


if __name__ == '__main__':
    ray.init(log_to_driver=False)
    unfinished = [process.remote(i) for i in range(1000)]

    # Block until tasks complete, one at a time; the results themselves aren't needed.
    while unfinished:
        finished, unfinished = ray.wait(unfinished, num_returns=1, fetch_local=False)

When I submit this job with ray job submit, the autoscaler starts spinning up worker nodes, which then proceed to die (more slowly than they come up, but steadily). For example, 12 minutes in, I see 18 alive nodes and 5 dead. With 4 tasks per node, that means 20 tasks have already failed.
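
To see which tasks actually fail, rather than just waiting, here is a minimal sketch (not my exact monitoring code) that retrieves each finished result so that worker crashes surface as exceptions, and counts alive/dead nodes via ray.nodes():

import time

import ray
from ray.exceptions import RayError


@ray.remote(num_cpus=1, max_retries=0)
def process(task):
    print(f'Starting {task}')
    time.sleep(100)


if __name__ == '__main__':
    ray.init(log_to_driver=False)
    unfinished = [process.remote(i) for i in range(1000)]

    failed = 0
    while unfinished:
        finished, unfinished = ray.wait(unfinished, num_returns=1)
        try:
            # ray.get raises (e.g. WorkerCrashedError) if the worker was SIGTERMed
            ray.get(finished)
        except RayError:
            failed += 1
        nodes = ray.nodes()
        alive = sum(1 for n in nodes if n['Alive'])
        print(f'failed tasks: {failed}, alive nodes: {alive}, dead nodes: {len(nodes) - alive}')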

Versions / Dependencies

Ray 2.44.1, Python 3.12, Google Cloud (GCE) instances

Reproduction script

These are the commands that I'm running:

uv run ray up ray-cluster-repro.yaml -y
uv run ray attach ray-cluster-repro.yaml -p 10001
uv run ray dashboard ray-cluster-repro.yaml
RAY_ADDRESS="ray://localhost:10001" uv run ray job submit --no-wait --working-dir=./ --runtime-env=driver-env-repro.yaml -- python cluster_test.py
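
To watch the autoscaler while the job runs, I also tail the monitor output and poll cluster status (standard ray CLI commands, shown here as a sketch; the output format may differ between versions):

uv run ray monitor ray-cluster-repro.yaml
uv run ray exec ray-cluster-repro.yaml "ray status"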

driver-env-repro.yaml:

env_vars:
  RAY_RUNTIME_ENV_HOOK: "ray._private.runtime_env.uv_runtime_env_hook.hook"

py_executable: "uv run"

ray-cluster-repro.yaml:

cluster_name: repro
max_workers: 1024
upscaling_speed: 1.0
docker:
  image: rayproject/ray:2.44.1-py312-cpu
  container_name: "ray_container"
  pull_before_run: True
  run_options:  # Extra options to pass into "docker run"
    - --ulimit nofile=65536:65536
  worker_image: "rayproject/ray:2.44.1-py312-cpu"

idle_timeout_minutes: 3

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-a
    project_id: axial-matter-417704

auth:
    ssh_user: ubuntu

available_node_types:
    ray_head_default:
        resources: {"CPU": 0}
        node_config:
            machineType: n4-highmem-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 1000
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922

    ray_worker_n4_standard_4:
      min_workers: 1
      max_workers: 100
      resources: {"CPU": 4}
      node_config:
        machineType: n4-standard-4
        disks:
          - boot: true
            autoDelete: true
            type: PERSISTENT
            initializeParams:
              diskSizeGb: 50
              sourceImage: projects/deeplearning-platform-release/global/images/common-cpu-v20240922
        scheduling:
          - preemptible: false
        serviceAccounts:
          - email: [email protected]
            scopes:
              - https://www.googleapis.com/auth/cloud-platform


head_node_type: ray_head_default

file_mounts: {}

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

setup_commands: [
  pip install uv
]

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - >-
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - >-
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Issue Severity

None

@psarka psarka added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 24, 2025
@masoudcharkhabi masoudcharkhabi added clusters core Issues that should be addressed in Ray Core stability labels Apr 25, 2025

psarka commented Apr 29, 2025

Is there anything I can do to help you to investigate / narrow this down?

@kevin85421 kevin85421 added core-clusters For launching and managing Ray clusters/jobs/kubernetes P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 29, 2025
kevin85421 (Member) commented

Would you mind trying KubeRay instead? We are currently focusing on improving the autoscaler in KubeRay (ray-project/kuberay#2600).
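
For reference, a minimal RayCluster manifest sketch with the in-tree autoscaler enabled (field names follow the ray.io/v1 CRD; the group name and resource sizes are placeholders, not a tested config):

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: repro
spec:
  rayVersion: "2.44.1"
  enableInTreeAutoscaling: true   # KubeRay runs the autoscaler as a sidecar on the head pod
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"               # keep tasks off the head node, as in the VM config above
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.44.1-py312-cpu
  workerGroupSpecs:
    - groupName: cpu-workers      # placeholder group name
      minReplicas: 1
      maxReplicas: 100
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.44.1-py312-cpu
              resources:
                requests:
                  cpu: "4"
                limits:
                  cpu: "4"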

@kevin85421 kevin85421 added the core-autoscaler autoscaler related issues label Apr 29, 2025
@psarka
Copy link
Author

psarka commented May 27, 2025

I tried KubeRay and it is a total blast! 🎉

Nodes are not dying, and they are also spinning up very fast. As far as I'm concerned, this issue can be closed. Thanks, guys!
