Contributor

@harborn harborn commented Aug 17, 2023

Why are these changes needed?

Intel also provides general-purpose compute GPUs.
Intel's internal benchmarks show that Intel GPUs perform well on LLM training and inference workloads.

This PR aims to support Intel GPUs in Ray.
We add two GPU device types: INTEL_MAX_1550 and INTEL_MAX_1100.

With this change, users can use Intel GPUs almost seamlessly, in the same way they use Nvidia’s different GPU devices.

Usage of different GPU types in a Ray cluster

To use different GPUs in a Ray cluster:

  1. If the Ray cluster has only one GPU type, you don’t have to specify it in the task/actor. If accelerator_type is not set in the task/actor options, Ray automatically uses that single GPU type.
  2. If the Ray cluster has more than one GPU type and a task/actor does not set accelerator_type in its options, Ray raises a ValueError, because it cannot decide which GPU should run the task/actor.

For example:

import ray
from ray.cluster_utils import Cluster
from ray.util.accelerators import NVIDIA_TESLA_V100, INTEL_MAX_1550

# assumed setup: a local test cluster that nodes can be added to
cluster = Cluster()

# add a node with Nvidia GPUs to the cluster
cluster.add_node(num_cpus=1, num_gpus=8, resources={f"accelerator_type:{NVIDIA_TESLA_V100}": 1})

# add a node with Intel GPUs to the cluster
cluster.add_node(num_cpus=1, num_gpus=8, resources={f"accelerator_type:{INTEL_MAX_1550}": 1})

ray.init(address=cluster.address)

# use an Nvidia GPU to train
@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def train(data):
    return "This function was run on a node with an Nvidia Tesla V100 GPU"

ray.get(train.remote(1))

# use an Intel GPU to infer
@ray.remote(num_gpus=1, accelerator_type=INTEL_MAX_1550)
def infer(data):
    return "This function was run on a node with an Intel Max 1550 GPU"

ray.get(infer.remote(1))

The changes consist of two parts:

  1. upgrades to the GPU detection process of ray.init
  2. upgrades to how Ray tasks and actors use GPU resources

Upgrades to the GPU detection process of ray.init

ray.init autodetects all supported kinds of GPUs, currently:

  • Nvidia GPU
  • Intel GPU

GPU information is detected during ray.init() and stored in the resources field of the options.
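
As a rough illustration of what "stored in the resources field" could look like, the sketch below builds a node resource dictionary from a detected GPU count and type. The helper name build_gpu_resources is hypothetical and not part of Ray; the "accelerator_type:" key prefix follows the cluster.add_node example earlier in this description.

```python
from typing import Dict, Optional


def build_gpu_resources(num_gpus: int, accelerator_type: Optional[str]) -> Dict[str, float]:
    """Hypothetical helper: turn detection results into a resources dict."""
    resources: Dict[str, float] = {}
    if num_gpus > 0:
        resources["GPU"] = float(num_gpus)
        if accelerator_type is not None:
            # Marker resource that lets tasks target this GPU type.
            resources[f"accelerator_type:{accelerator_type}"] = 1.0
    return resources


# The type string is a placeholder, not necessarily the real constant value.
print(build_gpu_resources(8, "INTEL_MAX_1550"))
```

A node without GPUs yields an empty dictionary, so no accelerator marker is advertised for it.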

Upgrades to Ray tasks and actors

Only one accelerator type in the current Ray cluster

# only one device type, NVIDIA_TESLA_V100, is detected, so it is used by default
@ray.remote(num_gpus=1)
def func():
    pass

@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def func():
    pass

@ray.remote(num_gpus=1, resources={f"accelerator_type:{NVIDIA_TESLA_V100}": 1})
def func():
    pass

Multiple accelerator types in the current Ray cluster

Accelerator type not specified

@ray.remote(num_gpus=1)
def func():
    pass
# raises ValueError("current ray service has multi type GPU, please choose one")

Accelerator type specified

# for example, use INTEL_MAX_1550
@ray.remote(num_gpus=1, accelerator_type=INTEL_MAX_1550)
def func():
    pass

@ray.remote(num_gpus=1, resources={f"accelerator_type:{INTEL_MAX_1550}": 1})
def func():
    pass
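
The selection rule described in this section (a single GPU type is used by default, multiple types require an explicit choice) can be sketched as a small pure function. This is only an illustration under stated assumptions: resolve_accelerator_type is a hypothetical name, not a Ray API, and the actual check in the PR may live elsewhere and behave differently in edge cases.

```python
from typing import Optional, Set


def resolve_accelerator_type(
    cluster_types: Set[str], requested: Optional[str] = None
) -> Optional[str]:
    """Hypothetical sketch of the GPU type selection rule."""
    if requested is not None:
        # An explicit accelerator_type always wins.
        return requested
    if len(cluster_types) > 1:
        # Ambiguous: the scheduler cannot pick a GPU type on its own.
        raise ValueError("current ray service has multi type GPU, please choose one")
    # Zero or one type: fall back to the only type, if there is one.
    return next(iter(cluster_types), None)


print(resolve_accelerator_type({"NVIDIA_TESLA_V100"}))
print(resolve_accelerator_type({"NVIDIA_TESLA_V100", "INTEL_MAX_1550"}, "INTEL_MAX_1550"))
```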

Related issue number

#36493 previous implementation
#37998 auto detect aws accelerator

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@harborn harborn changed the title from "Support Intel GPU" to "[Core] Support Intel GPU" Aug 17, 2023
Contributor Author

harborn commented Aug 17, 2023

Please review this PR instead of https://github.com/ray-project/ray/pull/36493.
Sorry about the commit problems.
@abhilash1910 @cadedaniel @scv119

Contributor Author

harborn commented Aug 17, 2023

@xwu99
Please take a look here, thanks.

Contributor Author

harborn commented Aug 17, 2023

I have also addressed the earlier review comments:

  1. Added 3 unit tests: test_xpu_ids, test_local_mode_xpus, test_disable_xpu_devices.
  2. Renamed RAY_ACCELERATOR to RAY_EXPERIMENTAL_ACCELERATOR_TYPE.
  3. Unified the two environment variables into one, ONEAPI_DEVICE_SELECTOR, which plays a role similar to CUDA_VISIBLE_DEVICES. Removed XPU_VISIBLE_DEVICES, which IPEX 1.13 and 2.0 do not actually use.
  4. Added some comments in the code.
  5. Only one type of accelerator can be used in a Ray cluster, even if the cluster contains multiple accelerator types.
  6. @xwu99 has added some documentation updates.
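
To make item 3 concrete, here is a minimal sketch of how a list of device indices could be turned into an ONEAPI_DEVICE_SELECTOR value, in the same spirit as CUDA_VISIBLE_DEVICES="0,1". It assumes the Level Zero backend and the "backend:indices" form of the selector; the helper name oneapi_selector_env is hypothetical and is not the function used in this PR.

```python
from typing import Dict, List


def oneapi_selector_env(device_ids: List[int], backend: str = "level_zero") -> Dict[str, str]:
    """Hypothetical helper: build the env entry restricting visible Intel GPUs."""
    ids = ",".join(str(i) for i in device_ids)
    return {"ONEAPI_DEVICE_SELECTOR": f"{backend}:{ids}"}


# Comparable to CUDA_VISIBLE_DEVICES="0,1" for Nvidia GPUs.
print(oneapi_selector_env([0, 1]))  # {'ONEAPI_DEVICE_SELECTOR': 'level_zero:0,1'}
```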

@cadedaniel


def test_disable_xpu_devices():
    script = """
import ray

Maybe indent the quoted script:

script= """
            import ray .....

LGTM otherwise
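
One possible way to follow this suggestion without changing what the driver script contains is to indent the quoted script with the surrounding code and strip the common indentation with textwrap.dedent before running it. The function name build_script and the script body below are placeholders, not the test from this PR.

```python
import textwrap


def build_script() -> str:
    # The script body is indented with the surrounding code for readability;
    # textwrap.dedent strips the common leading whitespace before it is run.
    script = """
        import sys
        print(sys.version_info.major)
    """
    return textwrap.dedent(script)


print(build_script())
```

The dedented string starts at column zero again, so it can be passed to whatever helper executes the script as a driver.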

@abhilash1910 abhilash1910 left a comment

LGTM! Thanks

Contributor Author

harborn commented Aug 18, 2023

Previous comments are in https://github.com/ray-project/ray/pull/36493

Contributor Author

harborn commented Oct 7, 2023

@harborn

Is there a separate channel for discussions related to further integration/development (such as slack/discord etc?)

Could you reach out to me on Ray slack? We should set up a collaboration channel.

OK, I will reach out to you on Slack.

It's better not to change the original format.

Better to rephrase as: GPUs on the same node should be of the same type, but different nodes can have different types of GPUs.

Remove the redundant comment.

The long description should move up to the first paragraph.

The block above can be removed, since ONEAPI_DEVICE_SELECTOR already applies to dpctl.

@harborn harborn closed this Oct 19, 2023
@harborn harborn reopened this Oct 19, 2023
Collaborator

@jjyao jjyao left a comment

Could you create a test_intel_gpu.py file and add some tests? See test_tpu.py for an example.

Collaborator

jjyao commented Oct 19, 2023

Lint failed:

python/ray/_private/accelerators/intel_gpu.py:1:1: F401 're' imported but unused
python/ray/_private/accelerators/intel_gpu.py:3:1: F401 'sys' imported but unused
python/ray/_private/accelerators/intel_gpu.py:5:1: F401 'subprocess' imported but unused
python/ray/_private/accelerators/intel_gpu.py:6:1: F401 'importlib' imported but unused
python/ray/_private/accelerators/intel_gpu.py:55:18: E711 comparison to None should be 'if cond is not None:'
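
For reference, the E711 finding is resolved by comparing against None with an identity check, and the F401 findings by deleting the unused imports. The function below is a hypothetical illustration of the E711 pattern only; it is not the code at line 55 of intel_gpu.py.

```python
from typing import Optional


def describe_selector(value: Optional[str]) -> str:
    # Flagged by E711:  if value != None:
    # Preferred form: identity comparison against the None singleton.
    if value is not None:
        return f"selector={value}"
    return "selector unset"


print(describe_selector(None))
print(describe_selector("level_zero:0"))
```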

Comment on lines 62 to 71
Collaborator

This won't test anything. Since we didn't mock IntelGPUAcceleratorManager.get_current_node_num_accelerators, both nodes will have Nvidia GPUs.
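
A minimal sketch of the kind of mocking meant here, using unittest.mock.patch.object. To keep the example self-contained, the class below is a local stand-in with the same name and method as the real manager; it is not imported from Ray, and the helper detected_intel_gpus is hypothetical.

```python
from unittest.mock import patch


class IntelGPUAcceleratorManager:
    """Local stand-in for the real manager class; not imported from Ray."""

    @staticmethod
    def get_current_node_num_accelerators() -> int:
        return 0  # pretend no Intel GPU is physically present


def detected_intel_gpus() -> int:
    return IntelGPUAcceleratorManager.get_current_node_num_accelerators()


# Inside the context manager the detection call reports 4 Intel GPUs.
with patch.object(
    IntelGPUAcceleratorManager,
    "get_current_node_num_accelerators",
    return_value=4,
):
    inside = detected_intel_gpus()

# Outside, the original method is restored.
outside = detected_intel_gpus()
print(inside, outside)
```

In a real test, the same patch would target the manager class imported from Ray, so that a node appears to have Intel GPUs even on a machine that only has Nvidia hardware.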

harborn added 9 commits October 24, 2023 23:53
Signed-off-by: harborn <[email protected]>
Collaborator

jjyao commented Oct 24, 2023

Tests failed on Windows:

================================== FAILURES ===================================
___________________ test_get_current_node_num_accelerators ____________________

    def test_get_current_node_num_accelerators():
        old_dpctl = None
        if "dpctl" in sys.modules:
            old_dpctl = sys.modules["dpctl"]

>       sys.modules["dpctl"] = __import__("mock_dpctl_1")
E       ModuleNotFoundError: No module named 'mock_dpctl_1'

\\?\C:\Users\ContainerAdministrator\AppData\Local\Temp\Bazel.runfiles_s1j40za1\runfiles\com_github_ray_project_ray\python\ray\tests\accelerators\test_intel_gpu.py:38: ModuleNotFoundError
___________________ test_get_current_node_accelerator_type ____________________

    def test_get_current_node_accelerator_type():
        old_dpctl = None
        if "dpctl" in sys.modules:
            old_dpctl = sys.modules["dpctl"]

>       sys.modules["dpctl"] = __import__("mock_dpctl_1")
E       ModuleNotFoundError: No module named 'mock_dpctl_1'

\\?\C:\Users\ContainerAdministrator\AppData\Local\Temp\Bazel.runfiles_s1j40za1\runfiles\com_github_ray_project_ray\python\ray\tests\accelerators\test_intel_gpu.py:53: ModuleNotFoundError
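
The failure comes from importing an on-disk mock module by name, which depends on the import path of the test runner. One possible alternative, shown here only as a sketch and not necessarily the fix that was committed, is to build the fake module in memory and install it with unittest.mock.patch.dict, which also restores sys.modules afterwards. The attribute get_num_devices on the fake module and the helper count_devices are illustrative assumptions.

```python
import sys
import types
from unittest.mock import patch

# In-memory stand-in for dpctl; the attribute below is illustrative only.
fake_dpctl = types.ModuleType("dpctl")
fake_dpctl.get_num_devices = lambda: 6


def count_devices() -> int:
    import dpctl  # resolved through sys.modules, so the fake is returned

    return dpctl.get_num_devices()


had_dpctl = "dpctl" in sys.modules
with patch.dict(sys.modules, {"dpctl": fake_dpctl}):
    count = count_devices()

# patch.dict restores the previous state of sys.modules on exit.
restored = ("dpctl" in sys.modules) == had_dpctl
print(count, restored)
```

Because nothing is looked up on disk, this approach does not depend on how Bazel lays out the runfiles directory.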

Signed-off-by: harborn <[email protected]>
@jjyao jjyao merged commit 8cfc894 into ray-project:master Oct 25, 2023
@rickyyx rickyyx mentioned this pull request Dec 7, 2023
7 participants