[Core] Support Intel GPU #38553

harborn · 2023-08-17T10:06:54Z

Why are these changes needed?

Intel also provides general-purpose compute GPUs.
Intel's internal benchmarks show that Intel GPUs deliver strong performance on LLM training and inference workloads.

This PR aims to support Intel GPUs in Ray.
We add two GPU device types: INTEL_MAX_1550 and INTEL_MAX_1100.

This change lets users use Intel GPUs almost seamlessly, just as they use Nvidia's various GPU devices.

Usage of different GPU types in a Ray cluster

To use different GPUs in a Ray cluster:

If the cluster has only one GPU type, you don't have to specify it in the task/actor. If the task/actor options contain no accelerator_type, Ray automatically uses that single GPU type.
If the cluster has more than one GPU type and the task/actor options don't provide accelerator_type, Ray raises a ValueError because it can't decide which GPU should run the task/actor.

For example:

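import ray
from ray.cluster_utils import Cluster
from ray.util.accelerators import NVIDIA_TESLA_V100, INTEL_MAX_1550

# local test cluster that the nodes below are added to
cluster = Cluster()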

# add a node with Nvidia GPU to cluster
cluster.add_node(num_cpus=1, num_gpus=8, resources={f"accelerator_type:{NVIDIA_TESLA_V100}": 1})

# add a node with Intel GPU to cluster
cluster.add_node(num_cpus=1, num_gpus=8, resources={f"accelerator_type:{INTEL_MAX_1550}": 1})

ray.init(address=cluster.address)

# use Nvidia GPU to train
@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def train(data):
    return "This function was run on a node with a Nvidia Tesla V100 GPU"

ray.get(train.remote(1))

# use Intel GPU to infer
@ray.remote(num_gpus=1, accelerator_type=INTEL_MAX_1550)
def infer(data):
    return "This function was run on a node with an Intel Max 1550 GPU"

ray.get(infer.remote(1))

The changes consist of two parts:

Upgrades to the GPU detection process of ray.init
Upgrades to GPU resource usage in Ray tasks and actors

Upgrades to the GPU detection process of ray.init

ray.init autodetects all kinds of GPUs, currently including:

Nvidia GPU
Intel GPU

GPU information is detected during ray.init() and stored in the resources field of the options.

Upgrades to Ray tasks and actors

Only one accelerator type in the current Ray service

# only one device type, NVIDIA_TESLA_V100, is detected, so it is used by default
@ray.remote(num_gpus=1)
def func():
    pass

@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def func():
    pass

@ray.remote(num_gpus=1, resources={f"accelerator_type:{NVIDIA_TESLA_V100}": 1})
def func():
    pass

Multiple accelerator types in the current Ray service

Accelerator type not specified

@ray.remote(num_gpus=1)
def func():
    pass
# raises ValueError("current ray service has multi type GPU, please choose one")

Accelerator type specified

# for example, use INTEL_MAX_1550
@ray.remote(num_gpus=1, accelerator_type=INTEL_MAX_1550)
def func():
    pass

@ray.remote(num_gpus=1, resources={f"accelerator_type:{INTEL_MAX_1550}": 1})
def func():
    pass
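The selection rule described in this section can be sketched in plain Python. This is only an illustration of the described behavior, not the code in this PR; `resolve_accelerator_type` and its parameters are hypothetical names.

```python
from typing import Iterable, Optional


def resolve_accelerator_type(
    cluster_types: Iterable[str], requested: Optional[str] = None
) -> Optional[str]:
    """Pick the accelerator type a GPU task should use.

    Hypothetical helper that mirrors the rule in the PR description.
    """
    if requested is not None:
        # An explicit accelerator_type always wins.
        return requested
    available = set(cluster_types)
    if len(available) > 1:
        # Several GPU types and no hint: refuse to guess.
        raise ValueError("current ray service has multi type GPU, please choose one")
    # Zero or one type: use the only type, or None when there is no GPU.
    return next(iter(available), None)
```

With a single type in the cluster the helper returns that type; with two types it raises unless `requested` is given.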

Related issue number

#36493 previous implementation
#37998 auto detect aws accelerator

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

harborn · 2023-08-17T10:12:29Z

Please review this PR instead of https://github.com/ray-project/ray/pull/36493.
Sorry about the commit problems there.
@abhilash1910 @cadedaniel @scv119

harborn · 2023-08-17T10:23:51Z

@xwu99
Please check here, Thanks.

harborn · 2023-08-17T10:35:48Z

Also addressed the previous comments:

Added 3 UTs: test_xpu_ids, test_local_mode_xpus, test_disable_xpu_devices.
Renamed RAY_ACCELERATOR to RAY_EXPERIMENTAL_ACCELERATOR_TYPE.
Unified two environment variables into one, ONEAPI_DEVICE_SELECTOR, which is similar to CUDA_VISIBLE_DEVICES, and removed XPU_VISIBLE_DEVICES, which is not actually used in IPEX 1.13 and 2.0.
Added some comments in the code.
Only one type of accelerator can be used in a Ray cluster, even if the cluster contains more than one accelerator type.
@xwu99 has added some documentation updates.
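As a rough sketch of how ONEAPI_DEVICE_SELECTOR can play the role of CUDA_VISIBLE_DEVICES, the snippet below writes and reads a device list. The helper names are hypothetical, the Level Zero backend is an assumption, and the parser only handles plain integer ids (no wildcards or sub-devices); the PR's actual code may differ.

```python
import os
from typing import List

ONEAPI_DEVICE_SELECTOR = "ONEAPI_DEVICE_SELECTOR"
# Assumption: devices are exposed through the Level Zero backend.
BACKEND = "level_zero"


def set_visible_xpu_ids(ids: List[int]) -> str:
    """Restrict this process to the given XPU ids (hypothetical helper)."""
    value = f"{BACKEND}:{','.join(str(i) for i in ids)}"
    os.environ[ONEAPI_DEVICE_SELECTOR] = value
    return value


def get_visible_xpu_ids() -> List[int]:
    """Read back the ids from a value shaped like 'backend:0,1,2'."""
    value = os.environ.get(ONEAPI_DEVICE_SELECTOR, "")
    _, _, ids = value.partition(":")
    return [int(i) for i in ids.split(",") if i.strip().isdigit()]
```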

@cadedaniel

abhilash1910 · 2023-08-17T13:21:30Z

python/ray/tests/test_basic.py


+def test_disable_xpu_devices():
+    script = """
+import ray


Maybe indent the quoted script:

script= """ import ray .....

LGTM otherwise
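One way to keep the quoted script indented in the test source is `textwrap.dedent`. The script body below is a made-up placeholder, not the one from test_disable_xpu_devices.

```python
import subprocess
import sys
import textwrap

# The script is indented for readability; dedent strips the common leading
# whitespace so the child interpreter receives valid top-level code.
script = textwrap.dedent(
    """
    import os
    print(os.environ.get("ONEAPI_DEVICE_SELECTOR", "unset"))
    """
)

# Run the script in a fresh interpreter, as a driver-script test would.
result = subprocess.run(
    [sys.executable, "-c", script], capture_output=True, text=True, check=True
)
output = result.stdout.strip()
```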

abhilash1910

LGTM! Thanks.

harborn · 2023-08-18T05:44:56Z

Previous comments are in https://github.com/ray-project/ray/pull/36493

harborn · 2023-10-07T01:16:36Z

@harborn

Is there a separate channel for discussions related to further integration/development (such as slack/discord etc?)

Could you reach out to me on Ray slack? We should set up a collaboration channel.

OK, I'll reach out to you on Slack.

xwu-intel · 2023-10-10T06:59:45Z

python/ray/_private/resource_spec.py

It's better not to change the original format.

xwu-intel · 2023-10-10T07:04:53Z

python/ray/_private/resource_spec.py

Better to rephrase it like this: GPUs within a node should all be the same type, but different nodes can have different GPU types.

xwu-intel · 2023-10-10T07:05:51Z

python/ray/_private/resource_spec.py

Remove the redundant comment.

xwu-intel · 2023-10-10T07:14:59Z

python/ray/_private/utils.py

The long description should move up to the first paragraph.

xwu-intel · 2023-10-10T07:41:44Z

python/ray/_private/utils.py

The block above can be removed, since ONEAPI_DEVICE_SELECTOR is already applied to dpctl.

jjyao

Could you create a test_intel_gpu.py file and add some tests? You can use test_tpu.py as an example.

python/ray/_private/accelerators/intel_gpu.py

python/ray/util/accelerators/accelerators.py

jjyao · 2023-10-19T17:15:45Z

Lint failed:

python/ray/_private/accelerators/intel_gpu.py:1:1: F401 're' imported but unused
python/ray/_private/accelerators/intel_gpu.py:3:1: F401 'sys' imported but unused
python/ray/_private/accelerators/intel_gpu.py:5:1: F401 'subprocess' imported but unused
python/ray/_private/accelerators/intel_gpu.py:6:1: F401 'importlib' imported but unused
python/ray/_private/accelerators/intel_gpu.py:55:18: E711 comparison to None should be 'if cond is not None:'

jjyao · 2023-10-20T15:07:05Z

python/ray/tests/accelerators/test_intel_gpu.py

This won't test anything. Since we didn't mock IntelGPUAcceleratorManager.get_current_node_num_accelerators, both nodes will have Nvidia GPUs.
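A minimal sketch of the kind of mocking meant here, using `unittest.mock.patch.object`. `FakeIntelGPUManager` is a stand-in defined only for this example; a real test would patch the Ray class named above.

```python
from unittest.mock import patch


class FakeIntelGPUManager:
    """Stand-in for IntelGPUAcceleratorManager; not the real Ray class."""

    @staticmethod
    def get_current_node_num_accelerators() -> int:
        return 0  # the test machine has no Intel GPU


def count_intel_gpus() -> int:
    # Code under test looks the method up on the class at call time.
    return FakeIntelGPUManager.get_current_node_num_accelerators()


# Inside the block the node appears to have 4 Intel GPUs.
with patch.object(
    FakeIntelGPUManager, "get_current_node_num_accelerators", return_value=4
):
    patched_count = count_intel_gpus()

# Outside the block the original method is restored.
unpatched_count = count_intel_gpus()
```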

python/ray/tests/accelerators/test_intel_gpu.py

python/ray/_private/accelerators/intel_gpu.py

python/ray/tests/accelerators/test_intel_gpu.py

Signed-off-by: harborn <[email protected]>

jjyao · 2023-10-24T19:53:32Z

Tests failed on windows

================================== FAILURES ===================================
___________________ test_get_current_node_num_accelerators ____________________

    def test_get_current_node_num_accelerators():
        old_dpctl = None
        if "dpctl" in sys.modules:
            old_dpctl = sys.modules["dpctl"]

>       sys.modules["dpctl"] = __import__("mock_dpctl_1")
E       ModuleNotFoundError: No module named 'mock_dpctl_1'

\\?\C:\Users\ContainerAdministrator\AppData\Local\Temp\Bazel.runfiles_s1j40za1\runfiles\com_github_ray_project_ray\python\ray\tests\accelerators\test_intel_gpu.py:38: ModuleNotFoundError
___________________ test_get_current_node_accelerator_type ____________________

    def test_get_current_node_accelerator_type():
        old_dpctl = None
        if "dpctl" in sys.modules:
            old_dpctl = sys.modules["dpctl"]

>       sys.modules["dpctl"] = __import__("mock_dpctl_1")
E       ModuleNotFoundError: No module named 'mock_dpctl_1'

\\?\C:\Users\ContainerAdministrator\AppData\Local\Temp\Bazel.runfiles_s1j40za1\runfiles\com_github_ray_project_ray\python\ray\tests\accelerators\test_intel_gpu.py:53: ModuleNotFoundError
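One possible way to avoid importing a mock module by file name is to build the fake in memory and swap it into `sys.modules` with `patch.dict`. This is a sketch under assumptions, not what the PR does: the fake module and its `get_num_devices` attribute are stand-ins defined only for this example.

```python
import sys
import types
from unittest.mock import patch

# In-memory stand-in for dpctl, so nothing has to be importable from disk.
fake_dpctl = types.ModuleType("dpctl")
fake_dpctl.get_num_devices = lambda *args, **kwargs: 6  # assumed attribute


def num_xpus() -> int:
    import dpctl  # resolved through sys.modules at call time

    return dpctl.get_num_devices()


# patch.dict restores the previous sys.modules entry (or removes the fake)
# when the block exits, replacing the manual old_dpctl bookkeeping.
with patch.dict(sys.modules, {"dpctl": fake_dpctl}):
    mocked = num_xpus()
```

A platform-specific case can also be skipped with `pytest.mark.skipif(sys.platform == "win32", reason=...)`.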

Signed-off-by: harborn <[email protected]>

harborn changed the title ~~Support Intel GPU~~ [Core] Support Intel GPU Aug 17, 2023

xwu-intel mentioned this pull request Aug 17, 2023

[WIP][DOC] Intel GPU on Ray Support #38547

Closed

8 tasks

abhilash1910 reviewed Aug 17, 2023

abhilash1910 approved these changes Aug 17, 2023

harborn requested review from a team, Yard1, amogkam, architkulkarni, aslonnie, bveeramani, c21, edoakes, ericl, gjoliver, krfricke, matthewdeng, maxpumperla, raulchen, richardliaw, scottjlee, scv119, shrekris-anyscale, sihanwang41, xwjiang2010 and zcin as code owners August 18, 2023 06:33

harborn force-pushed the ray_intel_gpu branch from 52ab210 to d09cda3 Compare October 8, 2023 06:49

xwu-intel reviewed Oct 10, 2023

harborn closed this Oct 19, 2023

harborn force-pushed the ray_intel_gpu branch from d09cda3 to b3c1424 Compare October 19, 2023 08:54

harborn reopened this Oct 19, 2023

jjyao reviewed Oct 19, 2023

harborn force-pushed the ray_intel_gpu branch from 328340d to ca007ab Compare October 20, 2023 00:39

jjyao reviewed Oct 20, 2023

harborn force-pushed the ray_intel_gpu branch from da1feca to ce3e7e6 Compare October 24, 2023 05:24

jjyao reviewed Oct 24, 2023

jjyao reviewed Oct 24, 2023

jjyao approved these changes Oct 24, 2023

jjyao reviewed Oct 24, 2023

harborn added 9 commits October 24, 2023 23:53

Support Intel GPU

66e382a

Signed-off-by: harborn <[email protected]>

fix and add ut

ab2407b

Signed-off-by: harborn <[email protected]>

fix

b86e0de

Signed-off-by: harborn <[email protected]>

add case

8a6ea17

Signed-off-by: harborn <[email protected]>

add more test cases

43247b7

Signed-off-by: harborn <[email protected]>

add mock files

02a8380

Signed-off-by: harborn <[email protected]>

fix format

bd173ab

Signed-off-by: harborn <[email protected]>

fix

83a15f9

Signed-off-by: harborn <[email protected]>

fix

9c417fd

Signed-off-by: harborn <[email protected]>

harborn force-pushed the ray_intel_gpu branch from 07ecee3 to 9c417fd Compare October 24, 2023 15:53

jjyao mentioned this pull request Oct 24, 2023

Add Apple silicon GPU(mps) support to ray #38464

Open

8 tasks

skip case on Windows

3bbc10a

Signed-off-by: harborn <[email protected]>

jjyao merged commit 8cfc894 into ray-project:master Oct 25, 2023

rickyyx mentioned this pull request Dec 7, 2023

[Core] Perf regression #41695

Closed
