csrc/cpu/runtime/CPUPool.cpp fail for more than 1024 cpus #824

Open
hpcpony opened this issue May 11, 2025 · 2 comments

hpcpony commented May 11, 2025

Describe the bug

intel-extension-for-pytorch/csrc/cpu/runtime/CPUPool.cpp is written such that it will fail on systems with more than 1024 cpus. For example:

    cpu_set_t main_thread_pre_set;
    CPU_ZERO(&main_thread_pre_set);
    if (sched_getaffinity(0, sizeof(cpu_set_t), &main_thread_pre_set) != 0) {
      throw std::runtime_error("Fail to get the thread affinity information");
    }

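The fixed-size cpu_set_t only covers CPU_SETSIZE (1024) CPUs, so this needs to be done using dynamically sized CPU sets (CPU_ALLOC and the CPU_*_S macros; see man CPU_SET(3)).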

This is the first place to choke, but I suspect there may be additional code that needs correction.
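For illustration, here is a minimal sketch of what a dynamically sized query could look like on Linux with glibc. The names `DynamicCpuSet` and `get_thread_affinity` are hypothetical and not taken from CPUPool.cpp, and the upper bound on the retry loop is an arbitrary assumption. It starts at CPU_SETSIZE and doubles the set size whenever `sched_getaffinity` reports EINVAL, which is the error returned when the supplied mask is smaller than the kernel's.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_*_S macros; g++ usually defines it already
#endif
#include <sched.h>

#include <cerrno>
#include <cstddef>
#include <memory>
#include <new>
#include <stdexcept>
#include <utility>

// Hypothetical helper types, not part of intel-extension-for-pytorch.
struct CpuSetDeleter {
  void operator()(cpu_set_t* set) const { CPU_FREE(set); }
};
using CpuSetPtr = std::unique_ptr<cpu_set_t, CpuSetDeleter>;

struct DynamicCpuSet {
  CpuSetPtr set;      // heap-allocated mask from CPU_ALLOC
  std::size_t bytes;  // size to pass to the CPU_*_S macros
};

// Query the calling thread's affinity without assuming <= 1024 CPUs.
DynamicCpuSet get_thread_affinity() {
  // Assumption: 2^20 CPUs is a generous, arbitrary cap for the retry loop.
  for (int ncpus = CPU_SETSIZE; ncpus <= (1 << 20); ncpus *= 2) {
    CpuSetPtr set(CPU_ALLOC(ncpus));
    if (!set) {
      throw std::bad_alloc();
    }
    const std::size_t bytes = CPU_ALLOC_SIZE(ncpus);
    CPU_ZERO_S(bytes, set.get());
    if (sched_getaffinity(0, bytes, set.get()) == 0) {
      return {std::move(set), bytes};
    }
    if (errno != EINVAL) {
      break;  // a different failure; growing the mask will not help
    }
  }
  throw std::runtime_error("Fail to get the thread affinity information");
}
```

Callers would then need to use the `_S` variants everywhere the set is touched, for example `CPU_COUNT_S(result.bytes, result.set.get())` or `CPU_ISSET_S(cpu, result.bytes, result.set.get())`, and pass `result.bytes` rather than `sizeof(cpu_set_t)` to `sched_setaffinity` or `pthread_setaffinity_np`.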

Versions

v2.7.0+cpu (and others?)

@huiyan2021 huiyan2021 self-assigned this May 12, 2025
@huiyan2021

Hi @hpcpony, thanks for reporting this issue. We are evaluating it. By the way, which hardware platform (CPU model, cloud, etc.) are you using that has more than 1024 CPUs?

hpcpony commented May 12, 2025

HPE makes some quad socket machines that scale up to more than 1024 processors when hyperthreading is turned on (*).

https://buy.hpe.com/us/en/compute/mission-critical-x86-servers/superdome-flex-servers/c/1010550752
https://buy.hpe.com/us/en/compute/mission-critical-x86-servers/compute-scale-up-servers/compute-scale-up-servers/hpe-compute-scale-up-server-3200/p/1014774076

Admittedly there probably aren't many of these out there, so I don't think this is a time-critical bug fix, but if there comes a point where it's easy to fix the code, it would probably be worth it. The way things are going with core counts, it's probably going to become a more general problem in the not too distant future.

(*) getting my sysadmins to turn off hyperthreading has not been successful ;^(
