Describe the bug
intel-extension-for-pytorch/csrc/cpu/runtime/CPUPool.cpp uses fixed-size `cpu_set_t` structures, so it will fail on systems with more than 1024 CPUs. For example:
```cpp
cpu_set_t main_thread_pre_set;
CPU_ZERO(&main_thread_pre_set);
if (sched_getaffinity(0, sizeof(cpu_set_t), &main_thread_pre_set) != 0) {
  throw std::runtime_error("Fail to get the thread affinity information");
}
```
This needs to be done with dynamically sized CPU sets (see man CPU_SET(3)). This is the first place it chokes, but I suspect there is additional code that needs the same correction.
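For reference, a minimal sketch of the dynamically sized approach described in CPU_SET(3). The function name and retry structure below are illustrative only, not taken from the repository:

```cpp
// Illustrative sketch: query the calling thread's affinity with a
// dynamically sized CPU set, growing the set until the kernel's mask fits.
// CPU_ALLOC and friends are glibc extensions (require _GNU_SOURCE, which
// g++ defines by default).
#include <sched.h>
#include <cerrno>
#include <stdexcept>

cpu_set_t* get_affinity_dynamic(size_t& set_size_out) {
  int num_cpus = 1024;  // initial guess; doubled until sched_getaffinity succeeds
  for (;;) {
    cpu_set_t* set = CPU_ALLOC(num_cpus);
    if (set == nullptr) {
      throw std::runtime_error("CPU_ALLOC failed");
    }
    const size_t set_size = CPU_ALLOC_SIZE(num_cpus);
    CPU_ZERO_S(set_size, set);
    if (sched_getaffinity(0, set_size, set) == 0) {
      set_size_out = set_size;
      return set;  // caller releases with CPU_FREE(set)
    }
    CPU_FREE(set);
    if (errno != EINVAL) {
      throw std::runtime_error("Fail to get the thread affinity information");
    }
    num_cpus *= 2;  // set smaller than the kernel's affinity mask; retry larger
  }
}
```

The same dynamic allocation would presumably be needed anywhere else the runtime assumes `sizeof(cpu_set_t)` covers all CPUs.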
Versions
v2.7.0+cpu (and others?)
Activity
huiyan2021 commented on May 12, 2025
Hi @hpcpony, thanks for reporting this issue. We are evaluating. By the way, could you share what HW platform (CPU processors, cloud, etc.) you are using with more than 1024 CPUs?
hpcpony commented on May 12, 2025
HPE makes some quad-socket machines that scale to more than 1024 logical processors when hyperthreading is turned on (*).
https://buy.hpe.com/us/en/compute/mission-critical-x86-servers/superdome-flex-servers/c/1010550752
https://buy.hpe.com/us/en/compute/mission-critical-x86-servers/compute-scale-up-servers/compute-scale-up-servers/hpe-compute-scale-up-server-3200/p/1014774076
Admittedly there probably aren't many of these out there, so I don't think this is a time-critical bug fix, but if there comes a point where it's easy to fix the code it would probably be worth it. The way things are going with core counts, it's probably going to become a more general problem in the not too distant future.
(*) getting my sysadmins to turn off hyperthreading has not been successful ;^(