Issue with P2P on PVC #810

Open
BenBrock opened this issue Apr 16, 2025 · 13 comments

Comments

BenBrock commented Apr 16, 2025

Describe the issue

I'm trying to set up IPEX on a system with 8 PVC tiles and am having difficulty getting things working. Right now I'm just running some sanity tests to make sure the setup is correct, and a basic P2P test is failing.

Steps Taken So Far

  • I installed IPEX using the instructions linked in the repo. The install appears successful.
  • After sourcing oneAPI with source /opt/intel/oneapi/setvars.sh and also adding the pip environment's lib folder to my LD_LIBRARY_PATH, the sanity test from the install instructions completes successfully, with one warning (see below).

Simple P2P Check

I then tried to run a simple P2P check to measure bandwidth between devices:

#!/usr/bin/env python

import os
import sys
import time
import torch
import torch.distributed as dist
import intel_extension_for_pytorch as ipex
import oneccl_bindings_for_pytorch as torch_ccl

def get_device():
    return 'xpu:%s' % (dist.get_rank() % torch.xpu.device_count(),)

def get_rank_from_env():
    if 'PMI_RANK' in os.environ:
        return os.environ['PMI_RANK']
    elif 'PMIX_RANK' in os.environ:
        return os.environ['PMIX_RANK']
    elif 'RANK' in os.environ:
        return os.environ['RANK']
    else:
        raise Exception('Error: neither \'PMI_RANK\' nor \'RANK\' environment variable found. Are you invoking this script using mpirun or torchrun?')

def get_nprocs_from_env():
    if 'PMI_SIZE' in os.environ:
        return os.environ['PMI_SIZE']
    elif 'WORLD_SIZE' in os.environ:
        return os.environ['WORLD_SIZE']
    else:
        raise Exception('Error: neither \'PMI_SIZE\' nor \'WORLD_SIZE\' environment variable found. Are you invoking this script using mpirun or torchrun?')

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = get_rank_from_env()
os.environ["WORLD_SIZE"] = get_nprocs_from_env()
dist.init_process_group(backend="ccl", init_method="env://")

nbytes = 1024*1024*1024

n = nbytes // 4
nbytes = n * 4
gbytes = nbytes * 1e-9

print('Process %s/%s using device %s' % (dist.get_rank(), dist.get_world_size(), get_device()))

send_tensor = torch.zeros(n, dtype=torch.float32, device=get_device())
recv_tensor = torch.zeros(n, dtype=torch.float32, device=get_device())

# Perform an all_reduce to initialize communicators and such.
dist.all_reduce(send_tensor)

if dist.get_rank() == 0:
    print('Benchmarking P2P...')

for send_rank in range(dist.get_world_size()):
    for recv_rank in range(dist.get_world_size()):
        if send_rank != recv_rank:
            dist.barrier()

            if dist.get_rank() == send_rank:
                print('Send %s -> %s' % (send_rank, recv_rank))

            dist.barrier()
            begin = time.time()

            reqs = []

            if dist.get_rank() == send_rank:
                req = dist.isend(send_tensor, recv_rank)
                reqs.append(req)

            if dist.get_rank() == recv_rank:
                req = dist.irecv(recv_tensor, send_rank)
                reqs.append(req)

            for req in reqs:
                req.wait()

            end = time.time()
            duration = end - begin

            if dist.get_rank() == recv_rank:
                print('%s -> %s took %s s, achieved %s GB/s' % (send_rank, recv_rank, duration, gbytes / duration))

The output is as follows (removing the ATen warning previously mentioned):

(ipex-new) bbrock@hedp030:~/src/ai/pytorch-gemm/ccl> cat out.dat
My guessed rank = 4
My guessed rank = 0
My guessed rank = 1
My guessed rank = 2
My guessed rank = 3
My guessed rank = 5
My guessed rank = 6
My guessed rank = 7
Process 6/8 using device xpu:6
Process 4/8 using device xpu:4
Process 5/8 using device xpu:5
Process 3/8 using device xpu:3
Process 7/8 using device xpu:7
Process 2/8 using device xpu:2
Process 0/8 using device xpu:0
Process 1/8 using device xpu:1
Benchmarking P2P...
Send 0 -> 1
0 -> 1 took 0.30544233322143555 s, achieved 3.51536675573249 GB/s
Send 0 -> 2

There are two problems:

  1. It blocks indefinitely on the send from 0 -> 2.

  2. The bandwidth is much lower than expected: it reports 3.5 GB/s, when it should be >150 GB/s over MDFI between tiles 0 and 1 (Xe Link would be around 20 GB/s); see the timing sketch below.
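
To rule out host-side timing as a factor in the bandwidth number, here is a minimal variant of the inner timing step that synchronizes the XPU device before and after the transfer. This is only a sketch of the loop above, assuming torch.xpu.synchronize() is available in this 2.6 XPU build.

import time

import torch
import torch.distributed as dist

def timed_send_recv(send_rank, recv_rank, send_tensor, recv_tensor):
    # Time one P2P transfer; synchronizing the device means the host timer
    # reflects device-side completion, not just the host-side wait() returning.
    dist.barrier()
    torch.xpu.synchronize()
    begin = time.time()

    if dist.get_rank() == send_rank:
        dist.isend(send_tensor, recv_rank).wait()
    if dist.get_rank() == recv_rank:
        dist.irecv(recv_tensor, send_rank).wait()

    torch.xpu.synchronize()
    return time.time() - begin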

My GPUs on the system appear to be configured correctly:

(ipex-new) bbrock@hedp030:~/src/ai/pytorch-gemm/ccl> sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) Platinum 8480+ OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:4] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:5] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:6] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:7] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:8] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[opencl:gpu:9] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [25.05.32567]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:4] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:5] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:6] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
[ext_oneapi_level_zero:gpu:7] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.6 [1.3.32567]
(ipex-new) bbrock@hedp030:~/src/ai/pytorch-gemm/ccl> xpu-smi topology -m
         GPU 0/0  GPU 0/1  GPU 1/0  GPU 1/1  GPU 2/0  GPU 2/1  GPU 3/0  GPU 3/1  CPU Affinity
GPU 0/0  S        MDF      XL*      XL8      XL8      XL*      XL8      XL*      0-55,112-167
GPU 0/1  MDF      S        XL8      XL*      XL*      XL8      XL*      XL8      0-55,112-167
GPU 1/0  XL*      XL8      S        MDF      XL*      XL8      XL*      XL8      0-55,112-167
GPU 1/1  XL8      XL*      MDF      S        XL8      XL*      XL8      XL*      0-55,112-167
GPU 2/0  XL8      XL*      XL*      XL8      S        MDF      XL8      XL*      56-111,168-223
GPU 2/1  XL*      XL8      XL8      XL*      MDF      S        XL*      XL8      56-111,168-223
GPU 3/0  XL8      XL*      XL*      XL8      XL8      XL*      S        MDF      56-111,168-223
GPU 3/1  XL*      XL8      XL8      XL*      XL*      XL8      MDF      S        56-111,168-223

Please advise on what to do. I get the same results whether I use the mpirun bundled with the pip install or the system's Intel MPI.

Sanity Check Warning

The warning produced by the sanity check after install is about ATen op registration. In another issue someone was told this warning can be ignored, so I'm ignoring it and treating the install as successful.

(ipex-new) bbrock@hedp030:~/src/ai/pytorch-gemm/ccl> !pyth
python3 -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"
[W416 15:51:01.385586326 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
2.6.0+xpu
2.6.10+xpu
[0]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[1]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[2]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[3]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[4]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[5]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[6]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[7]: _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', driver_version='1.6.32567+18', total_memory=65536MB, max_compute_units=512, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
[W416 15:51:04.037020987 OperatorEntry.cpp:154] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_validate_compressed_sparse_indices(bool is_crow, Tensor compressed_idx, Tensor plain_idx, int cdim, int dim, int nnz) -> ()
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /pytorch/build/aten/src/ATen/RegisterCPU.cpp:30477
       new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:468 (function operator())
@jingxu10
Contributor

Could you check if this torch-ccl demo script works?
https://github.com/intel/torch-ccl/tree/master/demo

@BenBrock
Author

The demo script seems to work:

(ipex-new) bbrock@hedp030:~/src/ai/pytorch-gemm/ccl> mpirun -n 8 -l python demo.py --device xpu
[0] [rank0]:[W416 22:45:40.723224165 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[2] [rank2]:[W416 22:45:40.723223113 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[3] [rank3]:[W416 22:45:40.723229753 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[4] [rank4]:[W416 22:45:40.723227604 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[5] [rank5]:[W416 22:45:40.723257697 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[7] [rank7]:[W416 22:45:40.726658140 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[1] [rank1]:[W416 22:45:40.730492019 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[6] [rank6]:[W416 22:45:40.731073464 reducer.cpp:69] Warning: measureDifference between two events is not supported on XPU backend! (function operator())
[2] My guessed rank = 2
[0] My guessed rank = 0
[1] My guessed rank = 1
[3] My guessed rank = 3
[4] My guessed rank = 4
[5] My guessed rank = 5
[6] My guessed rank = 6
[7] My guessed rank = 7
[6] Runing Iteration: 0 on device xpu:6
[6] Runing forward: 0 on device xpu:6
[6] Runing loss: 0 on device xpu:6
[6] Runing backward: 0 on device xpu:6
[6] Runing optim: 0 on device xpu:6
[6] Runing Iteration: 1 on device xpu:6
[6] Runing forward: 1 on device xpu:6
[6] Runing loss: 1 on device xpu:6
[6] Runing backward: 1 on device xpu:6
[6] Runing optim: 1 on device xpu:6
[6] Runing Iteration: 2 on device xpu:6
[6] Runing forward: 2 on device xpu:6
[6] Runing loss: 2 on device xpu:6
[6] Runing backward: 2 on device xpu:6
[6] Runing optim: 2 on device xpu:6
[6] Finish
[4] Runing Iteration: 0 on device xpu:4
[4] Runing forward: 0 on device xpu:4
[4] Runing loss: 0 on device xpu:4
[4] Runing backward: 0 on device xpu:4
[4] Runing optim: 0 on device xpu:4
[4] Runing Iteration: 1 on device xpu:4
[4] Runing forward: 1 on device xpu:4
[4] Runing loss: 1 on device xpu:4
[4] Runing backward: 1 on device xpu:4
[4] Runing optim: 1 on device xpu:4
[4] Runing Iteration: 2 on device xpu:4
[4] Runing forward: 2 on device xpu:4
[4] Runing loss: 2 on device xpu:4
[4] Runing backward: 2 on device xpu:4
[4] Runing optim: 2 on device xpu:4
[4] Finish
[1] Runing Iteration: 0 on device xpu:1
[1] Runing forward: 0 on device xpu:1
[1] Runing loss: 0 on device xpu:1
[1] Runing backward: 0 on device xpu:1
[1] Runing optim: 0 on device xpu:1
[1] Runing Iteration: 1 on device xpu:1
[1] Runing forward: 1 on device xpu:1
[1] Runing loss: 1 on device xpu:1
[1] Runing backward: 1 on device xpu:1
[1] Runing optim: 1 on device xpu:1
[1] Runing Iteration: 2 on device xpu:1
[1] Runing forward: 2 on device xpu:1
[1] Runing loss: 2 on device xpu:1
[1] Runing backward: 2 on device xpu:1
[1] Runing optim: 2 on device xpu:1
[1] Finish
[7] Runing Iteration: 0 on device xpu:7
[7] Runing forward: 0 on device xpu:7
[7] Runing loss: 0 on device xpu:7
[7] Runing backward: 0 on device xpu:7
[7] Runing optim: 0 on device xpu:7
[7] Runing Iteration: 1 on device xpu:7
[7] Runing forward: 1 on device xpu:7
[7] Runing loss: 1 on device xpu:7
[7] Runing backward: 1 on device xpu:7
[7] Runing optim: 1 on device xpu:7
[7] Runing Iteration: 2 on device xpu:7
[7] Runing forward: 2 on device xpu:7
[7] Runing loss: 2 on device xpu:7
[7] Runing backward: 2 on device xpu:7
[7] Runing optim: 2 on device xpu:7
[7] Finish
[3] Runing Iteration: 0 on device xpu:3
[3] Runing forward: 0 on device xpu:3
[3] Runing loss: 0 on device xpu:3
[3] Runing backward: 0 on device xpu:3
[3] Runing optim: 0 on device xpu:3
[3] Runing Iteration: 1 on device xpu:3
[3] Runing forward: 1 on device xpu:3
[3] Runing loss: 1 on device xpu:3
[3] Runing backward: 1 on device xpu:3
[3] Runing optim: 1 on device xpu:3
[3] Runing Iteration: 2 on device xpu:3
[3] Runing forward: 2 on device xpu:3
[3] Runing loss: 2 on device xpu:3
[3] Runing backward: 2 on device xpu:3
[3] Runing optim: 2 on device xpu:3
[3] Finish
[5] Runing Iteration: 0 on device xpu:5
[5] Runing forward: 0 on device xpu:5
[5] Runing loss: 0 on device xpu:5
[5] Runing backward: 0 on device xpu:5
[5] Runing optim: 0 on device xpu:5
[5] Runing Iteration: 1 on device xpu:5
[5] Runing forward: 1 on device xpu:5
[5] Runing loss: 1 on device xpu:5
[5] Runing backward: 1 on device xpu:5
[5] Runing optim: 1 on device xpu:5
[5] Runing Iteration: 2 on device xpu:5
[5] Runing forward: 2 on device xpu:5
[5] Runing loss: 2 on device xpu:5
[5] Runing backward: 2 on device xpu:5
[5] Runing optim: 2 on device xpu:5
[5] Finish
[2] Runing Iteration: 0 on device xpu:2
[2] Runing forward: 0 on device xpu:2
[2] Runing loss: 0 on device xpu:2
[2] Runing backward: 0 on device xpu:2
[2] Runing optim: 0 on device xpu:2
[2] Runing Iteration: 1 on device xpu:2
[2] Runing forward: 1 on device xpu:2
[2] Runing loss: 1 on device xpu:2
[2] Runing backward: 1 on device xpu:2
[2] Runing optim: 1 on device xpu:2
[2] Runing Iteration: 2 on device xpu:2
[2] Runing forward: 2 on device xpu:2
[2] Runing loss: 2 on device xpu:2
[2] Runing backward: 2 on device xpu:2
[2] Runing optim: 2 on device xpu:2
[2] Finish
[0] Runing Iteration: 0 on device xpu:0
[0] Runing forward: 0 on device xpu:0
[0] Runing loss: 0 on device xpu:0
[0] Runing backward: 0 on device xpu:0
[0] Runing optim: 0 on device xpu:0
[0] Runing Iteration: 1 on device xpu:0
[0] Runing forward: 1 on device xpu:0
[0] Runing loss: 1 on device xpu:0
[0] Runing backward: 1 on device xpu:0
[0] Runing optim: 1 on device xpu:0
[0] Runing Iteration: 2 on device xpu:0
[0] Runing forward: 2 on device xpu:0
[0] Runing loss: 2 on device xpu:0
[0] Runing backward: 2 on device xpu:0
[0] Runing optim: 2 on device xpu:0
[0] Finish
[4] My guessed rank = 4
[6] My guessed rank = 6
[7] My guessed rank = 7
[5] My guessed rank = 5
[1] My guessed rank = 1
[0] My guessed rank = 0
[3] My guessed rank = 3
[2] My guessed rank = 2

(I'm again leaving out the ATen warnings I mentioned previously.)

@BenBrock
Author

@jingxu10 Are there any more tests I should run to make sure things are working correctly? And are the CCL features I'm using expected to work on an 8-tile PVC system?

@jingxu10
Contributor

Probably some environment variable is misconfigured. I'm checking internally and will get back to you later this week.
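
In the meantime, a quick way to see what each rank actually has set is to dump the relevant variable families on every process. The prefixes below (CCL_, ZE_, I_MPI_, FI_) are just the ones that usually matter for oneCCL, Level Zero, Intel MPI, and libfabric; this is a sketch, not an exhaustive list.

import os

# Print the oneCCL / Level Zero / Intel MPI / libfabric related environment
# variables visible to this process, so per-rank misconfiguration is easy to spot.
for key in sorted(os.environ):
    if key.startswith(("CCL_", "ZE_", "I_MPI_", "FI_")):
        print(f"{key}={os.environ[key]}")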

@jingxu10
Contributor

torch-ccl will be deprecated in favor of the new XCCL backend in the latest PyTorch. I'm checking for BKMs and will share them with you early next week.
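
In practical terms (as the two scripts in this thread show), the torch-ccl path imports oneccl_bindings_for_pytorch and initializes with backend "ccl", while XCCL ships with PyTorch itself and needs no extra bindings import; roughly:

import torch.distributed as dist

# torch-ccl path (the script above): the bindings import registers the "ccl" backend.
#   import oneccl_bindings_for_pytorch  # noqa: F401
#   dist.init_process_group(backend="ccl", init_method="env://")

# XCCL path (PyTorch nightly): no extra import; launch under torchrun so the
# rank/world-size/rendezvous variables are set for init_process_group.
dist.init_process_group(backend="xccl")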

@BenBrock
Author

Thanks! In that case, I'd like to get up and running with XCCL, and would appreciate details on how to set it up.

@jingxu10
Contributor

We just got the XCCL backend enabled in the PyTorch nightly build. Please give it a try with the sample code below.

Env setup command:

conda create -y -n xccl_py310 python=3.10
conda activate xccl_py310
python -m pip install torch --index-url https://download.pytorch.org/whl/nightly/xpu
conda deactivate
conda activate xccl_py310
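
As a quick check before running the sample (not part of the original instructions), you can confirm the nightly wheel sees the XPU devices and the xccl backend:

import torch
import torch.distributed as dist

print(torch.__version__)           # should report a nightly +xpu build
print(torch.xpu.is_available())    # True if the XPU runtime is visible
print(torch.xpu.device_count())    # 8 on this system
# Assumes this nightly exposes the generic backend query; if it doesn't, running
# the DDP sample below under torchrun is the real test.
print(dist.is_backend_available("xccl"))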

Sample code elastic_ddp.py:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim

from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    torch.xpu.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group("xccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")
    # create model and move it to GPU with id rank
    device_id = rank % torch.xpu.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()
    dist.destroy_process_group()
    print(f"Finished running basic DDP example on rank {rank}.")

if __name__ == "__main__":
    demo_basic()

Command to launch:
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 elastic_ddp.py
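
(torchrun exports LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT to each worker, which is what demo_basic and init_process_group read. $MASTER_ADDR in the launch line is assumed to be set in the shell; on a single node the host part of --rdzv_endpoint can simply be 127.0.0.1 or localhost.)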

@BenBrock
Author

Hi @jingxu10, I don't have the MASTER_ADDR environment variable defined, and if I replace that with 127.0.0.1, the program blocks for about a minute and then fails:

(xccl_py310) bbrock@hedp017:~/src/ai/pytorch-distributed> torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$127.0.0.1:29400 elastic_ddp.py
W0523 18:43:18.889000 168305 site-packages/torch/distributed/run.py:766]
W0523 18:43:18.889000 168305 site-packages/torch/distributed/run.py:766] *****************************************
W0523 18:43:18.889000 168305 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0523 18:43:18.889000 168305 site-packages/torch/distributed/run.py:766] *****************************************
[E523 18:44:18.931488695 socket.cpp:1019] [c10d] The client socket has timed out after 60000ms while trying to connect to (27.0.0.1, 29400).
[E523 18:44:18.939377705 TCPStore.cpp:328] [c10d] TCP client failed to connect/validate to host 27.0.0.1:29400 - timed out (try=0, timeout=60000ms): The client socket has timed out after 60000ms while trying to connect to (27.0.0.1, 29400).
Exception raised from throwTimeoutError at /pytorch/torch/csrc/distributed/c10d/socket.cpp:1021 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x15363d580678 in /home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5a17f5e (0x153624152f5e in /home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x116030d (0x15361f89b30d in /home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5a6599d (0x1536241a099d in /home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x5a65b49 (0x1536241a0b49 in /home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x5a65f01 (0x1536241a0f01 in /home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x5a12a4b (0x15362414da4b in /home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) + 0x4a4 (0x153624151394 in /home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xbec755 (0x153632bd1755 in /home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0xc1faf4 (0x153632c04af4 in /home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x37e7de (0x1536323637de in /home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x13d0e6 (0x5593a3aa00e6 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #12: _PyObject_MakeTpCall + 0x2d3 (0x5593a3a990b3 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #13: <unknown function> + 0x1490b6 (0x5593a3aac0b6 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #14: PyVectorcall_Call + 0xc9 (0x5593a3aacc59 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #15: <unknown function> + 0x146cc4 (0x5593a3aa9cc4 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #16: <unknown function> + 0x1363bb (0x5593a3a993bb in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #17: <unknown function> + 0x37d36b (0x15363236236b in /home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #18: _PyObject_MakeTpCall + 0x2d3 (0x5593a3a990b3 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x5362 (0x5593a3a95332 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #20: _PyFunction_Vectorcall + 0x6c (0x5593a3aa056c in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x30c (0x5593a3a902dc in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #22: _PyFunction_Vectorcall + 0x6c (0x5593a3aa056c in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x30c (0x5593a3a902dc in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #24: _PyFunction_Vectorcall + 0x6c (0x5593a3aa056c in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x30c (0x5593a3a902dc in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #26: _PyFunction_Vectorcall + 0x6c (0x5593a3aa056c in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x700 (0x5593a3a906d0 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #28: _PyFunction_Vectorcall + 0x6c (0x5593a3aa056c in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x498f (0x5593a3a9495f in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #30: _PyFunction_Vectorcall + 0x6c (0x5593a3aa056c in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x30c (0x5593a3a902dc in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #32: _PyObject_FastCallDictTstate + 0xd0 (0x5593a3a983d0 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #33: _PyObject_Call_Prepend + 0x69 (0x5593a3aaa279 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #34: <unknown function> + 0x209249 (0x5593a3b6c249 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #35: PyObject_Call + 0x20f (0x5593a3aaca2f in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x2c2a (0x5593a3a92bfa in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #37: _PyFunction_Vectorcall + 0x6c (0x5593a3aa056c in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x30c (0x5593a3a902dc in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #39: _PyFunction_Vectorcall + 0x6c (0x5593a3aa056c in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x2c2a (0x5593a3a92bfa in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #41: _PyFunction_Vectorcall + 0x6c (0x5593a3aa056c in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x30c (0x5593a3a902dc in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #43: <unknown function> + 0x1cfd2c (0x5593a3b32d2c in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #44: PyEval_EvalCode + 0x87 (0x5593a3b32c77 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #45: <unknown function> + 0x2001da (0x5593a3b631da in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #46: <unknown function> + 0x1fb663 (0x5593a3b5e663 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #47: <unknown function> + 0x975bf (0x5593a39fa5bf in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #48: _PyRun_SimpleFileObject + 0x1bd (0x5593a3b58e9d in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #49: _PyRun_AnyFileObject + 0x44 (0x5593a3b58a34 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #50: Py_RunMain + 0x31b (0x5593a3b55d9b in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #51: Py_BytesMain + 0x37 (0x5593a3b26897 in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)
frame #52: <unknown function> + 0x40e6c (0x15363e0dde6c in /lib64/libc.so.6)
frame #53: __libc_start_main + 0x87 (0x15363e0ddf35 in /lib64/libc.so.6)
frame #54: <unknown function> + 0x1c37ae (0x5593a3b267ae in /home/bbrock/miniforge3/envs/xccl_py310/bin/python)

Traceback (most recent call last):
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 159, in _create_tcp_store
    store = TCPStore(
torch.distributed.DistNetworkError: The client socket has timed out after 60000ms while trying to connect to (27.0.0.1, 29400).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bbrock/miniforge3/envs/xccl_py310/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 242, in launch_agent
    rdzv_handler=rdzv_registry.get_rendezvous_handler(rdzv_parameters),
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 100, in get_rendezvous_handler
    return handler_registry.create_handler(params)
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/api.py", line 376, in create_handler
    handler = creator(params)
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 51, in _create_c10d_handler
    backend, store = create_backend(params)
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 257, in create_backend
    store = _create_tcp_store(params)
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 183, in _create_tcp_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.

I don't seem to be able to get it to run using mpirun either.

@jingxu10
Contributor

jingxu10 commented May 24, 2025

TCP client failed to connect/validate to host 27.0.0.1:29400
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$127.0.0.1:29400 elastic_ddp.py
Could you try again after removing the $ in the --rdzv_endpoint argument? Because of the $, the master address is being parsed incorrectly (as 27.0.0.1 instead of 127.0.0.1).

@BenBrock
Author

Ah, I missed that. Here's the output with the $ removed. It seems to have difficulty finding the correct transport?

(xccl_py310) bbrock@hedp017:~/src/ai/pytorch-distributed> torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29400 elastic_ddp.py
W0523 19:04:17.063000 168882 site-packages/torch/distributed/run.py:766]
W0523 19:04:17.063000 168882 site-packages/torch/distributed/run.py:766] *****************************************
W0523 19:04:17.063000 168882 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0523 19:04:17.063000 168882 site-packages/torch/distributed/run.py:766] *****************************************
Start running basic DDP example on rank 6.
Start running basic DDP example on rank 1.
Start running basic DDP example on rank 7.
Start running basic DDP example on rank 2.
Start running basic DDP example on rank 3.
Start running basic DDP example on rank 4.
Start running basic DDP example on rank 5.
Start running basic DDP example on rank 0.
2025:05:23-19:04:19:(168949) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:05:23-19:04:19:(168952) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:05:23-19:04:19:(168954) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:05:23-19:04:19:(168955) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:05:23-19:04:19:(168956) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:05:23-19:04:19:(168953) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:05:23-19:04:19:(168951) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:05:23-19:04:19:(168950) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
2025:05:23-19:04:19:(168949) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:05:23-19:04:19:(168952) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:05:23-19:04:19:(168956) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:05:23-19:04:19:(168955) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:05:23-19:04:19:(168953) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:05:23-19:04:19:(168951) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:05:23-19:04:19:(168950) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:05:23-19:04:19:(168954) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
[1748048660.695946664] hedp017:rank4.python: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
2025:05:23-19:04:20:(168953) |CCL_ERROR| atl_ofi_helper.cpp:1204 atl_ofi_prov_init: fi_scalable_ep(prov->domain, info, &prov->sep, nullptr)
 fails with ret: -12, strerror: Cannot allocate memory
2025:05:23-19:04:20:(168953) |CCL_ERROR| atl_ofi_helper.cpp:1242 atl_ofi_prov_init: can't init provider psm3:autoselect_one:mlx5_0;mlx5_2
2025:05:23-19:04:20:(168953) |CCL_ERROR| atl_ofi_helper.cpp:1608 atl_ofi_open_nw_provs: atl_ofi_prov_init(ctx, coord, final_provs[idx], prov, attr, pmi, ep_names[prov->idx])
 fails with status: 1
2025:05:23-19:04:20:(168953) |CCL_ERROR| atl_ofi_helper.cpp:1635 atl_ofi_open_nw_provs: can not open network providers
2025:05:23-19:04:20:(168953) |CCL_ERROR| atl_ofi.cpp:1149 open_providers: atl_ofi_open_nw_provs failed with status: 1
2025:05:23-19:04:20:(168953) |CCL_ERROR| atl_ofi.cpp:162 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
 fails with status: 1
2025:05:23-19:04:20:(168953) |CCL_ERROR| atl_ofi.cpp:237 init: can't find suitable provider
[hedp017:168953:0:168953] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
[1748048660.701595457] hedp017:rank0.python: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
[1748048660.702716485] hedp017:rank3.python: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
2025:05:23-19:04:20:(168949) |CCL_ERROR| atl_ofi_helper.cpp:1204 atl_ofi_prov_init: fi_scalable_ep(prov->domain, info, &prov->sep, nullptr)
 fails with ret: -12, strerror: Cannot allocate memory
2025:05:23-19:04:20:(168949) |CCL_ERROR| atl_ofi_helper.cpp:1242 atl_ofi_prov_init: can't init provider psm3:autoselect_one:mlx5_0;mlx5_2
2025:05:23-19:04:20:(168949) |CCL_ERROR| atl_ofi_helper.cpp:1608 atl_ofi_open_nw_provs: atl_ofi_prov_init(ctx, coord, final_provs[idx], prov, attr, pmi, ep_names[prov->idx])
 fails with status: 1
2025:05:23-19:04:20:(168949) |CCL_ERROR| atl_ofi_helper.cpp:1635 atl_ofi_open_nw_provs: can not open network providers
2025:05:23-19:04:20:(168949) |CCL_ERROR| atl_ofi.cpp:1149 open_providers: atl_ofi_open_nw_provs failed with status: 1
2025:05:23-19:04:20:(168949) |CCL_ERROR| atl_ofi.cpp:162 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
 fails with status: 1
2025:05:23-19:04:20:(168949) |CCL_ERROR| atl_ofi.cpp:237 init: can't find suitable provider
[hedp017:168949:0:168949] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
2025:05:23-19:04:20:(168952) |CCL_ERROR| atl_ofi_helper.cpp:1204 atl_ofi_prov_init: fi_scalable_ep(prov->domain, info, &prov->sep, nullptr)
 fails with ret: -12, strerror: Cannot allocate memory
2025:05:23-19:04:20:(168952) |CCL_ERROR| atl_ofi_helper.cpp:1242 atl_ofi_prov_init: can't init provider psm3:autoselect_one:mlx5_0;mlx5_2
2025:05:23-19:04:20:(168952) |CCL_ERROR| atl_ofi_helper.cpp:1608 atl_ofi_open_nw_provs: atl_ofi_prov_init(ctx, coord, final_provs[idx], prov, attr, pmi, ep_names[prov->idx])
 fails with status: 1
2025:05:23-19:04:20:(168952) |CCL_ERROR| atl_ofi_helper.cpp:1635 atl_ofi_open_nw_provs: can not open network providers
2025:05:23-19:04:20:(168952) |CCL_ERROR| atl_ofi.cpp:1149 open_providers: atl_ofi_open_nw_provs failed with status: 1
2025:05:23-19:04:20:(168952) |CCL_ERROR| atl_ofi.cpp:162 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
 fails with status: 1
2025:05:23-19:04:20:(168952) |CCL_ERROR| atl_ofi.cpp:237 init: can't find suitable provider
[hedp017:168952:0:168952] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
[1748048660.705677970] hedp017:rank2.python: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
[1748048660.706748877] hedp017:rank5.python: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
2025:05:23-19:04:20:(168951) |CCL_ERROR| atl_ofi_helper.cpp:1204 atl_ofi_prov_init: fi_scalable_ep(prov->domain, info, &prov->sep, nullptr)
 fails with ret: -12, strerror: Cannot allocate memory
2025:05:23-19:04:20:(168951) |CCL_ERROR| atl_ofi_helper.cpp:1242 atl_ofi_prov_init: can't init provider psm3:autoselect_one:mlx5_0;mlx5_2
2025:05:23-19:04:20:(168951) |CCL_ERROR| atl_ofi_helper.cpp:1608 atl_ofi_open_nw_provs: atl_ofi_prov_init(ctx, coord, final_provs[idx], prov, attr, pmi, ep_names[prov->idx])
 fails with status: 1
2025:05:23-19:04:20:(168951) |CCL_ERROR| atl_ofi_helper.cpp:1635 atl_ofi_open_nw_provs: can not open network providers
2025:05:23-19:04:20:(168951) |CCL_ERROR| atl_ofi.cpp:1149 open_providers: atl_ofi_open_nw_provs failed with status: 1
2025:05:23-19:04:20:(168951) |CCL_ERROR| atl_ofi.cpp:162 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
 fails with status: 1
2025:05:23-19:04:20:(168951) |CCL_ERROR| atl_ofi.cpp:237 init: can't find suitable provider
[hedp017:168951:0:168951] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
2025:05:23-19:04:20:(168954) |CCL_ERROR| atl_ofi_helper.cpp:1204 atl_ofi_prov_init: fi_scalable_ep(prov->domain, info, &prov->sep, nullptr)
 fails with ret: -12, strerror: Cannot allocate memory
2025:05:23-19:04:20:(168954) |CCL_ERROR| atl_ofi_helper.cpp:1242 atl_ofi_prov_init: can't init provider psm3:autoselect_one:mlx5_0;mlx5_2
2025:05:23-19:04:20:(168954) |CCL_ERROR| atl_ofi_helper.cpp:1608 atl_ofi_open_nw_provs: atl_ofi_prov_init(ctx, coord, final_provs[idx], prov, attr, pmi, ep_names[prov->idx])
 fails with status: 1
2025:05:23-19:04:20:(168954) |CCL_ERROR| atl_ofi_helper.cpp:1635 atl_ofi_open_nw_provs: can not open network providers
2025:05:23-19:04:20:(168954) |CCL_ERROR| atl_ofi.cpp:1149 open_providers: atl_ofi_open_nw_provs failed with status: 1
2025:05:23-19:04:20:(168954) |CCL_ERROR| atl_ofi.cpp:162 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
 fails with status: 1
2025:05:23-19:04:20:(168954) |CCL_ERROR| atl_ofi.cpp:237 init: can't find suitable provider
[hedp017:168954:0:168954] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
[1748048660.708161736] hedp017:rank1.python: Failed to modify UD QP to INIT on mlx5_2: Operation not permitted
2025:05:23-19:04:20:(168950) |CCL_ERROR| atl_ofi_helper.cpp:1204 atl_ofi_prov_init: fi_scalable_ep(prov->domain, info, &prov->sep, nullptr)
 fails with ret: -12, strerror: Cannot allocate memory
2025:05:23-19:04:20:(168950) |CCL_ERROR| atl_ofi_helper.cpp:1242 atl_ofi_prov_init: can't init provider psm3:autoselect_one:mlx5_0;mlx5_2
2025:05:23-19:04:20:(168950) |CCL_ERROR| atl_ofi_helper.cpp:1608 atl_ofi_open_nw_provs: atl_ofi_prov_init(ctx, coord, final_provs[idx], prov, attr, pmi, ep_names[prov->idx])
 fails with status: 1
2025:05:23-19:04:20:(168950) |CCL_ERROR| atl_ofi_helper.cpp:1635 atl_ofi_open_nw_provs: can not open network providers
2025:05:23-19:04:20:(168950) |CCL_ERROR| atl_ofi.cpp:1149 open_providers: atl_ofi_open_nw_provs failed with status: 1
2025:05:23-19:04:20:(168950) |CCL_ERROR| atl_ofi.cpp:162 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
 fails with status: 1
2025:05:23-19:04:20:(168950) |CCL_ERROR| atl_ofi.cpp:237 init: can't find suitable provider
[hedp017:168950:0:168950] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
[1748048660.725290167] hedp017:rank6.python: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
2025:05:23-19:04:20:(168955) |CCL_ERROR| atl_ofi_helper.cpp:1204 atl_ofi_prov_init: fi_scalable_ep(prov->domain, info, &prov->sep, nullptr)
 fails with ret: -12, strerror: Cannot allocate memory
2025:05:23-19:04:20:(168955) |CCL_ERROR| atl_ofi_helper.cpp:1242 atl_ofi_prov_init: can't init provider psm3:autoselect_one:mlx5_0;mlx5_2
2025:05:23-19:04:20:(168955) |CCL_ERROR| atl_ofi_helper.cpp:1608 atl_ofi_open_nw_provs: atl_ofi_prov_init(ctx, coord, final_provs[idx], prov, attr, pmi, ep_names[prov->idx])
 fails with status: 1
2025:05:23-19:04:20:(168955) |CCL_ERROR| atl_ofi_helper.cpp:1635 atl_ofi_open_nw_provs: can not open network providers
2025:05:23-19:04:20:(168955) |CCL_ERROR| atl_ofi.cpp:1149 open_providers: atl_ofi_open_nw_provs failed with status: 1
2025:05:23-19:04:20:(168955) |CCL_ERROR| atl_ofi.cpp:162 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
 fails with status: 1
2025:05:23-19:04:20:(168955) |CCL_ERROR| atl_ofi.cpp:237 init: can't find suitable provider
[hedp017:168955:0:168955] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
[1748048660.727622923] hedp017:rank7.python: Failed to modify UD QP to INIT on mlx5_2: Operation not permitted
2025:05:23-19:04:20:(168956) |CCL_ERROR| atl_ofi_helper.cpp:1204 atl_ofi_prov_init: fi_scalable_ep(prov->domain, info, &prov->sep, nullptr)
 fails with ret: -12, strerror: Cannot allocate memory
2025:05:23-19:04:20:(168956) |CCL_ERROR| atl_ofi_helper.cpp:1242 atl_ofi_prov_init: can't init provider psm3:autoselect_one:mlx5_0;mlx5_2
2025:05:23-19:04:20:(168956) |CCL_ERROR| atl_ofi_helper.cpp:1608 atl_ofi_open_nw_provs: atl_ofi_prov_init(ctx, coord, final_provs[idx], prov, attr, pmi, ep_names[prov->idx])
 fails with status: 1
2025:05:23-19:04:20:(168956) |CCL_ERROR| atl_ofi_helper.cpp:1635 atl_ofi_open_nw_provs: can not open network providers
2025:05:23-19:04:20:(168956) |CCL_ERROR| atl_ofi.cpp:1149 open_providers: atl_ofi_open_nw_provs failed with status: 1
2025:05:23-19:04:20:(168956) |CCL_ERROR| atl_ofi.cpp:162 init: open_providers(prov_env, coord, attr, base_hints, open_nw_provs, fi_version, pmi, true )
 fails with status: 1
2025:05:23-19:04:20:(168956) |CCL_ERROR| atl_ofi.cpp:237 init: can't find suitable provider
[hedp017:168956:0:168956] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid: 168953) ====
 0 0x0000000000057900 __GI___sigaction()  ???:0
 1 0x0000000001c920a7 atl_ofi_prov_destroy()  :0
 2 0x0000000001c7752b atl_ofi::finalize()  :0
 3 0x0000000001c6b4b7 atl_ofi::init()  :0
 4 0x0000000001c81fac atl_ofi_comm::init_transport()  :0
 5 0x0000000001c83523 atl_ofi_comm::atl_ofi_comm()  :0
 6 0x0000000001c5378f atl_comm_manager::create()  :0
 7 0x0000000001e095b2 ccl_comm::create()  ???:0
 8 0x0000000001e3afdf ccl::comm_selector::create_comm_impl()  ???:0
 9 0x0000000001fd32cf ccl::v1::communicator::create_communicators<ccl::v1::device, ccl::v1::context>()  ???:0
10 0x0000000001fd70f4 ccl::detail::environment::create_communicators<ccl::v1::device, ccl::v1::context>()  ???:0
11 0x0000000006799982 c10d::ProcessGroupXCCL::getXCCLComm()  ???:0
12 0x00000000067a8d40 c10d::ProcessGroupXCCL::allgather()  ???:0
13 0x00000000067c34e1 c10d::ops::(anonymous namespace)::allgather_XPU()  Register.cpp:0
14 0x00000000067cac09 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call()  :0
15 0x000000000505f7a0 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
16 0x00000000059bd706 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), void>::call()  :0
17 0x00000000059c9d9e c10d::ProcessGroup::allgather()  :0
18 0x0000000005a5101a c10d::verify_params_across_processes()  ???:0
19 0x0000000000bea2fd pybind11::detail::argument_loader<c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&>::call<void, pybind11::gil_scoped_release, torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#109}&>()  init.cpp:0
20 0x0000000000c1c894 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#109}, void, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#109}&&, void (*)(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
21 0x000000000037e7de pybind11::cpp_function::dispatcher()  :0
22 0x000000000013d0e6 cfunction_call()  /usr/local/src/conda/python-3.10.17/Objects/methodobject.c:543
23 0x000000000013d0e6 _Py_CheckFunctionResult()  /usr/local/src/conda/python-3.10.17/Objects/call.c:39
24 0x000000000013d0e6 cfunction_call()  /usr/local/src/conda/python-3.10.17/Objects/methodobject.c:554
25 0x00000000001360b3 _PyObject_MakeTpCall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:215
26 0x00000000001321b6 _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:112
27 0x00000000001321b6 _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:99
28 0x00000000001321b6 PyObject_Vectorcall()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:123
29 0x00000000001321b6 call_function()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5893
30 0x00000000001321b6 _PyEval_EvalFrameDefault()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:4181
31 0x000000000013d56c _PyEval_EvalFrame()  /usr/local/src/conda/python-3.10.17/Include/internal/pycore_ceval.h:46
32 0x000000000013d56c _PyEval_Vector()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5067
33 0x000000000013d56c _PyFunction_Vectorcall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:342
34 0x000000000012d2dc _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:114
35 0x000000000012d2dc _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:115
36 0x000000000012d2dc PyObject_Vectorcall()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:123
37 0x000000000012d2dc call_function()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5893
38 0x000000000012d2dc _PyEval_EvalFrameDefault()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:4213
39 0x000000000013d56c _PyEval_EvalFrame()  /usr/local/src/conda/python-3.10.17/Include/internal/pycore_ceval.h:46
40 0x000000000013d56c _PyEval_Vector()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5067
41 0x000000000013d56c _PyFunction_Vectorcall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:342
42 0x0000000000135487 _PyObject_FastCallDictTstate()  /usr/local/src/conda/python-3.10.17/Objects/call.c:153
43 0x0000000000135487 _PyObject_FastCallDictTstate()  /usr/local/src/conda/python-3.10.17/Objects/call.c:155
44 0x0000000000146799 _PyObject_Call_Prepend()  /usr/local/src/conda/python-3.10.17/Objects/call.c:431
45 0x0000000000146799 slot_tp_init()  /usr/local/src/conda/python-3.10.17/Objects/typeobject.c:7734
46 0x00000000001360cb type_call()  /usr/local/src/conda/python-3.10.17/Objects/typeobject.c:1135
47 0x00000000001360cb _PyObject_MakeTpCall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:215
48 0x0000000000132332 _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:112
49 0x0000000000132332 _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:99
50 0x0000000000132332 PyObject_Vectorcall()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:123
51 0x0000000000132332 call_function()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5893
52 0x0000000000132332 _PyEval_EvalFrameDefault()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:4231
53 0x000000000013d56c _PyEval_EvalFrame()  /usr/local/src/conda/python-3.10.17/Include/internal/pycore_ceval.h:46
54 0x000000000013d56c _PyEval_Vector()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5067
55 0x000000000013d56c _PyFunction_Vectorcall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:342
56 0x000000000012d2dc _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:114
57 0x000000000012d2dc _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:115
=================================
==== backtrace (tid: 168954) ====
 0 0x0000000000057900 __GI___sigaction()  ???:0
 1 0x0000000001c920a7 atl_ofi_prov_destroy()  :0
 2 0x0000000001c7752b atl_ofi::finalize()  :0
 3 0x0000000001c6b4b7 atl_ofi::init()  :0
 4 0x0000000001c81fac atl_ofi_comm::init_transport()  :0
 5 0x0000000001c83523 atl_ofi_comm::atl_ofi_comm()  :0
 6 0x0000000001c5378f atl_comm_manager::create()  :0
 7 0x0000000001e095b2 ccl_comm::create()  ???:0
 8 0x0000000001e3afdf ccl::comm_selector::create_comm_impl()  ???:0
 9 0x0000000001fd32cf ccl::v1::communicator::create_communicators<ccl::v1::device, ccl::v1::context>()  ???:0
10 0x0000000001fd70f4 ccl::detail::environment::create_communicators<ccl::v1::device, ccl::v1::context>()  ???:0
11 0x0000000006799982 c10d::ProcessGroupXCCL::getXCCLComm()  ???:0
12 0x00000000067a8d40 c10d::ProcessGroupXCCL::allgather()  ???:0
13 0x00000000067c34e1 c10d::ops::(anonymous namespace)::allgather_XPU()  Register.cpp:0
14 0x00000000067cac09 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call()  :0
15 0x000000000505f7a0 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
16 0x00000000059bd706 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), void>::call()  :0
17 0x00000000059c9d9e c10d::ProcessGroup::allgather()  :0
18 0x0000000005a5101a c10d::verify_params_across_processes()  ???:0
19 0x0000000000bea2fd pybind11::detail::argument_loader<c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&>::call<void, pybind11::gil_scoped_release, torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#109}&>()  init.cpp:0
20 0x0000000000c1c894 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#109}, void, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#109}&&, void (*)(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
21 0x000000000037e7de pybind11::cpp_function::dispatcher()  :0
22 0x000000000013d0e6 cfunction_call()  /usr/local/src/conda/python-3.10.17/Objects/methodobject.c:543
23 0x000000000013d0e6 _Py_CheckFunctionResult()  /usr/local/src/conda/python-3.10.17/Objects/call.c:39
24 0x000000000013d0e6 cfunction_call()  /usr/local/src/conda/python-3.10.17/Objects/methodobject.c:554
25 0x00000000001360b3 _PyObject_MakeTpCall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:215
26 0x00000000001321b6 _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:112
27 0x00000000001321b6 _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:99
28 0x00000000001321b6 PyObject_Vectorcall()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:123
29 0x00000000001321b6 call_function()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5893
30 0x00000000001321b6 _PyEval_EvalFrameDefault()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:4181
31 0x000000000013d56c _PyEval_EvalFrame()  /usr/local/src/conda/python-3.10.17/Include/internal/pycore_ceval.h:46
32 0x000000000013d56c _PyEval_Vector()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5067
33 0x000000000013d56c _PyFunction_Vectorcall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:342
34 0x000000000012d2dc _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:114
35 0x000000000012d2dc _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:115
36 0x000000000012d2dc PyObject_Vectorcall()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:123
37 0x000000000012d2dc call_function()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5893
38 0x000000000012d2dc _PyEval_EvalFrameDefault()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:4213
39 0x000000000013d56c _PyEval_EvalFrame()  /usr/local/src/conda/python-3.10.17/Include/internal/pycore_ceval.h:46
40 0x000000000013d56c _PyEval_Vector()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5067
41 0x000000000013d56c _PyFunction_Vectorcall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:342
42 0x0000000000135487 _PyObject_FastCallDictTstate()  /usr/local/src/conda/python-3.10.17/Objects/call.c:153
43 0x0000000000135487 _PyObject_FastCallDictTstate()  /usr/local/src/conda/python-3.10.17/Objects/call.c:155
44 0x0000000000146799 _PyObject_Call_Prepend()  /usr/local/src/conda/python-3.10.17/Objects/call.c:431
45 0x0000000000146799 slot_tp_init()  /usr/local/src/conda/python-3.10.17/Objects/typeobject.c:7734
46 0x00000000001360cb type_call()  /usr/local/src/conda/python-3.10.17/Objects/typeobject.c:1135
47 0x00000000001360cb _PyObject_MakeTpCall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:215
48 0x0000000000132332 _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:112
49 0x0000000000132332 _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:99
50 0x0000000000132332 PyObject_Vectorcall()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:123
51 0x0000000000132332 call_function()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5893
52 0x0000000000132332 _PyEval_EvalFrameDefault()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:4231
53 0x000000000013d56c _PyEval_EvalFrame()  /usr/local/src/conda/python-3.10.17/Include/internal/pycore_ceval.h:46
54 0x000000000013d56c _PyEval_Vector()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5067
55 0x000000000013d56c _PyFunction_Vectorcall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:342
56 0x000000000012d2dc _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:114
57 0x000000000012d2dc _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:115
=================================
==== backtrace (tid: 168950) ====
 0 0x0000000000057900 __GI___sigaction()  ???:0
 1 0x0000000001c920a7 atl_ofi_prov_destroy()  :0
 2 0x0000000001c7752b atl_ofi::finalize()  :0
 3 0x0000000001c6b4b7 atl_ofi::init()  :0
 4 0x0000000001c81fac atl_ofi_comm::init_transport()  :0
 5 0x0000000001c83523 atl_ofi_comm::atl_ofi_comm()  :0
 6 0x0000000001c5378f atl_comm_manager::create()  :0
 7 0x0000000001e095b2 ccl_comm::create()  ???:0
 8 0x0000000001e3afdf ccl::comm_selector::create_comm_impl()  ???:0
 9 0x0000000001fd32cf ccl::v1::communicator::create_communicators<ccl::v1::device, ccl::v1::context>()  ???:0
10 0x0000000001fd70f4 ccl::detail::environment::create_communicators<ccl::v1::device, ccl::v1::context>()  ???:0
11 0x0000000006799982 c10d::ProcessGroupXCCL::getXCCLComm()  ???:0
12 0x00000000067a8d40 c10d::ProcessGroupXCCL::allgather()  ???:0
13 0x00000000067c34e1 c10d::ops::(anonymous namespace)::allgather_XPU()  Register.cpp:0
14 0x00000000067cac09 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call()  :0
15 0x000000000505f7a0 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
16 0x00000000059bd706 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), void>::call()  :0
17 0x00000000059c9d9e c10d::ProcessGroup::allgather()  :0
18 0x0000000005a5101a c10d::verify_params_across_processes()  ???:0
19 0x0000000000bea2fd pybind11::detail::argument_loader<c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&>::call<void, pybind11::gil_scoped_release, torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#109}&>()  init.cpp:0
20 0x0000000000c1c894 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#109}, void, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#109}&&, void (*)(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
21 0x000000000037e7de pybind11::cpp_function::dispatcher()  :0
22 0x000000000013d0e6 cfunction_call()  /usr/local/src/conda/python-3.10.17/Objects/methodobject.c:543
23 0x000000000013d0e6 _Py_CheckFunctionResult()  /usr/local/src/conda/python-3.10.17/Objects/call.c:39
24 0x000000000013d0e6 cfunction_call()  /usr/local/src/conda/python-3.10.17/Objects/methodobject.c:554
25 0x00000000001360b3 _PyObject_MakeTpCall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:215
26 0x00000000001321b6 _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:112
27 0x00000000001321b6 _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:99
28 0x00000000001321b6 PyObject_Vectorcall()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:123
29 0x00000000001321b6 call_function()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5893
30 0x00000000001321b6 _PyEval_EvalFrameDefault()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:4181
31 0x000000000013d56c _PyEval_EvalFrame()  /usr/local/src/conda/python-3.10.17/Include/internal/pycore_ceval.h:46
32 0x000000000013d56c _PyEval_Vector()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5067
33 0x000000000013d56c _PyFunction_Vectorcall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:342
34 0x000000000012d2dc _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:114
35 0x000000000012d2dc _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:115
36 0x000000000012d2dc PyObject_Vectorcall()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:123
37 0x000000000012d2dc call_function()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5893
38 0x000000000012d2dc _PyEval_EvalFrameDefault()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:4213
39 0x000000000013d56c _PyEval_EvalFrame()  /usr/local/src/conda/python-3.10.17/Include/internal/pycore_ceval.h:46
40 0x000000000013d56c _PyEval_Vector()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5067
41 0x000000000013d56c _PyFunction_Vectorcall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:342
42 0x0000000000135487 _PyObject_FastCallDictTstate()  /usr/local/src/conda/python-3.10.17/Objects/call.c:153
43 0x0000000000135487 _PyObject_FastCallDictTstate()  /usr/local/src/conda/python-3.10.17/Objects/call.c:155
44 0x0000000000146799 _PyObject_Call_Prepend()  /usr/local/src/conda/python-3.10.17/Objects/call.c:431
45 0x0000000000146799 slot_tp_init()  /usr/local/src/conda/python-3.10.17/Objects/typeobject.c:7734
46 0x00000000001360cb type_call()  /usr/local/src/conda/python-3.10.17/Objects/typeobject.c:1135
47 0x00000000001360cb _PyObject_MakeTpCall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:215
48 0x0000000000132332 _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:112
49 0x0000000000132332 _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:99
50 0x0000000000132332 PyObject_Vectorcall()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:123
51 0x0000000000132332 call_function()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5893
52 0x0000000000132332 _PyEval_EvalFrameDefault()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:4231
53 0x000000000013d56c _PyEval_EvalFrame()  /usr/local/src/conda/python-3.10.17/Include/internal/pycore_ceval.h:46
54 0x000000000013d56c _PyEval_Vector()  /usr/local/src/conda/python-3.10.17/Python/ceval.c:5067
55 0x000000000013d56c _PyFunction_Vectorcall()  /usr/local/src/conda/python-3.10.17/Objects/call.c:342
56 0x000000000012d2dc _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:114
57 0x000000000012d2dc _PyObject_VectorcallTstate()  /usr/local/src/conda/python-3.10.17/Include/cpython/abstract.h:115
=================================
W0523 19:04:23.973000 168882 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 168949 closing signal SIGTERM
W0523 19:04:23.973000 168882 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 168950 closing signal SIGTERM
W0523 19:04:23.974000 168882 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 168951 closing signal SIGTERM
W0523 19:04:23.974000 168882 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 168952 closing signal SIGTERM
W0523 19:04:23.974000 168882 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 168955 closing signal SIGTERM
W0523 19:04:23.975000 168882 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 168956 closing signal SIGTERM
E0523 19:04:24.038000 168882 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -11) local_rank: 4 (pid: 168953) of binary: /home/bbrock/miniforge3/envs/xccl_py310/bin/python
Traceback (most recent call last):
  File "/home/bbrock/miniforge3/envs/xccl_py310/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
elastic_ddp.py FAILED
--------------------------------------------------------
Failures:
[1]:
  time      : 2025-05-23_19:04:23
  host      : hedp017-default
  rank      : 5 (local_rank: 5)
  exitcode  : -11 (pid: 168954)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 168954
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-23_19:04:23
  host      : hedp017-default
  rank      : 4 (local_rank: 4)
  exitcode  : -11 (pid: 168953)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 168953
========================================================

@jingxu10
Contributor

I think the error came from

[1748048660.695946664] hedp017:rank4.python: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
2025:05:23-19:04:20:(168953) |CCL_ERROR| atl_ofi_helper.cpp:1204 atl_ofi_prov_init: fi_scalable_ep(prov->domain, info, &prov->sep, nullptr)
 fails with ret: -12, strerror: Cannot allocate memory
2025:05:23-19:04:20:(168953) |CCL_ERROR| atl_ofi_helper.cpp:1242 atl_ofi_prov_init: can't init provider psm3:autoselect_one:mlx5_0;mlx5_2

It seems like the default ATL backend failed to initialize the psm3 provider.

Could you share the output of pip list | grep ccl and pip list | grep mpi?
If an MPI library is installed, you can try export CCL_ATL_TRANSPORT=mpi and then run the torchrun command again. If it still fails, I'll reach out to the internal team about the psm3 usage.
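
For reference, a minimal sketch of those two steps (the pip checks plus the MPI-transport retry); the torchrun arguments below simply mirror the command shown in the next comment, so adjust them to your launcher if they differ:

# 1) Check which oneCCL / MPI Python packages are installed in the active environment
pip list | grep ccl
pip list | grep mpi

# 2) If an MPI library is present, switch oneCCL's ATL transport from the OFI/psm3 path to MPI
export CCL_ATL_TRANSPORT=mpi

# 3) Re-run the same torchrun command
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d \
         --rdzv_endpoint=127.0.0.1:29400 elastic_ddp.py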

@BenBrock
Author

If I switch to CCL_ATL_TRANSPORT=mpi, I get an error about MPI groups:

(xccl_py310) bbrock@hedp017:~/src/ai/pytorch-distributed> !torch
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29400 elastic_ddp.py
W0523 19:14:36.008000 172327 site-packages/torch/distributed/run.py:766]
W0523 19:14:36.008000 172327 site-packages/torch/distributed/run.py:766] *****************************************
W0523 19:14:36.008000 172327 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0523 19:14:36.008000 172327 site-packages/torch/distributed/run.py:766] *****************************************
Start running basic DDP example on rank 5.
Start running basic DDP example on rank 3.
Start running basic DDP example on rank 2.
Start running basic DDP example on rank 4.
Start running basic DDP example on rank 0.
Start running basic DDP example on rank 6.
Start running basic DDP example on rank 1.
Start running basic DDP example on rank 7.
2025:05:23-19:14:38:(172397) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL2025:05:23-19:14:38:(172401) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL2025:05:23-19:14:38:(172394) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL


2025:05:23-19:14:38:(172396) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:05:23-19:14:38:(172395) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:05:23-19:14:38:(172400) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:05:23-19:14:38:(172398) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2025:05:23-19:14:38:(172399) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
Abort(1007262470) on node 0 (rank 0 in comm 0): Fatal error in internal_Group_incl: Unknown error class, error stack:
internal_Group_incl(34044).......: MPI_Group_incl(group=0x88000001, n=8, ranks=0x563b29996820, newgroup=0x7ffd839abbc8) failed
MPIR_Group_check_valid_ranks(257): Duplicate ranks in rank array at index 6, has value 0 which is also the value at index 80
Abort(1007262470) on node 0 (rank 0 in comm 0): Fatal error in internal_Group_incl: Unknown error class, error stack:
internal_GAbort(470391558) on node 0 (rank 0 in comm 0): Fatal error in internal_Group_incl: Unknown error class, error stack:
internal_GrAbort(940153606) on node 0 (rank 0 in comm 0): Fatal error in internal_Group_incl: Unknown error class, error stack:
internal_GrAbort(470391558) on node 0 (rank 0 in comm 0): Fatal error in internal_Group_incl: Unknown error class, error stack:
internal_Grroup_incl(34044).......: MPI_Group_incl(group=0x88000001, n=8, ranks=0x5622d95f9a50, newgroup=0x7ffecf6f7c38) failed
MPIR_Group_oup_incl(34044).......: MPI_Group_incl(group=0x88000001, n=8, ranks=0x56092f66d420, newgroup=0x7ffeaf9ec1e8) failed
MPIR_Group_coup_incl(34044).......: MPI_Group_incl(group=0x88000001, n=8, ranks=0x564cf5e3ffb0, newgroup=0x7ffdb828f388) failed
MPIR_Group_coup_incl(34044).......: MPI_Group_incl(group=0x88000001, n=8, ranks=0x5598458d43d0, newgroup=0x7fff39fe34f8) failed
MPIR_Group_ccheck_valid_ranks(257): Duplicate ranks in rank array at index 6, has value 0 which is also the value at index 80
heck_valid_ranks(257): Duplicate ranks in rank array at index 6, has value 0 which is also the value at index 80
heck_valid_ranks(257): Duplicate ranks in rank array at index 6, has value 0 which is also the value at index 80
heck_valid_ranks(257): Duplicate ranks in rank array at index 6, has value 0 which is also the value at index 80
Abort(67738374) on node 0 (rank 0 in comm 0): Fatal error in internal_Group_incl: Unknown error class, error stack:
internal_Group_incl(34044).......: MPI_Group_incl(group=0x88000001, n=8, ranks=0x55cac6e3cc30, newgroup=0x7fff5ed56448) failed
MPIR_Group_check_valid_ranks(257): Duplicate ranks in rank array at index 6, has value 0 which is also the value at index 80
Abort(67738374) on node 0 (rank 0 in comm 0): Fatal error in internal_Group_incl: Unknown error class, error stack:
internal_Group_incl(34044).......: MPI_Group_incl(group=0x88000001, n=8, ranks=0x558c668f6a60, newgroup=0x7ffcd3945fa8) failed
MPIR_Group_check_valid_ranks(257): Duplicate ranks in rank array at index 6, has value 0 which is also the value at index 80
Abort(67738374) on node 0 (rank 0 in comm 0): Fatal error in internal_Group_incl: Unknown error class, error stack:
internal_Group_incl(34044).......: MPI_Group_incl(group=0x88000001, n=8, ranks=0x5645ef00f7d0, newgroup=0x7ffcbfd77998) failed
MPIR_Group_check_valid_ranks(257): Duplicate ranks in rank array at index 6, has value 0 which is also the value at index 80
W0523 19:14:40.160000 172327 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 172400 closing signal SIGTERM
E0523 19:14:40.176000 172327 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 6) local_rank: 0 (pid: 172394) of binary: /home/bbrock/miniforge3/envs/xccl_py310/bin/python
Traceback (most recent call last):
  File "/home/bbrock/miniforge3/envs/xccl_py310/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/bbrock/miniforge3/envs/xccl_py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
elastic_ddp.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-05-23_19:14:40
  host      : hedp017-default
  rank      : 1 (local_rank: 1)
  exitcode  : 6 (pid: 172395)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2025-05-23_19:14:40
  host      : hedp017-default
  rank      : 2 (local_rank: 2)
  exitcode  : 6 (pid: 172396)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2025-05-23_19:14:40
  host      : hedp017-default
  rank      : 3 (local_rank: 3)
  exitcode  : 6 (pid: 172397)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2025-05-23_19:14:40
  host      : hedp017-default
  rank      : 4 (local_rank: 4)
  exitcode  : 6 (pid: 172398)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2025-05-23_19:14:40
  host      : hedp017-default
  rank      : 5 (local_rank: 5)
  exitcode  : 6 (pid: 172399)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2025-05-23_19:14:40
  host      : hedp017-default
  rank      : 7 (local_rank: 7)
  exitcode  : 6 (pid: 172401)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-23_19:14:40
  host      : hedp017-default
  rank      : 0 (local_rank: 0)
  exitcode  : 6 (pid: 172394)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@jingxu10
Contributor

I'll check and reply to you later.
