
CUDA IPC Handle failure in WSL2 forces CPU memory fallback, causing massive PCIe bottlenecks in BLS pipelines #8670

@CarlosNacher

Description

We are running a complex vision pipeline in Triton using Business Logic Scripting (BLS) in a Python Backend. The pipeline consists of:

  1. Python BLS receives image bytes.
  2. BLS calls a DALI backend model (running on GPU) to decode and preprocess images.
  3. BLS receives the preprocessed tensors (large images, ~178 MB per batch) and sends them to multiple TensorRT models for inference.
  4. To avoid PCIe overhead, we attempt to keep the DALI output tensors in VRAM by setting preferred_memory=pb_utils.PreferredMemory(pb_utils.TRITONSERVER_MEMORY_GPU) in the InferenceRequest inside the BLS.
  5. However, because we are running on Windows / WSL2, Triton throws the following error when trying to pass the GPU tensor back to the Python stub:
```
Failed to initialize CUDA shared memory pool in Python stub: Failed to open the cudaIpcHandle. error: invalid resource handle
```

Because CUDA IPC is not fully supported (or fails) under WSL2 virtualization, we are forced to fall back to TRITONSERVER_MEMORY_CPU. Triton must then download 178 MB from VRAM to RAM, and re-upload it to VRAM for each subsequent TRT model call. This creates a massive PCIe bottleneck (~100 ms per BLS async call) and causes severe GPU contention.
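Back-of-the-envelope arithmetic makes the ~100 ms figure plausible. This is only a sketch: the ~6 GB/s effective PCIe bandwidth and the number of downstream TRT models are assumed values, not measurements from our setup.

```python
# Rough cost of the forced CPU fallback: each BLS call pays one
# device-to-host copy of the DALI output, plus one host-to-device
# copy per downstream TRT model.
BATCH_BYTES = 178 * 1024**2   # ~178 MB per batch, as described above
EFFECTIVE_BW = 6 * 1024**3    # assumed effective PCIe bandwidth, bytes/s
NUM_TRT_MODELS = 2            # hypothetical number of downstream TRT calls

d2h_ms = BATCH_BYTES / EFFECTIVE_BW * 1000                    # VRAM -> RAM, once
h2d_ms = NUM_TRT_MODELS * BATCH_BYTES / EFFECTIVE_BW * 1000   # RAM -> VRAM, per model

print(f"D2H: {d2h_ms:.0f} ms, H2D: {h2d_ms:.0f} ms, total: {d2h_ms + h2d_ms:.0f} ms")
```

With these assumed numbers the round trips alone account for roughly 87 ms per call, which matches the observed latency order of magnitude.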

Attached are Triton traces where the described behaviour can be observed (the DALI model in station01 takes ~250 ms instead of ~20 ms, which I believe is caused by the CPU/PCIe bottleneck).

trace-20260220-123000-task67-batch-2BLS2DALI_copiaparaIssue.json

Triton Information
What version of Triton are you using?

  • NVIDIA Release 25.02 (build 143749457)
  • Triton Server Version 2.55.0

Are you using the Triton container or did you build it yourself?
Using the official NVIDIA Triton container (FROM nvcr.io/nvidia/tritonserver:25.02-py3) running on Docker Engine under WSL2 on Windows 10. Also tried on another PC with Windows 10 + Docker Desktop instead of Docker Engine.

To Reproduce

  • Run Triton inside WSL2/Docker on a Windows host machine with a GPU.
  • Create a Python BLS model.
  • Call a child model (e.g., DALI or any TRT model) that outputs a tensor residing in GPU memory.
  • In the BLS InferenceRequest, explicitly request that the output stay in GPU memory via preferred_memory=pb_utils.PreferredMemory(pb_utils.TRITONSERVER_MEMORY_GPU).
  • The request fails with the `Failed to open the cudaIpcHandle` error.
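For reference, a minimal sketch of the failing BLS call, following the python_backend BLS API; the model name, tensor names, and input array are placeholders for our actual pipeline:

```python
# Inside the Python backend model's execute() method.
# "dali_preprocess", "IMAGE_BYTES" and "PREPROCESSED" are placeholder names.
import triton_python_backend_utils as pb_utils

infer_request = pb_utils.InferenceRequest(
    model_name="dali_preprocess",
    requested_output_names=["PREPROCESSED"],
    inputs=[pb_utils.Tensor("IMAGE_BYTES", image_bytes_np)],
    # Ask Triton to leave the output in VRAM (device 0) instead of
    # copying it to host memory.
    preferred_memory=pb_utils.PreferredMemory(
        pb_utils.TRITONSERVER_MEMORY_GPU, 0),
)
infer_response = infer_request.exec()
# Under WSL2 this fails with:
#   Failed to open the cudaIpcHandle. error: invalid resource handle
```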

Expected behavior
We expect one of the following:

  • Triton to be able to share GPU pointers between the C++ backend and the Python stub in WSL2 without relying on standard Linux IPC mechanisms that break under Windows virtualization.
  • Alternatively, a mechanism to pass raw DLPack capsules/pointers directly between GPU models orchestrated by a Python BLS without the Python runtime ever attempting to map or inspect the memory pool.

Currently, this WSL2 limitation forces a Device-to-Host and Host-to-Device copy for every step in a BLS pipeline, leaving high-performance edge AI on Windows severely handicapped compared to native Linux.
