Description
We are running a complex vision pipeline in Triton using Business Logic Scripting (BLS) in a Python Backend. The pipeline consists of:
- Python BLS receives image bytes.
- BLS calls a DALI backend model (running on GPU) to decode and preprocess images.
- BLS receives the preprocessed tensors (large images, ~178 MB per batch) and sends them to multiple TensorRT models for inference.
- To avoid PCIe overhead, we attempt to keep the DALI output tensors in VRAM by setting `preferred_memory=pb_utils.PreferredMemory(pb_utils.TRITONSERVER_MEMORY_GPU)` in the `InferenceRequest` inside the BLS.
- However, because we are running on Windows / WSL2, Triton throws the following error when trying to pass the GPU tensor back to the Python stub:
```
Failed to initialize CUDA shared memory pool in Python stub: Failed to open the cudaIpcHandle. error: invalid resource handle
```
Because CUDA IPC is not fully supported (or fails outright) under WSL2 virtualization, we are forced to fall back to `TRITONSERVER_MEMORY_CPU`. Triton then downloads ~178 MB from VRAM to RAM and re-uploads it to VRAM for each subsequent TRT model call. This creates a massive PCIe bottleneck (~100 ms per BLS async call) and causes severe GPU contention.
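As a rough sanity check on that ~100 ms figure (assuming an effective transfer bandwidth of ~6 GB/s, a conservative value for PCIe 3.0 x16 with non-pinned host memory; the real number depends on the platform), the raw round-trip copy alone accounts for most of it:

```python
def round_trip_copy_ms(batch_mb: float, bandwidth_gbs: float = 6.0) -> float:
    """Estimate the D2H + H2D round-trip time for one batch, in milliseconds."""
    batch_gb = batch_mb / 1024.0
    return 2 * batch_gb / bandwidth_gbs * 1000.0

# ~58 ms of pure transfer for a 178 MB batch, before any sync/launch overhead
print(round_trip_copy_ms(178))
```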
Attached are Triton traces where the described behaviour can be observed (the DALI model in station01 takes 250 ms instead of 20 ms, which I believe is caused by the CPU / PCIe bottleneck).
trace-20260220-123000-task67-batch-2BLS2DALI_copiaparaIssue.json
Triton Information
What version of Triton are you using?
- NVIDIA Release 25.02 (build 143749457)
- Triton Server Version 2.55.0
Are you using the Triton container or did you build it yourself?
Using the official NVIDIA Triton container (`FROM nvcr.io/nvidia/tritonserver:25.02-py3`) running on Docker Engine under WSL2 on Windows 10. Also tried on another PC with Windows 10 + Docker Desktop instead of Docker Engine.
To Reproduce
- Run Triton inside WSL2/Docker on a Windows host machine with a GPU.
- Create a Python BLS model.
- Call a child model (e.g., DALI or any TRT model) that outputs a tensor residing in GPU memory.
- In the BLS `InferenceRequest`, explicitly request the output to stay in GPU memory via `preferred_memory`.
- The request crashes with the `Failed to open the cudaIpcHandle` error.
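The steps above boil down to a `model.py` along these lines (a minimal sketch with hypothetical model and tensor names, `dali_preprocess`, `ENCODED_IMAGE`, `PREPROCESSED`; `triton_python_backend_utils` is only importable inside the Triton server):

```python
# Minimal BLS model.py sketch reproducing the failure (hypothetical names;
# triton_python_backend_utils is provided by the Triton server at runtime).
class TritonPythonModel:
    def execute(self, requests):
        import triton_python_backend_utils as pb_utils  # only exists in Triton

        responses = []
        for request in requests:
            encoded = pb_utils.get_input_tensor_by_name(request, "ENCODED_IMAGE")

            # Ask the child model to leave its output in GPU memory on device 0.
            # Under WSL2, this is what triggers "Failed to open the cudaIpcHandle".
            dali_request = pb_utils.InferenceRequest(
                model_name="dali_preprocess",
                requested_output_names=["PREPROCESSED"],
                inputs=[encoded],
                preferred_memory=pb_utils.PreferredMemory(
                    pb_utils.TRITONSERVER_MEMORY_GPU, 0),
            )
            dali_response = dali_request.exec()

            out = pb_utils.get_output_tensor_by_name(dali_response, "PREPROCESSED")
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```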
Expected behavior
We expect one of the following:
- Triton to be able to share GPU pointers between the C++ backend and the Python stub in WSL2 without relying on standard Linux IPC mechanisms that break under Windows virtualization.
- Alternatively, a mechanism to pass raw DLPack capsules/pointers directly between GPU models orchestrated by a Python BLS without the Python runtime ever attempting to map or inspect the memory pool.
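For reference, the zero-copy handoff we have in mind is roughly what the existing `pb_utils.Tensor.to_dlpack()` / `from_dlpack()` APIs already express; a sketch of the desired flow (hypothetical model and tensor names; runs only inside Triton):

```python
# Sketch of the desired GPU-to-GPU handoff via DLPack (hypothetical names;
# triton_python_backend_utils is provided by the Triton server at runtime).
def forward_on_gpu(preprocessed_response):
    import triton_python_backend_utils as pb_utils  # only exists in Triton

    # Take the DALI output without copying it off the GPU.
    tensor = pb_utils.get_output_tensor_by_name(preprocessed_response, "PREPROCESSED")
    capsule = tensor.to_dlpack()  # opaque handle to the GPU buffer

    # Re-wrap the same buffer as the input of the next TRT model call.
    trt_input = pb_utils.Tensor.from_dlpack("INPUT__0", capsule)
    trt_request = pb_utils.InferenceRequest(
        model_name="trt_detector",  # hypothetical downstream model
        requested_output_names=["OUTPUT__0"],
        inputs=[trt_input],
    )
    return trt_request.exec()
```

The ask is that this path works on WSL2 even when the Python stub cannot map the CUDA shared memory pool via `cudaIpcOpenMemHandle`.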
Currently, this WSL2 limitation forces a Device-to-Host and Host-to-Device copy for every step in a BLS pipeline, leaving high-performance Edge AI on Windows severely handicapped compared to native Linux.