Labels: bug (Something isn't working), onnx (Related to ONNX or ONNXRuntime)
Description
Hi Triton team,
I am deploying the NVIDIA NeMo Titanet encoder model (speaker diarization) using Triton Inference Server with the ONNX Runtime backend. My goal is to support multiple concurrent clients, so I enabled dynamic batching.
However, when dynamic batching is enabled, inference fails with a cuDNN error. The same model works correctly when dynamic batching is disabled (single request per instance).
Environment:
- Triton Inference Server: 2.42.0
- Backend: onnxruntime_onnx (CUDA)
- GPU: NVIDIA GPU (single GPU setup)
- Model: NeMo Titanet encoder (ONNX)
- CUDA / cuDNN: Default versions from Triton 2.42.0 container
Triton Model Configuration:
name: "titanet_encoder"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  {
    name: "features"
    data_type: TYPE_FP32
    dims: [80, -1]
  },
  {
    name: "length"
    data_type: TYPE_INT64
    dims: [1]
  }
]
output [
  {
    name: "embeddings"
    data_type: TYPE_FP32
    dims: [6144]
  }
]
dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 2000
}
instance_group [
  {
    kind: KIND_GPU
    count: 1
  }
]
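To illustrate why batching fails, here is a NumPy-only sketch of what two requests against this config look like. The `dims: [80, -1]` entry means each request's "features" tensor is `[batch, 80, T]` with `T` varying per utterance, and the dynamic batcher concatenates requests along the batch axis, which requires every non-batch dimension to match (the shapes here are per the config above; the 120/95 frame counts are made-up examples):

```python
import numpy as np

# Two hypothetical requests with different utterance lengths T.
# Per the model config, "features" is [batch, 80, T] FP32 and
# "length" is [batch, 1] INT64.
req_a_features = np.random.randn(1, 80, 120).astype(np.float32)
req_a_length = np.array([[120]], dtype=np.int64)

req_b_features = np.random.randn(1, 80, 95).astype(np.float32)
req_b_length = np.array([[95]], dtype=np.int64)

# Concatenating along the batch axis (what the dynamic batcher
# effectively does) fails because T differs (120 vs 95).
try:
    np.concatenate([req_a_features, req_b_features], axis=0)
except ValueError as e:
    print("cannot batch:", e)
```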
Error Observed
When multiple requests with different audio lengths are dynamically batched, inference fails with:
tritonclient.utils.InferenceServerException: [500] onnx runtime error 1:
Non-zero status code returned while running FusedConv node.
Name:'/encoder/encoder/encoder.1/res.0.0/conv/Conv'
Status Message: CUDNN failure 3: CUDNN_STATUS_BAD_PARAM
file=onnxruntime/contrib_ops/cuda/fused_conv.cc
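One common client-side workaround (my assumption about a fix, not something Triton does automatically) is to pad every request's features to a fixed maximum length before sending, so all batched requests share the same shape and the real frame count travels in the "length" input. A rough sketch, where `MAX_T` is a hypothetical cap you would tune to your audio lengths:

```python
import numpy as np

MAX_T = 1000  # hypothetical maximum frame count, tune for your audio


def pad_features(features: np.ndarray, max_t: int = MAX_T):
    """Right-pad [80, T] FP32 features with zeros to [80, max_t].

    Returns the padded features plus the original length, which the
    model can use to mask out the padding.
    """
    t = features.shape[1]
    if t > max_t:
        raise ValueError(f"utterance has {t} frames, exceeds max_t={max_t}")
    padded = np.zeros((features.shape[0], max_t), dtype=np.float32)
    padded[:, :t] = features
    length = np.array([t], dtype=np.int64)
    return padded, length


# Example: a 120-frame utterance padded to MAX_T frames.
feats = np.random.randn(80, 120).astype(np.float32)
padded, length = pad_features(feats)
print(padded.shape, length)  # (80, 1000) [120]
```

This assumes the Titanet encoder actually honors the "length" input when pooling over frames; if it does not, padding would change the embeddings, and a fixed-shape export or disabling dynamic batching would be safer.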