Dynamic batching with variable-length audio fails for NeMo Titanet (ONNX Runtime CUDA – CUDNN_STATUS_BAD_PARAM) #8644

@varshilgandhi

Description

Hi Triton team,

I am deploying the NVIDIA NeMo Titanet encoder model (speaker diarization) using Triton Inference Server with the ONNX Runtime backend. My goal is to support multiple concurrent clients, so I enabled dynamic batching.

However, when dynamic batching is enabled, inference fails with a cuDNN error. The same model works correctly when dynamic batching is disabled (single request per instance).

Environment:

  • Triton Inference Server: 2.42.0
  • Backend: onnxruntime_onnx (CUDA)
  • GPU: NVIDIA GPU (single GPU setup)
  • Model: NeMo Titanet encoder (ONNX)
  • CUDA / cuDNN: Default versions from Triton 2.42.0 container

Triton Model Configuration:

name: "titanet_encoder"
platform: "onnxruntime_onnx"

max_batch_size: 32

input [
  {
    name: "features"
    data_type: TYPE_FP32
    dims: [80, -1]
  },
  {
    name: "length"
    data_type: TYPE_INT64
    dims: [1]
  }
]

output [
  {
    name: "embeddings"
    data_type: TYPE_FP32
    dims: [6144]
  }
]

dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 2000
}

instance_group [
  {
    kind: KIND_GPU
    count: 1
  }
]
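To illustrate what I believe is going on (my understanding, not a statement from the Triton docs): batching concatenates request tensors along a new leading batch dimension, which only works when all non-batch dimensions agree. With the `[80, -1]` features input above, two requests of different lengths cannot be stacked:

```python
import numpy as np

# Two requests with the same feature dim (80) but different time lengths,
# matching the dims: [80, -1] of the "features" input above.
req_a = np.zeros((80, 100), dtype=np.float32)
req_b = np.zeros((80, 120), dtype=np.float32)

# Stacking into a batch fails because the time dimensions differ.
try:
    np.stack([req_a, req_b])
except ValueError as e:
    print("cannot batch:", e)
```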

Error Observed

When the dynamic batcher combines multiple requests whose audio lengths differ (i.e., different sizes of the -1 time dimension of `features`), inference fails with:

tritonclient.utils.InferenceServerException: [500] onnx runtime error 1:
Non-zero status code returned while running FusedConv node.
Name:'/encoder/encoder/encoder.1/res.0.0/conv/Conv'
Status Message: CUDNN failure 3: CUDNN_STATUS_BAD_PARAM
file=onnxruntime/contrib_ops/cuda/fused_conv.cc
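As a client-side workaround I am experimenting with (a sketch under my own assumptions, not a confirmed fix): pad every request to a common time length so that dynamically batched requests share identical shapes, and send the true length in the `length` input so the model can mask out the padding. The `pad_features` helper below is hypothetical, not part of NeMo or tritonclient:

```python
import numpy as np

def pad_features(feats, target_len, pad_value=0.0):
    """Pad a (80, T) feature matrix along time to target_len.

    Returns the padded features plus the original length T, which is
    sent as the model's `length` input so padding can be ignored.
    """
    n_mels, t = feats.shape
    if t > target_len:
        raise ValueError(f"sequence length {t} exceeds target {target_len}")
    padded = np.full((n_mels, target_len), pad_value, dtype=np.float32)
    padded[:, :t] = feats
    return padded, np.array([t], dtype=np.int64)

# Pad two variable-length requests to the longer one; they now batch cleanly.
a = np.random.rand(80, 100).astype(np.float32)
b = np.random.rand(80, 120).astype(np.float32)
target = max(a.shape[1], b.shape[1])
pa, la = pad_features(a, target)
pb, lb = pad_features(b, target)
batch = np.stack([pa, pb])  # shape (2, 80, 120)
print(batch.shape, la, lb)
```

This avoids mismatched shapes at the batcher, but I would still like to understand why the server accepts the requests and then fails inside FusedConv instead of rejecting or padding them.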

Labels: bug, onnx