System Info
Environment:
- CUDA Driver Version: 550.105.08
- CUDA Version: 13.0
- GPU: NVIDIA H20
- GPU Memory: 97871 MiB (~96 GB)
- Platform: OpenShift/Kubernetes
- TEI Version: 1.8.3
- TEI Image: ghcr.io/huggingface/text-embeddings-inference:hopper-1.8.3
- Model: Qwen/Qwen3-Embedding-8B
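
For anyone reproducing, the figures above can be collected with standard driver tooling; a minimal sketch (nvidia-smi prints the driver's supported CUDA version in its header, and the query flags below are standard nvidia-smi fields):

nvidia-smi
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv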
Launch Parameters:
--model-id Qwen/Qwen3-Embedding-8B
--pooling mean
--max-batch-requests 128
--max-concurrent-requests 256
--max-batch-tokens 40960
--dtype float16
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction

1. Pull and run the TEI hopper image on a system with CUDA 13.0:

docker run --gpus all -p 80:80 \
  ghcr.io/huggingface/text-embeddings-inference:hopper-1.8.3 \
  --model-id Qwen/Qwen3-Embedding-8B \
  --pooling mean \
  --max-batch-requests 128 \
  --max-concurrent-requests 256 \
  --max-batch-tokens 40960 \
  --dtype float16
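
(Optional) To compare the CUDA runtime libraries bundled in the image against the host driver, a quick sketch, assuming the image is Ubuntu-based with a shell and a populated linker cache:

docker run --rm --gpus all --entrypoint /bin/sh \
  ghcr.io/huggingface/text-embeddings-inference:hopper-1.8.3 \
  -c 'ldconfig -p | grep -i cuda'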
2. Observe the initialization logs:
- Container starts and text embeddings router initializes
- Attempts to load model: Qwen/Qwen3-Embedding-8B
- WARNING appears: "Could not find a Sentence Transformers config"
- INFO: "Maximum number of tokens per request: 40960"
- INFO: "Starting 8 tokenization workers"
- INFO: "Starting model backend"
- INFO: "Starting FlashOwn3 model on Cuda(CudaDevice(DeviceId(1)))"
- CRASH: "Floating point exception (core dumped)"
3. Check GPU status:

nvidia-smi

Result: nvidia-smi shows 0% GPU utilization, 0 MiB memory usage, and no running processes.

4. Overall result: the container fails to serve embeddings and the GPU remains unused.
Note: the same model works correctly with the standard CUDA 12 variant:

docker run --gpus all -p 80:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.8.3 \
  --model-id Qwen/Qwen3-Embedding-8B \
  --pooling mean
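
Once the CUDA 12 container is up, embeddings can be verified against TEI's documented /embed route, e.g. (port 80 per the -p 80:80 mapping above):

curl 127.0.0.1:80/embed \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is Deep Learning?"}'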
Expected behavior
The container should start successfully, load the model on the GPU, and serve embedding requests without crashing.
Questions:
- Is CUDA 13.x support planned for the hopper variant?
- Can a hopper-cuda13 image variant be provided?