Could not start backend: cannot find tensor embeddings.word_embeddings.weight #533

momomobinx opened this issue Mar 26, 2025 · 10 comments

@momomobinx

System Info

docker

docker run \
        -d \
        --name reranker \
        --gpus '"device=0"' \
        --env CUDA_VISIBLE_DEVICES=0 \
        -p 7863:80 \
        -v /data/ai/models:/data \
        ghcr.io/huggingface/text-embeddings-inference:86-1.5 \
        --model-id "/data/bge-reranker-base" \
        --dtype "float16" \
        --max-concurrent-requests 2048 \
        --max-batch-tokens 32768000 \
        --max-batch-requests 128 \
        --max-client-batch-size 4096 \
        --auto-truncate \
        --tokenization-workers 64 \
        --payload-limit 16000000
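
For reference, a quick way to confirm the container came up correctly is a plain request against TEI's /rerank route on the mapped port (7863 here); the query/texts payload below is only an illustrative example:

curl 127.0.0.1:7863/rerank \
        -X POST \
        -H 'Content-Type: application/json' \
        -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is a subfield of machine learning.", "Cheese is made from milk."]}'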

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.142                Driver Version: 550.142        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:5E:00.0 Off |                  N/A |
| 42%   22C    P8             17W /  350W |   24237MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

docker run \
        -d \
        --name reranker \
        --gpus '"device=0"' \
        --env CUDA_VISIBLE_DEVICES=0 \
        -p 7863:80 \
        -v /data/ai/models:/data \
        ghcr.io/huggingface/text-embeddings-inference:86-1.5 \
        --model-id "/data/bge-reranker-base" \
        --dtype "float16" \
        --max-concurrent-requests 2048 \
        --max-batch-tokens 32768000 \
        --max-batch-requests 128 \
        --max-client-batch-size 4096 \
        --auto-truncate \
        --tokenization-workers 64 \
        --payload-limit 16000000
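
Since the container is started with -d, the startup failure only shows up in the container logs; capturing it is just a standard docker command (nothing TEI-specific):

docker logs reranker 2>&1 | tail -n 50
# the log ends with the error from the title:
# Could not start backend: cannot find tensor embeddings.word_embeddings.weight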

Expected behavior

It was running normally until I hit a context-too-long error; after that I couldn't successfully restart the model.

@alvarobartt
Member

alvarobartt commented Apr 4, 2025

Hey @momomobinx, I just tried to run it with the latest TEI release, 1.6.1 (in your case the URI would be ghcr.io/huggingface/text-embeddings-inference:86-1.6.1), and https://huggingface.co/BAAI/bge-reranker-base seems to load and work just fine. Could you check with that version instead? Thanks in advance 🤗

Release at https://github.com/huggingface/text-embeddings-inference/pkgs/container/text-embeddings-inference/383795099?tag=86-1.6.1
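
For reference, testing this should only require swapping the image tag in the original command, something like the sketch below (the remaining flags from the reproduction can be appended unchanged):

docker run \
        -d \
        --name reranker \
        --gpus '"device=0"' \
        --env CUDA_VISIBLE_DEVICES=0 \
        -p 7863:80 \
        -v /data/ai/models:/data \
        ghcr.io/huggingface/text-embeddings-inference:86-1.6.1 \
        --model-id "/data/bge-reranker-base" \
        --dtype "float16"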

@alvarobartt
Member

Also note that the reply above is in relation to the issue title of "Could not start backend: cannot find tensor embeddings.word_embeddings.weight", but I also tried with the same parameters as you ran TEI, and it also worked fine! If you could try to reproduce with the latest version or share more details that would be great 🤗

Also @momomobinx, I'll cc @jetnet too as they reacted to the comment assuming they may have the same or a similar issue!

@alvarobartt alvarobartt self-assigned this Apr 4, 2025
@momomobinx
Author

> Also note that the reply above is in relation to the issue title of "Could not start backend: cannot find tensor embeddings.word_embeddings.weight", but I also tried with the same parameters as you ran TEI, and it also worked fine! If you could try to reproduce with the latest version or share more details that would be great 🤗
>
> Also @momomobinx, I'll cc @jetnet too as they reacted to the comment assuming they may have the same or a similar issue!

The situation in which I encountered this issue: I had already loaded an LLM with VLLM and then started TGI on the same graphics card, and this problem occurred. When I exit VLLM and start TGI first, then VLLM, everything works normally.

@alvarobartt
Member

Hmm, but when you say TGI there you mean TEI, right? Also, the issue you mentioned in the title seems to be related to an unsupported architecture: it tries to load the tensors into the backend, but the mechanism to unwrap the values for each key in the tensors dict fails. That is not happening in the latest TEI release as far as I could test, so could you try with TEI again, or provide more details on how to reproduce the original issue? Thanks a lot in advance @momomobinx 🤗

@momomobinx
Author

momomobinx commented Apr 7, 2025

vllm start -> tei error

tei start -> vllm start -> works

vllm 0.7.3

nvidia-smi

nvidia-smi 
Mon Apr  7 07:28:36 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.142                Driver Version: 550.142        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:5A:00.0 Off |                  N/A |
| 41%   24C    P8             24W /  350W |   21695MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:5E:00.0 Off |                  N/A |
| 42%   23C    P8             17W /  350W |   21305MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  |   00000000:62:00.0 Off |                  N/A |
| 41%   23C    P8             18W /  350W |   20806MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  |   00000000:66:00.0 Off |                  N/A |
| 42%   24C    P8             20W /  350W |   20806MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    922333      C   text-embeddings-router                        880MiB |
|    0   N/A  N/A    924009      C   ...env/.conda/envs/new-vllm/bin/python      20798MiB |
|    1   N/A  N/A    924635      C   ...env/.conda/envs/new-vllm/bin/python      20792MiB |
|    1   N/A  N/A   1001862      C   text-embeddings-router                        496MiB |
|    2   N/A  N/A    924636      C   ...env/.conda/envs/new-vllm/bin/python      20792MiB |
|    3   N/A  N/A    924637      C   ...env/.conda/envs/new-vllm/bin/python      20792MiB |
+-----------------------------------------------------------------------------------------+

os-release

cat /etc/os-release 
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

docker -v

docker -v
Docker version 26.1.2, build 211e74b

vllm

nohup sh -c 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 python -m vllm.entrypoints.openai.api_server \
        --port=8000 \
        --gpu-memory-utilization=0.9 \
        --tensor-parallel-size=4 \
        --pipeline-parallel-size=1 \
        --trust-remote-code \
	--enable-prefix-caching \
	--num-scheduler-steps 10 \
	--max-model-len=28032 \
        --served-model-name deepseek_r1 \
	--enable-reasoning \
	--reasoning-parser  deepseek_r1 \
        --model /data/ai/models/QwQ-32B-AWQ  ' >> vllm.log 2>>vllm.log &

TEI embedding

docker run \
	-d \
	--name embedding \
	--gpus '"device=1"' \
	--env CUDA_VISIBLE_DEVICES=0 \
	-p 7862:80 \
	-v /data/ai/models:/data \
	ghcr.io/huggingface/text-embeddings-inference:86-1.5 \
	--model-id "/data/gte-base-zh" \
        --dtype "float16" \
        --pooling "mean" \
        --max-concurrent-requests 2048 \
        --max-batch-tokens 32768000 \
        --max-batch-requests 128 \
        --max-client-batch-size 4096 \
        --auto-truncate \
        --tokenization-workers 64 \
        --payload-limit 16000000

TEI reranker

docker run \
	-d \
	--name reranker \
	--gpus '"device=0"' \
	--env CUDA_VISIBLE_DEVICES=0 \
	-p 7863:80 \
	-v /data/ai/models:/data \
	ghcr.io/huggingface/text-embeddings-inference:86-1.5 \
	--model-id "/data/bge-reranker-base" \
        --dtype "float16" \
        --max-concurrent-requests 2048 \
        --max-batch-tokens 32768000 \
        --max-batch-requests 128 \
        --max-client-batch-size 4096 \
        --auto-truncate \
        --tokenization-workers 64 \
        --payload-limit 16000000
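
To double-check that both containers are actually serving once they start, a simple request against TEI's /embed route on the embedding container works (the /rerank check shown earlier covers the reranker); the input string is only an example:

curl 127.0.0.1:7862/embed \
        -X POST \
        -H 'Content-Type: application/json' \
        -d '{"inputs": "What is Deep Learning?"}'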

@momomobinx
Author

tei start -> nvidia-smi works inside the container -> vllm start -> docker exec -it embedding nvidia-smi now fails with:
Failed to initialize NVML: Unknown Error

@alvarobartt
Member

Ok, so what you want to achieve is to deploy the DeepSeek AWQ model sharded across all 4 x NVIDIA 3090 GPUs, and then attempt to re-use the remaining memory for TEI? For context, how are you measuring the available VRAM on those instances? Do you have more information on the failing stack trace? It's most likely due to missing memory, or to failing to allocate it since it's "shared" with another process running on the same device. Anyway, I'll try to investigate on my end, but it's most likely the aforementioned.
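
For instance, one way to check the per-GPU free memory as reported by the driver would be:

nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv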

@momomobinx
Author

Yes, it's just as you guessed: I am attempting to run TEI on the remaining resources. I originally wanted to collect some logs, but strangely, the issue doesn't seem to have occurred today.

@Narsil
Collaborator

Narsil commented Apr 8, 2025

I'm not super familiar with vLLM's recent work, but if it's anything like TGI (which is very similar), it will attempt to use ALL possible memory when loading up.

Therefore loading (TEI -> vLLM) works (because TEI takes resources first), while the other way around doesn't, because vLLM took all the memory, so TEI cannot work properly.
This is an educated guess, but it seems reasonable.

Now, the error message you're getting (cannot find tensor) seems quite misleading if my guess is true.

Now, to fix it: vLLM most likely has a flag to CAP its VRAM usage so you can spare some VRAM for TEI. In TGI it's called --cuda-memory-fraction (e.g. --cuda-memory-fraction 0.7, which would leave 30% of the VRAM for whatever else you want). I'm pretty sure vLLM has a similar flag.
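
For the record, judging by the launch command above, the vLLM equivalent appears to be --gpu-memory-utilization (already set to 0.9 there); lowering it would leave more headroom for TEI, e.g. a sketch along these lines (other flags from the original command unchanged):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 python -m vllm.entrypoints.openai.api_server \
        --port=8000 \
        --gpu-memory-utilization=0.8 \
        --tensor-parallel-size=4 \
        --model /data/ai/models/QwQ-32B-AWQ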

@momomobinx
Author

That is possible, but I have already limited VLLM's GPU memory usage to 0.9.
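
Rough back-of-the-envelope numbers (my own, not from the logs) for why 0.9 should in principle leave enough room, assuming vLLM respects the cap exactly:

# per 24576 MiB RTX 3090 with --gpu-memory-utilization=0.9
echo $(( 24576 - 24576 * 9 / 10 ))   # -> 2458 MiB left per GPU
# the text-embeddings-router processes in nvidia-smi above use ~500-880 MiB, so TEI should fit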
