Could not start backend: cannot find tensor embeddings.word_embeddings.weight #533
Comments
Hey @momomobinx, I just tried to run it with the latest release of TEI, 1.6.1; in your case the URI looks like
Also note that the reply above relates to the issue title, "Could not start backend: cannot find tensor embeddings.word_embeddings.weight", but I also tried with the same parameters you used for TEI, and it worked fine as well! If you could try to reproduce with the latest version or share more details, that would be great 🤗 Also @momomobinx, I'll cc @jetnet too, as they reacted to the comment and may have the same or a similar issue!
The situation where I encountered this issue: I had already loaded an LLM with vLLM, then started TGI on the same graphics card, and this problem occurred. If I exit vLLM, start TGI first, and then start vLLM, everything works normally.
Hmm, but when you say TGI there you mean TEI, right? Also, the issue you mentioned in the title seems to be related to an unsupported architecture: it tries to load the tensors into the backend, but the mechanism that unwraps the values for each key in the tensors dict fails. That is not happening in the latest TEI release as far as I could test, so could you try with TEI again, or provide more details on how to reproduce the original issue? Thanks a lot in advance @momomobinx 🤗
vllm start -> tei error
tei start -> vllm start -> work

vllm 0.7.3

nvidia-smi
os-release
docker -v
vllm
TEI embedding
TEI reranker
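(The outputs for the items listed above were not captured in this transcript; as a minimal sketch, these are the commands that would typically produce the host-side ones:)

nvidia-smi
cat /etc/os-release
docker -v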
tei start -> the nvidia-smi command works inside the container -> vllm start -> docker exec -it embedding nvidia-smi
Ok, so what you want to achieve is to deploy the DeepSeek AWQ model sharded across all 4 NVIDIA 3090 GPUs, and then re-use the remaining memory for TEI? For context, how are you measuring the available VRAM on those instances? Do you have more information on the failing stack trace? It's most likely due to missing memory, or to failing to allocate it because the device is "shared" with another process. Anyway, I'll try to investigate on my end, but it's most likely due to the aforementioned.
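As a sketch for measuring this (the container name "embedding" is taken from the command in the previous comment), nvidia-smi's query flags report per-GPU memory both from the host and from inside the TEI container:

nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv
docker exec -it embedding nvidia-smi --query-gpu=index,memory.free --format=csv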
Yes, it's just as you guessed. I am attempting to run TEI on the remaining resources. I originally wanted to collect some logs, but strangely the issue doesn't seem to have occurred today.
I'm not super familiar with vLLM or its recent work, but if it's anything like TGI (which is very similar), it will attempt to use ALL available memory when loading up. Therefore loading (TEI -> vLLM) works (TEI takes its resources first), while the other way around doesn't, because vLLM has taken all the memory, so TEI cannot work properly. The error message you're seeing (cannot find tensor) seems quite misleading if my guess is true... To fix it, vLLM most likely has a flag to cap its VRAM usage so you can spare some VRAM for TEI; in TGI it's called
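For illustration only (a sketch, not a verified config): vLLM does expose --gpu-memory-utilization, the fraction of each GPU's VRAM it will reserve (default 0.9); the 0.8 value and the model placeholder below are assumptions, not a recommendation:

vllm serve <your-deepseek-awq-model> \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.8

Lowering that fraction leaves per-GPU headroom for TEI's weights and CUDA context when both run on the same cards.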
That is possible, but I have already limited vLLM's memory usage to 0.9.
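A rough back-of-envelope (assuming the stock 24 GiB per RTX 3090): a 0.9 utilization still reserves about 21.6 GiB per card, leaving only around 2.4 GiB for everything else, including the CUDA context of any additional process:

awk 'BEGIN { total=24; frac=0.9; printf "reserved ~ %.1f GiB, free ~ %.1f GiB\n", total*frac, total*(1-frac) }'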
System Info
docker
nvidia-smi
Information
Tasks
Reproduction
docker run \
  -d \
  --name reranker \
  --gpus '"device=0"' \
  --env CUDA_VISIBLE_DEVICES=0 \
  -p 7863:80 \
  -v /data/ai/models:/data \
  ghcr.io/huggingface/text-embeddings-inference:86-1.5 \
  --model-id "/data/bge-reranker-base" \
  --dtype "float16" \
  --max-concurrent-requests 2048 \
  --max-batch-tokens 32768000 \
  --max-batch-requests 128 \
  --max-client-batch-size 4096 \
  --auto-truncate \
  --tokenization-workers 64 \
  --payload-limit 16000000
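As a quick sanity check (a sketch; the port matches the -p 7863:80 mapping above and the request shape follows TEI's /rerank endpoint), a request like this should return scores once the container is up:

curl 127.0.0.1:7863/rerank \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is a subfield of machine learning.", "Cheese is made from milk."]}'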
Expected behavior
It was running normally before, until I hit a "context too long" error; after that, I couldn't successfully restart the model.