Description
When using the /tokenize endpoint with the bge-large-zh-v1.5 model deployed via text-embeddings-inference, the returned text, start, and stop fields do not align with the actual token IDs. This behavior differs from the results produced by the transformers library.
Information
Docker
The CLI directly
Tasks
An officially supported command
My own modifications
Reproduction
Steps to Reproduce
Deploy BAAI/bge-large-zh-v1.5 using text-embeddings-inference.
Send a tokenization request:
curl localhost/tokenize -X POST -d '{"inputs":"北京天安门"}' -H 'Content-Type: application/json'
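The same request can be sent from Python and the response screened for the symptoms described in this report. This is a minimal sketch under assumptions: the `http://` scheme and default port are guessed from the curl command, the response is assumed to be a JSON list with one list of token objects per input, and each token object is assumed to carry `text`, `start`, and `stop` fields (with `start`/`stop` possibly null for special tokens). The helper names `tokenize` and `suspicious_tokens` are hypothetical.

```python
import json
import urllib.request


def tokenize(inputs, url="http://localhost/tokenize"):
    """POST the same JSON payload as the curl command and return the parsed response."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def suspicious_tokens(inputs, tokens):
    """Return tokens with empty text or a stop offset past the end of the input.

    Offsets are compared against the character length of `inputs`.
    Tokens without offsets (assumed: special tokens) are skipped.
    """
    flagged = []
    for token in tokens:
        start, stop = token.get("start"), token.get("stop")
        if start is None or stop is None:
            continue
        if token.get("text") == "" or stop > len(inputs):
            flagged.append(token)
    return flagged
```

With a running server, `suspicious_tokens("北京天安门", tokenize("北京天安门")[0])` would list the tokens whose offsets fall outside the five-character input. The check does not catch a token whose span is in range but too wide, so it is a screen rather than a full validation.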
System Info
{"model_id":"bge-large-zh-v1.5/","model_sha":null,"model_dtype":"float16","model_type":{"embedding":{"pooling":"cls"}},"max_concurrent_requests":512,"max_input_length":512,"max_batch_tokens":16384,"max_batch_requests":null,"max_client_batch_size":32,"auto_truncate":true,"tokenization_workers":10,"version":"1.6.0","sha":"f0e491a290385ef06f0871d188b21c0308ba86d6","docker_label":"sha-f0e491a"}
Expected behavior
In the correct tokenization (verified via transformers), start/stop offsets align with character boundaries (e.g., 1266 corresponds to 北 at position 0-1).
Actual Behavior
Token 1266 incorrectly maps to 北京天 (positions 0-3) instead of 北 (position 0-1).
Tokens 1921, 2128, and 7305 return empty text values despite valid IDs.
Offsets (e.g., start:6, stop:9 for token 1921) do not match the expected single-character alignment.
Additional Context
Tokenizing the same input with transformers shows the correct behavior.
Environment
Model: BAAI/bge-large-zh-v1.5
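One possible reading of the offsets reported under Actual Behavior, not confirmed anywhere in this report, is that start and stop are UTF-8 byte offsets that are then used as character indices to slice the input. Each of the five characters in 北京天安门 occupies three bytes in UTF-8, so the first character spans bytes 0-3 and the third spans bytes 6-9, which matches the values quoted above. The sketch below only illustrates that arithmetic in plain Python; it does not call the server, and the helper name `char_to_byte_offsets` is hypothetical.

```python
def char_to_byte_offsets(text):
    """Return the (start, stop) UTF-8 byte offsets of each character in text."""
    offsets = []
    position = 0
    for char in text:
        width = len(char.encode("utf-8"))  # 3 bytes for each of these CJK characters
        offsets.append((position, position + width))
        position += width
    return offsets


text = "北京天安门"
byte_offsets = char_to_byte_offsets(text)

# Misuse the byte offsets as character indices, as the hypothesis assumes.
sliced = [text[start:stop] for start, stop in byte_offsets]

for char, (start, stop), piece in zip(text, byte_offsets, sliced):
    print(char, start, stop, repr(piece))
# 北 0 3 '北京天'
# 京 3 6 '安门'
# 天 6 9 ''
# 安 9 12 ''
# 门 12 15 ''
```

Under this assumption the first token would come back as 北京天 and the last three tokens would come back empty, which is the pattern described in this issue. Whether the server actually computes offsets this way would need to be checked against its source.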
Let me know if you need further details to investigate this! 🙌