Incorrect Tokenization Output for bge-large-zh-v1.5 Model #494

Open
3 of 4 tasks
gaohongkui opened this issue Feb 11, 2025 · 0 comments
System Info

{"model_id":"bge-large-zh-v1.5/","model_sha":null,"model_dtype":"float16","model_type":{"embedding":{"pooling":"cls"}},"max_concurrent_requests":512,"max_input_length":512,"max_batch_tokens":16384,"max_batch_requests":null,"max_client_batch_size":32,"auto_truncate":true,"tokenization_workers":10,"version":"1.6.0","sha":"f0e491a290385ef06f0871d188b21c0308ba86d6","docker_label":"sha-f0e491a"}

Description
When using the /tokenize endpoint with the bge-large-zh-v1.5 model deployed via text-embeddings-inference, the returned text, start, and stop fields do not align with the actual token IDs. This behavior differs from the results produced by the transformers library.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to Reproduce

  1. Deploy BAAI/bge-large-zh-v1.5 using text-embeddings-inference.
  2. Send a tokenization request:
curl localhost/tokenize -X POST -d '{"inputs":"北京天安门"}' -H 'Content-Type: application/json'
  3. Observe the response:
[[
  {"id":101,"text":"[CLS]","special":true,"start":null,"stop":null},
  {"id":1266,"text":" 北京天","special":false,"start":0,"stop":3},
  {"id":776,"text":"安门","special":false,"start":3,"stop":6},
  {"id":1921,"text":"","special":false,"start":6,"stop":9},
  {"id":2128,"text":"","special":false,"start":9,"stop":12},
  {"id":7305,"text":"","special":false,"start":12,"stop":15},
  {"id":102,"text":"[SEP]","special":true,"start":null,"stop":null}
]]

Expected behavior

The correct tokenization (verified via transformers) should produce:

101 -> [CLS]
1266 -> 北
776 -> 京
1921 -> 天
2128 -> 安
7305 -> 门
102 -> [SEP]
  • Each token ID should map to a single character in the input text.
  • start/stop offsets should align with character boundaries (e.g., token 1266 (北) should span positions 0-1).

Actual Behavior

  • Token 1266 incorrectly maps to 北京天 (positions 0-3) instead of 北 (positions 0-1).
  • Tokens 1921, 2128, 7305 return empty text values despite valid IDs.
  • Offset positions (e.g., start: 6, stop: 9 for token 1921) do not match the expected single-character spans.
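The observed start/stop values (0-3, 3-6, 6-9, …) are exactly the UTF-8 byte spans of the input characters. A minimal sketch of my hypothesis (a guess, not confirmed against the text-embeddings-inference source): the tokenizer reports byte offsets, but the response slices the input by character index, which reproduces the buggy output exactly:

```python
# Hypothesis (not confirmed in the TEI source): the tokenizer reports
# *byte* offsets, but the /tokenize response slices the input string by
# *character* index. Each CJK character is 3 bytes in UTF-8, which
# reproduces the observed text/start/stop fields exactly.
text = "北京天安门"

# UTF-8 byte span of each character: (0,3), (3,6), (6,9), (9,12), (12,15)
byte_offsets = []
pos = 0
for ch in text:
    width = len(ch.encode("utf-8"))
    byte_offsets.append((pos, pos + width))
    pos += width

# Slicing by character index with those byte offsets yields the buggy text fields:
sliced = [text[start:stop] for start, stop in byte_offsets]
print(byte_offsets)  # [(0, 3), (3, 6), (6, 9), (9, 12), (12, 15)]
print(sliced)        # ['北京天', '安门', '', '', '']
```

Note how the sliced strings match the `text` fields in the response above: 北京天, 安门, then three empty strings.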

Additional Context

  • transformers code showing correct behavior (the original snippet referenced an undefined text variable and shadowed the built-in id; fixed here):

    from transformers import AutoTokenizer

    text = "北京天安门"
    tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-zh-v1.5")
    result = tokenizer(text, return_offsets_mapping=True)
    for token_id, (start, stop) in zip(result["input_ids"], result["offset_mapping"]):
        print(f"{token_id} -> {text[start:stop]}")
  • Suspected issue: Incorrect handling of offset mappings or token-to-text alignment in the tokenization output formatting.
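
If the start/stop values are in fact UTF-8 byte offsets (my guess, see above), clients can work around the bug by mapping them back to character indices before slicing. A hypothetical helper, not part of text-embeddings-inference:

```python
# Illustrative client-side workaround (hypothetical helper, not part of
# text-embeddings-inference): convert UTF-8 byte offsets back to
# character indices so they can be used to slice a Python str.
def byte_to_char_offsets(text, offsets):
    # Table mapping UTF-8 byte position -> character index.
    byte_to_char = {}
    pos = 0
    for i, ch in enumerate(text):
        byte_to_char[pos] = i
        pos += len(ch.encode("utf-8"))
    byte_to_char[pos] = len(text)  # end-of-string sentinel
    return [(byte_to_char[s], byte_to_char[e]) for s, e in offsets]

text = "北京天安门"
print(byte_to_char_offsets(text, [(0, 3), (3, 6), (6, 9)]))
# [(0, 1), (1, 2), (2, 3)]
```

With the converted offsets, each token slices out a single character, matching the transformers output.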

Environment

  • text-embeddings-inference version: 1.6.0 (sha-f0e491a, per the System Info above)
  • Deployment method: Docker
  • Model: BAAI/bge-large-zh-v1.5

Let me know if you need further details to investigate this! 🙌
