Incorrect Tokenization Output for bge-large-zh-v1.5 Model #494

Open
3 of 4 tasks
gaohongkui opened this issue Feb 11, 2025 · 0 comments
System Info

{"model_id":"bge-large-zh-v1.5/","model_sha":null,"model_dtype":"float16","model_type":{"embedding":{"pooling":"cls"}},"max_concurrent_requests":512,"max_input_length":512,"max_batch_tokens":16384,"max_batch_requests":null,"max_client_batch_size":32,"auto_truncate":true,"tokenization_workers":10,"version":"1.6.0","sha":"f0e491a290385ef06f0871d188b21c0308ba86d6","docker_label":"sha-f0e491a"}

Description
When using the /tokenize endpoint with the bge-large-zh-v1.5 model deployed via text-embeddings-inference, the returned text, start, and stop fields do not align with the actual token IDs. This behavior differs from the results produced by the transformers library.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps to Reproduce

  1. Deploy BAAI/bge-large-zh-v1.5 using text-embeddings-inference.
  2. Send a tokenization request:
curl localhost/tokenize -X POST -d '{"inputs":"北京天安门"}' -H 'Content-Type: application/json'
  3. Observe the response:
[[
  {"id":101,"text":"[CLS]","special":true,"start":null,"stop":null},
  {"id":1266,"text":" 北京天","special":false,"start":0,"stop":3},
  {"id":776,"text":"安门","special":false,"start":3,"stop":6},
  {"id":1921,"text":"","special":false,"start":6,"stop":9},
  {"id":2128,"text":"","special":false,"start":9,"stop":12},
  {"id":7305,"text":"","special":false,"start":12,"stop":15},
  {"id":102,"text":"[SEP]","special":true,"start":null,"stop":null}
]]

Expected behavior

The correct tokenization (verified via transformers) should produce:

101 -> [CLS]
1266 -> 北
776 -> 京
1921 -> 天
2128 -> 安
7305 -> 门
102 -> [SEP]
  • Each token ID should map to a single character in the input text.
  • start/stop offsets should align with character boundaries (e.g., token 1266 (北) should span positions 0-1).

Actual Behavior

  • Token 1266 incorrectly maps to 北京天 (positions 0-3) instead of 北 (positions 0-1).
  • Tokens 1921, 2128, 7305 return empty text values despite valid IDs.
  • Offset positions (e.g., start: 6, stop: 9 for token 1921) do not match the expected single-character spans.
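The observed start/stop values (0-3, 3-6, 6-9, …) are exactly the UTF-8 byte spans of the input characters. A minimal sketch of my hypothesis (a guess, not confirmed against the text-embeddings-inference source): the tokenizer reports byte offsets, but the response slices the input by character index, which reproduces the buggy output exactly:

```python
# Hypothesis (not confirmed in the TEI source): the tokenizer reports
# *byte* offsets, but the /tokenize response slices the input string by
# *character* index. Each CJK character is 3 bytes in UTF-8, which
# reproduces the observed text/start/stop fields exactly.
text = "北京天安门"

# UTF-8 byte span of each character: (0,3), (3,6), (6,9), (9,12), (12,15)
byte_offsets = []
pos = 0
for ch in text:
    width = len(ch.encode("utf-8"))
    byte_offsets.append((pos, pos + width))
    pos += width

# Slicing by character index with those byte offsets yields the buggy text fields:
sliced = [text[start:stop] for start, stop in byte_offsets]
print(byte_offsets)  # [(0, 3), (3, 6), (6, 9), (9, 12), (12, 15)]
print(sliced)        # ['北京天', '安门', '', '', '']
```

Note how the sliced strings match the `text` fields in the response above: 北京天, 安门, then three empty strings.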

Additional Context

  • transformers code showing correct behavior (the original snippet referenced an undefined text variable and shadowed the built-in id; fixed here):

    from transformers import AutoTokenizer

    text = "北京天安门"
    tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-zh-v1.5")
    result = tokenizer(text, return_offsets_mapping=True)
    for token_id, (start, stop) in zip(result["input_ids"], result["offset_mapping"]):
        print(f"{token_id} -> {text[start:stop]}")
  • Suspected issue: Incorrect handling of offset mappings or token-to-text alignment in the tokenization output formatting.
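
If the start/stop values are in fact UTF-8 byte offsets (my guess, see above), clients can work around the bug by mapping them back to character indices before slicing. A hypothetical helper, not part of text-embeddings-inference:

```python
# Illustrative client-side workaround (hypothetical helper, not part of
# text-embeddings-inference): convert UTF-8 byte offsets back to
# character indices so they can be used to slice a Python str.
def byte_to_char_offsets(text, offsets):
    # Table mapping UTF-8 byte position -> character index.
    byte_to_char = {}
    pos = 0
    for i, ch in enumerate(text):
        byte_to_char[pos] = i
        pos += len(ch.encode("utf-8"))
    byte_to_char[pos] = len(text)  # end-of-string sentinel
    return [(byte_to_char[s], byte_to_char[e]) for s, e in offsets]

text = "北京天安门"
print(byte_to_char_offsets(text, [(0, 3), (3, 6), (6, 9)]))
# [(0, 1), (1, 2), (2, 3)]
```

With the converted offsets, each token slices out a single character, matching the transformers output.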

Environment

  • text-embeddings-inference version: 1.6.0 (sha-f0e491a, per the System Info above)
  • Deployment method: Docker
  • Model: BAAI/bge-large-zh-v1.5

Let me know if you need further details to investigate this! 🙌
