@HeyPhiS HeyPhiS commented Jan 9, 2026

What problem does this PR solve?

This PR guards the final slice in _split_by_lang so that it never runs past the end of the segment when e temporarily exceeds len(a) (e.g., when the last run sets e = s + 1). The upper bound is now clamped with min(e, len(a)), so the appended substring is always valid. The change is intended to avoid the occasional “list index out of range” crash the tokenizer was seeing.

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • Python SDK impacted, Need to update PyPI

@yingfeng yingfeng requested a review from KevinHuSh January 12, 2026 05:26
zh = _zh
if s >= len(a):
    continue
txt_lang_pairs.append((a[s:e], zh))

@KevinHuSh KevinHuSh Jan 12, 2026

Thanks!
I can't see any side effects if e exceeds the length of a:
a[s:min(e, len(a))] is equal to a[s:e].
