
fix: decouple collator template search from mask offset for BPE robustness#2248

Closed
lefft wants to merge 1 commit into main from tim/fix-collator-bpe-boundary-v2

Conversation

@lefft (Contributor) commented Mar 6, 2026

Summary

An alternative to #2247 for fixing the BPE token boundary mismatch in DataCollatorForCompletionOnlyLM that caused training job 630-7257 to produce loss=0 across all steps.

Same root-cause fix, but it avoids including the deterministic \n token in the loss computation.

Problem

When a chat template like <|im_start|>user\n is followed by message content starting with \n, the tokenizer merges the two newlines into a single BPE token (e.g. \n\n → token 271 on Qwen). The collator tokenizes the template standalone and gets the single-\n token (198), then searches for that token sequence in the training data — and never finds it. Result: all labels are set to ignore_index=-100, producing loss=0 for the entire training run.
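The failure mode can be reproduced without any tokenizer. The token IDs below are the ones quoted above for Qwen (151644/77091 for `<|im_start|>assistant`, 198 for `\n`, 271 for the merged `\n\n`), and `find_subsequence` is a hypothetical helper mirroring the subsequence search the collator performs:

```python
def find_subsequence(haystack, needle):
    """Return the index of the first occurrence of `needle` in `haystack`, or -1."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

# Template "<|im_start|>assistant\n" tokenized standalone (Qwen IDs from the PR text):
template_ids = [151644, 77091, 198]   # 198 = "\n"

# Training example whose content starts with "\n": BPE merged the two
# newlines into a single token 271 ("\n\n"), so 198 never appears.
merged_input = [151644, 77091, 271, 9707]

find_subsequence(merged_input, template_ids)  # -1 → every label set to -100
```

Because the search returns -1 for every example in the dataset, every label becomes `ignore_index` and the loss is identically zero.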

This isn't just a data quality issue — it can also happen at inference time if a user prompt starts with \n.

Approach: Decouple search from masking

Instead of using the same token IDs for both searching and masking, this PR decouples them:

| Operation | Token IDs used | Example (`<|im_start|>assistant\n`) |
| --- | --- | --- |
| Search (find template position) | Stripped: `tokenizer.encode("...assistant")` → `[151644, 77091]` | Robust — matches regardless of what follows |
| Mask offset (how many tokens to exclude from loss) | Original: `len(tokenizer.encode("...assistant\n"))` → 3 | Preserves `\n` exclusion from loss |
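Both quantities can be computed once when the collator is constructed. A minimal sketch, assuming a Hugging Face-style `encode` interface; `build_search_and_offset` and `ToyTokenizer` are hypothetical names for illustration, with the toy vocabulary hard-coded from the IDs quoted in this PR:

```python
def build_search_and_offset(tokenizer, template):
    # Search IDs: template with trailing newlines stripped, so the search
    # is immune to BPE merges across the template/content boundary.
    search_ids = tokenizer.encode(template.rstrip("\n"), add_special_tokens=False)
    # Mask offset: token count of the *original* template, so the trailing
    # \n (merged or not) is still excluded from the loss.
    mask_offset = len(tokenizer.encode(template, add_special_tokens=False))
    return search_ids, mask_offset

class ToyTokenizer:
    """Stand-in for a real tokenizer, mapping the two strings this sketch needs."""
    VOCAB = {
        "<|im_start|>assistant": [151644, 77091],
        "<|im_start|>assistant\n": [151644, 77091, 198],
    }
    def encode(self, text, add_special_tokens=False):
        return list(self.VOCAB[text])

search_ids, mask_offset = build_search_and_offset(ToyTokenizer(), "<|im_start|>assistant\n")
# search_ids = [151644, 77091], mask_offset = 3
```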

How it works in practice

Clean data (content doesn't start with \n):

```
Tokens: [..., 151644, 77091, 198,  9707, ...]
              ^^^^^^^^^^^^^^  ^^^  ^^^^
              search match    \n   "Hello" ← first token in loss
```

Search finds `[151644, 77091]` at position i.
Mask i → i+3: covers `[151644, 77091, 198]` — same as current behavior ✓

Messy data (content starts with \n):

```
Tokens: [..., 151644, 77091, 271,  9707, ...]
              ^^^^^^^^^^^^^^  ^^^  ^^^^
              search match    \n\n "Hello" ← first token in loss
              (BPE merged)
```

Search finds `[151644, 77091]` at position i.
Mask i → i+3: covers `[151644, 77091, 271]` — masks the merged token ✓

In the merged case, token 271 contains both the template's \n and the content's leading \n. Since you can't split a BPE token, masking it is correct — the template \n should be masked, and the content's leading \n is whitespace noise.
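Both cases above reduce to the same masking pass. A hypothetical sketch (not the actual trl implementation) of applying the decoupled search IDs and mask offset, using the Qwen IDs from the examples:

```python
IGNORE_INDEX = -100

def mask_non_completion(input_ids, search_ids, mask_offset):
    """Label everything up to and including the response template with -100.

    Sketch of the decoupled approach: search with `search_ids` (the stripped
    template), then mask `mask_offset` tokens (the original template length).
    """
    labels = list(input_ids)
    for i in range(len(input_ids) - len(search_ids) + 1):
        if input_ids[i:i + len(search_ids)] == search_ids:
            for j in range(i + mask_offset):
                labels[j] = IGNORE_INDEX
            return labels
    # Template genuinely absent: mask the whole example, as the collator does today.
    return [IGNORE_INDEX] * len(input_ids)

search_ids, mask_offset = [151644, 77091], 3

# Clean data: 198 ("\n") follows the template.
mask_non_completion([151644, 77091, 198, 9707], search_ids, mask_offset)
# → [-100, -100, -100, 9707]

# Messy data: BPE merged the newlines into 271 ("\n\n").
mask_non_completion([151644, 77091, 271, 9707], search_ids, mask_offset)
# → [-100, -100, -100, 9707]
```

In both cases the first content token (9707, "Hello") is the first token contributing to the loss, matching the diagrams above.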

Why not just fix the data?

  • This is a self-serve fine-tuning platform — silently modifying customer data (stripping whitespace) is not appropriate
  • Leading \n can also appear at inference time in user prompts
  • The collator should be robust to content that's valid but has unusual whitespace

Test plan

  • All 14 existing collator tests pass
  • Manual verification with Qwen3 tokenizer against the failing dataset
  • Verify no loss skew compared to clean-data baseline

Related

🤖 Generated with Claude Code

… merges

Strip trailing \n from instruction/response templates before tokenizing
for subsequence search, while preserving the original template token
count for masking offsets. This prevents BPE merge mismatches (e.g.
\n\n merging into a single token on Qwen/Llama tokenizers) from causing
the collator to silently skip 100% of training examples, without
including deterministic \n tokens in the loss computation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oelachqar oelachqar closed this Mar 10, 2026
