
fix: decouple collator template search from mask offset for BPE robustness#2248

Closed
lefft wants to merge 1 commit into main from tim/fix-collator-bpe-boundary-v2

Conversation

@lefft (Contributor) commented Mar 6, 2026

Summary

An alternative to #2247 for fixing the BPE token boundary mismatch in DataCollatorForCompletionOnlyLM that caused training job 630-7257 to produce loss=0 across all steps.

Same root-cause fix, but it avoids including the deterministic \n token in the loss computation.

Problem

When a chat template like <|im_start|>user\n is followed by message content starting with \n, the tokenizer merges the two newlines into a single BPE token (e.g. \n\n → token 271 on Qwen). The collator tokenizes the template standalone and gets the single-\n token (198), then searches for that token sequence in the training data — and never finds it. Result: all labels are set to ignore_index=-100, producing loss=0 for the entire training run.
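The failure mode can be reproduced without any tokenizer. The token IDs below are the ones quoted above for Qwen (151644/77091 for `<|im_start|>assistant`, 198 for `\n`, 271 for the merged `\n\n`), and `find_subsequence` is a hypothetical helper mirroring the subsequence search the collator performs:

```python
def find_subsequence(haystack, needle):
    """Return the index of the first occurrence of `needle` in `haystack`, or -1."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

# Template "<|im_start|>assistant\n" tokenized standalone (Qwen IDs from the PR text):
template_ids = [151644, 77091, 198]   # 198 = "\n"

# Training example whose content starts with "\n": BPE merged the two
# newlines into a single token 271 ("\n\n"), so 198 never appears.
merged_input = [151644, 77091, 271, 9707]

find_subsequence(merged_input, template_ids)  # -1 → every label set to -100
```

Because the search returns -1 for every example in the dataset, every label becomes `ignore_index` and the loss is identically zero.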

This isn't just a data quality issue — it can also happen at inference time if a user prompt starts with \n.

Approach: Decouple search from masking

Instead of using the same token IDs for both searching and masking, this PR decouples them:

| Operation | Token IDs used | Example (`<|im_start|>assistant\n`) |
| --- | --- | --- |
| Search (find template position) | Stripped: `tokenizer.encode("...assistant")` → `[151644, 77091]` | Robust — matches regardless of what follows |
| Mask offset (how many tokens to exclude from loss) | Original: `len(tokenizer.encode("...assistant\n"))` → 3 | Preserves `\n` exclusion from loss |
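Both quantities can be computed once when the collator is constructed. A minimal sketch, assuming a Hugging Face-style `encode` interface; `build_search_and_offset` and `ToyTokenizer` are hypothetical names for illustration, with the toy vocabulary hard-coded from the IDs quoted in this PR:

```python
def build_search_and_offset(tokenizer, template):
    # Search IDs: template with trailing newlines stripped, so the search
    # is immune to BPE merges across the template/content boundary.
    search_ids = tokenizer.encode(template.rstrip("\n"), add_special_tokens=False)
    # Mask offset: token count of the *original* template, so the trailing
    # \n (merged or not) is still excluded from the loss.
    mask_offset = len(tokenizer.encode(template, add_special_tokens=False))
    return search_ids, mask_offset

class ToyTokenizer:
    """Stand-in for a real tokenizer, mapping the two strings this sketch needs."""
    VOCAB = {
        "<|im_start|>assistant": [151644, 77091],
        "<|im_start|>assistant\n": [151644, 77091, 198],
    }
    def encode(self, text, add_special_tokens=False):
        return list(self.VOCAB[text])

search_ids, mask_offset = build_search_and_offset(ToyTokenizer(), "<|im_start|>assistant\n")
# search_ids = [151644, 77091], mask_offset = 3
```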

How it works in practice

Clean data (content doesn't start with \n):

```
Tokens: [..., 151644, 77091, 198,  9707, ...]
              ^^^^^^^^^^^^^^  ^^^  ^^^^
              search match    \n   "Hello" ← first token in loss
```

Search finds `[151644, 77091]` at position i.
Mask i → i+3: covers `[151644, 77091, 198]` — same as current behavior ✓

Messy data (content starts with \n):

```
Tokens: [..., 151644, 77091, 271,  9707, ...]
              ^^^^^^^^^^^^^^  ^^^  ^^^^
              search match    \n\n "Hello" ← first token in loss
              (BPE merged)
```

Search finds `[151644, 77091]` at position i.
Mask i → i+3: covers `[151644, 77091, 271]` — masks the merged token ✓

In the merged case, token 271 contains both the template's \n and the content's leading \n. Since you can't split a BPE token, masking it is correct — the template \n should be masked, and the content's leading \n is whitespace noise.
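Both cases above reduce to the same masking pass. A hypothetical sketch (not the actual trl implementation) of applying the decoupled search IDs and mask offset, using the Qwen IDs from the examples:

```python
IGNORE_INDEX = -100

def mask_non_completion(input_ids, search_ids, mask_offset):
    """Label everything up to and including the response template with -100.

    Sketch of the decoupled approach: search with `search_ids` (the stripped
    template), then mask `mask_offset` tokens (the original template length).
    """
    labels = list(input_ids)
    for i in range(len(input_ids) - len(search_ids) + 1):
        if input_ids[i:i + len(search_ids)] == search_ids:
            for j in range(i + mask_offset):
                labels[j] = IGNORE_INDEX
            return labels
    # Template genuinely absent: mask the whole example, as the collator does today.
    return [IGNORE_INDEX] * len(input_ids)

search_ids, mask_offset = [151644, 77091], 3

# Clean data: 198 ("\n") follows the template.
mask_non_completion([151644, 77091, 198, 9707], search_ids, mask_offset)
# → [-100, -100, -100, 9707]

# Messy data: BPE merged the newlines into 271 ("\n\n").
mask_non_completion([151644, 77091, 271, 9707], search_ids, mask_offset)
# → [-100, -100, -100, 9707]
```

In both cases the first content token (9707, "Hello") is the first token contributing to the loss, matching the diagrams above.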

Why not just fix the data?

  • This is a self-serve fine-tuning platform — silently modifying customer data (stripping whitespace) is not appropriate
  • Leading \n can also appear at inference time in user prompts
  • The collator should be robust to content that's valid but has unusual whitespace

Test plan

  • All 14 existing collator tests pass
  • Manual verification with Qwen3 tokenizer against the failing dataset
  • Verify no loss skew compared to clean-data baseline

Related

🤖 Generated with Claude Code

… merges

Strip trailing \n from instruction/response templates before tokenizing
for subsequence search, while preserving the original template token
count for masking offsets. This prevents BPE merge mismatches (e.g.
\n\n merging into a single token on Qwen/Llama tokenizers) from causing
the collator to silently skip 100% of training examples, without
including deterministic \n tokens in the loss computation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oelachqar oelachqar closed this Mar 10, 2026
