fix: strip trailing newlines from collator templates before tokenizing #2247
Fixes a silent training failure where `loss=0` for the entire run when dataset message content starts with `\n`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Problem
A customer training job (job 630-7257, Qwen3-4B-Instruct) completed with `loss=0.0` for the entire run: 62 steps across 2 epochs, zero gradient updates, and a saved checkpoint identical to the base model.
The `DataCollatorForCompletionOnlyLM` skipped every single example (973/973) because it could not find the `instruction_template` token IDs in the tokenized training data.

## Root Cause
The collator's standalone-tokenized template and the same template as it appears inside the full chat-template-tokenized sequence can have different BPE token boundaries, so the standalone token IDs never occur as a subsequence of the full sequence.
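As a toy illustration of the boundary problem, here is a greedy stand-in tokenizer with a single newline merge rule. The ids follow the Qwen3 ids quoted in this PR, but this is an illustration, not the real Qwen3 merge table:

```python
MERGES = [  # longest pieces first so "\n\n" wins over "\n"
    ("<|im_start|>", 151644),
    ("user", 872),
    ("\n\n", 271),
    ("\n", 198),
]

def toy_bpe(text):
    """Greedy longest-match toy tokenizer with one newline merge rule."""
    ids, i = [], 0
    while i < len(text):
        for piece, tid in MERGES:
            if text.startswith(piece, i):
                ids.append(tid)
                i += len(piece)
                break
        else:
            raise ValueError(f"no token matches at offset {i}")
    return ids

# Same template prefix, different ids depending on what follows it:
print(toy_bpe("<|im_start|>user\n"))    # [151644, 872, 198]
print(toy_bpe("<|im_start|>user\n\n"))  # [151644, 872, 271], the 198 is gone
```

The last call shows the failure mode: once the content contributes a leading `\n`, the template's trailing `\n` is absorbed into a different token and the standalone ids no longer appear.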
### Step by step

1. **Collator init** tokenizes `"<|im_start|>user\n"` standalone → `[151644, 872, 198]` (where `198` = `\n`).
2. **Chat template** renders each message as `<|im_start|>{role}\n{content}<|im_end|>`. The customer's dataset had message content starting with `\n` (e.g. `"\nPlease classify..."`), so the rendered text became `<|im_start|>user\n\nPlease classify...<|im_end|>`.
3. **BPE merge**: the Qwen3 tokenizer has a merge rule for the double newline, `\n\n` → token `271`, so the full sequence contains `[151644, 872, 271, ...]`, not `[..., 198]`.
4. **Subsequence search fails**: the collator searches for `[151644, 872, 198]` in the token IDs, never finds it, sets all labels to `ignore_index=-100`, and the loss becomes 0.

## Verified experimentally
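The failing search can be reproduced in miniature without loading a real tokenizer, using the ids from the steps above (the content ids after `271` are illustrative placeholders):

```python
def find_subsequence(haystack, needle):
    """Index of the first occurrence of needle in haystack, or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1

# What the collator searches for: "<|im_start|>user\n" tokenized standalone.
instruction_ids = [151644, 872, 198]

# What the chat template actually produced when content began with "\n":
# the template's trailing newline merged with the content's leading newline
# into token 271.
input_ids = [151644, 872, 271, 5501, 49229, 151645]

print(find_subsequence(input_ids, instruction_ids))  # -1: never found, so all labels become -100
print(find_subsequence(input_ids, [151644, 872]))    # 0: the stripped template is found
```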
This is not Qwen-specific: most modern BPE tokenizers (Llama, GPT, etc.) have similar merge rules for common character sequences like `\n\n`. Any dataset with a leading `\n` in message content will trigger this.

## Fix
Strip the trailing `\n` from template strings before tokenizing them in the collator's `__init__`. This produces shorter, unambiguous token sequences (e.g. `[151644, 872]` for `<|im_start|>user`) that are found regardless of what follows in the content.

- Before: `tokenizer.encode("<|im_start|>user\n")` → `[151644, 872, 198]`: fragile, breaks whenever content starts with `\n`.
- After: `tokenizer.encode("<|im_start|>user")` → `[151644, 872]`: robust, since `<|im_start|>` is a special token (always `151644`) and role names are unambiguous after it.

## No false positives
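That the shorter stripped sequence matches only at turn starts can be checked on a sample two-turn sequence; the special-token and role ids are the Qwen3 ids quoted above, and the content ids in between are illustrative placeholders:

```python
def find_all(haystack, needle):
    """All start indices where needle occurs in haystack."""
    n = len(needle)
    return [i for i in range(len(haystack) - n + 1) if haystack[i:i + n] == needle]

# <|im_start|>user \n\n ...content... <|im_end|>
# <|im_start|>assistant \n ...content... <|im_end|>
input_ids = [
    151644, 872, 271, 5501, 151645,
    151644, 77091, 198, 9707, 151645,
]

print(find_all(input_ids, [151644, 872]))    # [0]: exactly one match, at the user turn
print(find_all(input_ids, [151644, 77091]))  # [5]: the assistant template matches its turn only
```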
`<|im_start|>` (token `151644`) only ever appears at the start of a chat turn and is always followed by a role-name token. The shorter two-token sequence `[151644, 872]` (`<|im_start|>user`) cannot match anywhere else in a well-formed chat-template-tokenized sequence. Verified against the actual problematic dataset: the search finds exactly the correct positions with no false matches.

## Existing tests pass
All 14 tests in `test_text_completions_collator_with_padding.py` and `test_collators.py` continue to pass.

## Changed file
`src/oumi/core/collators/trl_data_collator_for_completion_only_lm.py`: 2 lines changed (`.rstrip("\n")` added to both `instruction_template` and `response_template` before `tokenizer.encode`).
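The described change can be sketched as follows. `FakeTokenizer` and `encode_template` are hypothetical stand-ins for illustration, not the actual oumi source; the vocabulary ids are the ones quoted above:

```python
class FakeTokenizer:
    """Stand-in for a real tokenizer; maps a few known strings to ids."""
    _vocab = {
        "<|im_start|>user": [151644, 872],
        "<|im_start|>user\n": [151644, 872, 198],
        "<|im_start|>assistant": [151644, 77091],
        "<|im_start|>assistant\n": [151644, 77091, 198],
    }

    def encode(self, text, add_special_tokens=False):
        return list(self._vocab[text])

def encode_template(tokenizer, template):
    # The fix: strip the trailing "\n" so the encoded template cannot
    # straddle a BPE merge with the content that follows it.
    return tokenizer.encode(template.rstrip("\n"), add_special_tokens=False)

tok = FakeTokenizer()
print(encode_template(tok, "<|im_start|>user\n"))       # [151644, 872]
print(encode_template(tok, "<|im_start|>assistant\n"))  # [151644, 77091]
```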