fix(normalize): handle UTF-8 BOM in transcript files #1102
Open
arnoldwender wants to merge 1 commit into MemPalace:develop from
Windows exports of Claude Code JSONL sessions prepend a UTF-8 BOM (\xef\xbb\xbf). With encoding='utf-8', json.loads() raises JSONDecodeError on the first line, _try_claude_code_jsonl silently skips every line, and the file falls through as raw text — losing all structured message content. utf-8-sig strips the BOM transparently and is backward-compatible with BOM-free files on all platforms.
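The codec difference described above can be checked on raw bytes. This is a minimal sketch using only the standard library; the one-line transcript record is a made-up example, not taken from a real Claude Code export.

```python
import json

# Hypothetical single-record transcript with a UTF-8 BOM prepended (illustrative data).
raw = b"\xef\xbb\xbf" + b'{"type": "user", "message": "hi"}\n'

text_plain = raw.decode("utf-8")    # the BOM survives as the character U+FEFF
text_sig = raw.decode("utf-8-sig")  # the BOM is stripped during decoding

# json.loads() rejects a str that starts with U+FEFF.
try:
    json.loads(text_plain)
    plain_ok = True
except json.JSONDecodeError:
    plain_ok = False

# The same bytes parse cleanly once the BOM has been stripped.
record = json.loads(text_sig)

# utf-8-sig leaves BOM-free input unchanged.
bom_free = b'{"a": 1}'.decode("utf-8-sig")
```

Because utf-8-sig only removes a BOM when one is present, using it for every read does not require detecting the file's origin first.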
What and Why
normalize() opens transcript files with encoding='utf-8'. When a file has a UTF-8 BOM prefix (\xef\xbb\xbf), which is common in Windows exports of Claude Code JSONL sessions, json.loads() raises JSONDecodeError on the first line because that line now starts with the BOM character instead of {. _try_claude_code_jsonl then silently skips every line, and the file falls through as raw unstructured text, discarding all structured message content.
Root Cause
normalize.py:124: encoding="utf-8" does not strip the BOM; encoding="utf-8-sig" does.
Fix
Python's utf-8-sig codec strips the BOM on read and is fully backward-compatible with BOM-free files. No behavioral change on Linux/macOS files.
Reproduction
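A minimal file-based sketch of the failure, assuming a tolerant JSONL reader that skips lines it cannot parse. parse_jsonl and write_sample are hypothetical helpers written for this illustration; they are not the project's normalize() or _try_claude_code_jsonl, and the sample records are invented.

```python
import json
import os
import tempfile


def parse_jsonl(path, encoding):
    """Hypothetical tolerant reader: return parsed records, skipping bad lines."""
    records = []
    with open(path, encoding=encoding) as fh:
        for line in fh:
            line = line.strip()  # U+FEFF is not whitespace, so a BOM survives this
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # silently skip, as a forgiving parser might
    return records


def write_sample(with_bom):
    """Write a two-record JSONL file, optionally prefixed with a UTF-8 BOM."""
    lines = [
        '{"type": "user", "text": "hello"}',
        '{"type": "assistant", "text": "hi"}',
    ]
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    with os.fdopen(fd, "wb") as fh:
        if with_bom:
            fh.write(b"\xef\xbb\xbf")
        fh.write("\n".join(lines).encode("utf-8") + b"\n")
    return path


bom_path = write_sample(with_bom=True)
clean_path = write_sample(with_bom=False)
try:
    bom_utf8 = parse_jsonl(bom_path, "utf-8")       # first record is lost
    bom_sig = parse_jsonl(bom_path, "utf-8-sig")    # both records parse
    clean_utf8 = parse_jsonl(clean_path, "utf-8")
    clean_sig = parse_jsonl(clean_path, "utf-8-sig")
finally:
    os.remove(bom_path)
    os.remove(clean_path)

print(len(bom_utf8), len(bom_sig), len(clean_utf8), len(clean_sig))
```

In this sketch only the BOM-prefixed first record is dropped under encoding="utf-8"; how much the real parser discards depends on its own fallback logic. With utf-8-sig, the BOM file and the BOM-free file yield identical records.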
Tests
All 107 test_normalize.py tests pass. All 1066 tests pass.
Closes #1034 (partial: encoding fix for normalize.py; other files addressed in separate PRs)