fix(normalize): handle UTF-8 BOM in transcript files #1102

Open
arnoldwender wants to merge 1 commit into MemPalace:develop from arnoldwender:fix/normalize-utf8-bom
What and Why

normalize() opens transcript files with encoding='utf-8'. When a file has a UTF-8 BOM prefix (\xef\xbb\xbf), which is common in Windows exports of Claude Code JSONL sessions, the decoded text begins with U+FEFF instead of {, so json.loads() raises JSONDecodeError on the first line. _try_claude_code_jsonl silently skips every line and the file falls through as raw unstructured text, discarding all structured message content.

Root Cause

normalize.py:124: encoding="utf-8" does not strip the BOM; encoding="utf-8-sig" does.

Fix

# Before
with open(filepath, "r", encoding="utf-8", errors="replace") as f:

# After
with open(filepath, "r", encoding="utf-8-sig", errors="replace") as f:

Python's utf-8-sig codec strips a leading BOM on read and decodes BOM-free files exactly as utf-8 does, so the change is backward-compatible. There is no behavioral change for BOM-free files, which is the usual case on Linux/macOS.
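The codec difference can be checked in isolation, without mempalace. The following is a minimal sketch using only the standard library; the helper name first_line_parses is hypothetical and is not part of normalize.py. It decodes the same bytes with both codecs and reports whether the first line is valid JSON.

```python
import json

BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark
LINE = b'{"type":"human","message":{"content":"hello"}}\n'


def first_line_parses(raw: bytes, encoding: str) -> bool:
    """Hypothetical helper: decode raw bytes and try to parse the first line as JSON."""
    text = raw.decode(encoding, errors="replace")
    try:
        json.loads(text.splitlines()[0])
        return True
    except json.JSONDecodeError:
        return False


# With plain utf-8 the BOM survives decoding as U+FEFF and json.loads rejects it.
print(first_line_parses(BOM + LINE, "utf-8"))
# With utf-8-sig the BOM is stripped before parsing.
print(first_line_parses(BOM + LINE, "utf-8-sig"))
# BOM-free input is handled by utf-8-sig as well.
print(first_line_parses(LINE, "utf-8-sig"))
```

Only the leading BOM is removed by utf-8-sig; the rest of the byte stream is decoded as ordinary UTF-8, which is why the one-line change at the open() call should be enough.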

Reproduction

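import os
import tempfile

from mempalace.normalize import normalize

# Write a BOM-prefixed JSONL file (as produced by a Windows Claude Code export).
# NamedTemporaryFile is used instead of the deprecated, race-prone tempfile.mktemp().
with tempfile.NamedTemporaryFile(suffix=".jsonl", delete=False) as f:
    f.write(b"\xef\xbb\xbf")  # UTF-8 BOM
    f.write(b'{"type":"human","message":{"content":"hello"}}\n')
    bom_file = f.name

try:
    result = normalize(bom_file)
    # Before the fix: returns the raw JSON lines, not a transcript
    # After the fix: returns a properly formatted transcript containing "hello"
finally:
    os.remove(bom_file)  # clean up the temporary file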

Tests

All 107 test_normalize.py tests pass. All 1066 tests pass.

Closes #1034 (partial — encoding fix for normalize.py; other files addressed in separate PRs)

Windows exports of Claude Code JSONL sessions prepend a UTF-8 BOM
(\xef\xbb\xbf). With encoding='utf-8', json.loads() raises JSONDecodeError
on the first line, _try_claude_code_jsonl silently skips every line, and
the file falls through as raw text — losing all structured message content.

utf-8-sig strips the BOM transparently and is backward-compatible with
BOM-free files on all platforms.
Development

Successfully merging this pull request may close these issues.

[Windows] UnicodeEncodeError when running mempalace mine on GBK console