fix(normalize): handle UTF-8 BOM in transcript files by arnoldwender · Pull Request #1102 · MemPalace/mempalace

arnoldwender · 2026-04-22T10:18:38Z

What and Why

normalize() opens transcript files with encoding='utf-8'. When a file has a UTF-8 BOM prefix (\xef\xbb\xbf), which is common in Windows exports of Claude Code JSONL sessions, the BOM is decoded as U+FEFF and stays at the start of the text. json.loads() then raises JSONDecodeError on the first line because it starts with U+FEFF rather than {. _try_claude_code_jsonl silently skips lines that fail to parse, and the file falls through as raw unstructured text, discarding all structured message content.
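The failure mode can be reproduced without mempalace. The sketch below is only an illustration: parse_jsonl is a hypothetical stand-in for a tolerant JSONL parser, not the real _try_claude_code_jsonl, and it assumes that bad lines are skipped silently as described above.

```python
import json

# One BOM-prefixed JSONL line, as raw bytes on disk.
raw = b'\xef\xbb\xbf{"type":"human","message":{"content":"hello"}}\n'


def parse_jsonl(text):
    """Hypothetical stand-in for a tolerant JSONL parser: skips bad lines."""
    records = []
    for line in text.splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # dropped silently, so the caller never sees the error
    return records


as_utf8 = raw.decode("utf-8")          # BOM survives as U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")  # leading BOM is stripped

print(as_utf8.startswith("\ufeff"))    # True
print(len(parse_jsonl(as_utf8)))       # 0
print(len(parse_jsonl(as_utf8_sig)))   # 1
```

Because the exception is swallowed, the only visible symptom is an empty result, which is why the file ends up treated as raw text.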

Root Cause

normalize.py:124 — encoding="utf-8" does not strip the BOM; encoding="utf-8-sig" does.

Fix

# Before
with open(filepath, "r", encoding="utf-8", errors="replace") as f:

# After
with open(filepath, "r", encoding="utf-8-sig", errors="replace") as f:

Python's utf-8-sig codec strips a leading BOM on read and decodes BOM-free files exactly as utf-8 does, so the change is backward-compatible. No behavioral change for BOM-free files, which is the usual case on Linux/macOS.
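The compatibility claim can be checked with a small standalone sketch. read_text is a hypothetical helper written for this example and is not part of mempalace; it mirrors the open() call above, including errors="replace".

```python
import os
import tempfile

LINE = b'{"type":"human","message":{"content":"hello"}}\n'


def read_text(data, encoding):
    """Write bytes to a temp file and read them back with the given encoding."""
    with tempfile.NamedTemporaryFile(suffix=".jsonl", delete=False) as f:
        f.write(data)
        path = f.name
    try:
        with open(path, "r", encoding=encoding, errors="replace") as f:
            return f.read()
    finally:
        os.remove(path)


with_bom = read_text(b"\xef\xbb\xbf" + LINE, "utf-8-sig")
without_bom = read_text(LINE, "utf-8-sig")
legacy = read_text(b"\xef\xbb\xbf" + LINE, "utf-8")

print(with_bom == without_bom)      # True: both decode to the same text
print(legacy.startswith("\ufeff"))  # True: plain utf-8 keeps the BOM
```

Only a BOM at the very start of the file is removed; utf-8-sig leaves the rest of the content untouched.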

Reproduction

import os
import tempfile

from mempalace.normalize import normalize

# BOM-prefixed JSONL (Windows Claude Code export)
# NamedTemporaryFile replaces the deprecated, race-prone tempfile.mktemp
with tempfile.NamedTemporaryFile(suffix=".jsonl", delete=False) as f:
    f.write(b"\xef\xbb\xbf")  # UTF-8 BOM
    f.write(b'{"type":"human","message":{"content":"hello"}}\n')
    bom_file = f.name

result = normalize(bom_file)
# Before fix: returns raw JSON lines, not a transcript
# After fix: returns properly formatted transcript with "hello"

os.remove(bom_file)  # clean up the temporary file

Tests

All 107 tests in test_normalize.py pass, as does the full suite (1066 tests).

Closes #1034 (partial — encoding fix for normalize.py; other files addressed in separate PRs)

Windows exports of Claude Code JSONL sessions prepend a UTF-8 BOM (\xef\xbb\xbf). With encoding='utf-8', json.loads() raises JSONDecodeError on the first line, _try_claude_code_jsonl silently skips every line, and the file falls through as raw text — losing all structured message content. utf-8-sig strips the BOM transparently and is backward-compatible with BOM-free files on all platforms.

arnoldwender requested review from bensig and milla-jovovich as code owners April 22, 2026 10:18

arnoldwender mentioned this pull request Apr 22, 2026

fix(encoding): replace non-ASCII symbols in CLI output (#1034) #1104

Open

arnoldwender wants to merge 1 commit into MemPalace:develop from arnoldwender:fix/normalize-utf8-bom
