merge: upstream v3.3.2 by Ordinath · Pull Request #1116 · MemPalace/mempalace

Ordinath · 2026-04-22T21:03:01Z

Summary

Merges upstream milla-jovovich/mempalace main (v3.3.2) into this fork: 286 commits, covering releases v3.2.0 through v3.3.2.

- Resolved 7 merge conflicts by taking upstream; our custom fixes (sanitize_content, chromadb pin, regex dep, languages module) are all superseded by upstream implementations
- Removed orphaned mempalace/languages/ module (replaced by upstream's mempalace/i18n/ JSON-based per-language patterns with full Russian support)
- Adapted fork tests for upstream's i18n API (pass languages=("ru",) to entity detection functions)
- Updated chromadb dependency from >=0.5.0 to >=1.5.4,<2

Key upstream additions:

- i18n entity detection (13 languages including Russian)
- HNSW quarantine recovery
- PID file guards
- Plugin specs (source adapter, storage backend)
- Sweeper, exporter, fact-checker modules
- VitePress documentation site

Test plan

- Full test suite passes (1052 passed, 0 failed)
- Palace data integrity verified (search + KG queries work against live data)
- Editable install with chromadb 1.5.8 succeeds

Replace Python's `re` module with `regex` for Unicode category support (\p{Lu}, \p{Ll}, \p{L}), making entity detection and AAAK dialect compression work with any script (Cyrillic, Greek, Arabic, CJK, etc.) without maintaining explicit character ranges.

Changes:
- entity_detector.py: Unicode-aware candidate extraction regex, Russian person/project verb patterns, pronouns, stopwords
- dialect.py: Unicode-aware topic extraction and entity detection, Russian emotion signals, flag signals, stop words, decision words
- Add `regex>=2024.0.0` dependency (drop-in `re` replacement)
- Add 19 tests covering Cyrillic entity detection, mixed-language text, emotion/flag detection, topic extraction, stopword filtering

Adding a new language requires only appending keyword blocks; the regex patterns are universal via Unicode categories.
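To illustrate what the Unicode categories in this commit select, here is a minimal sketch that approximates the shape `\p{Lu}\p{Ll}+` (one uppercase letter followed by lowercase letters) using only the standard library's `unicodedata`. It is not the code from entity_detector.py, which uses the third-party `regex` package; the function name `capitalized_words` and the punctuation set are assumptions made for this example.

```python
import unicodedata

# Punctuation stripped from token edges (an assumption for this sketch).
PUNCTUATION = ".,!?;:\"'()"


def capitalized_words(text):
    # Approximates the pattern \p{Lu}\p{Ll}+ without the `regex` package:
    # the first character must be in category Lu, all others in Ll.
    found = []
    for raw in text.split():
        word = raw.strip(PUNCTUATION)
        if len(word) < 2:
            continue
        if unicodedata.category(word[0]) != "Lu":
            continue
        if all(unicodedata.category(ch) == "Ll" for ch in word[1:]):
            found.append(word)
    return found


print(capitalized_words("Вчера Мария met Alice in Berlin."))
```

An ASCII-only character range such as `[A-Z][a-z]+` would find only `Alice` and `Berlin` in the same sentence, which is the limitation the category-based patterns avoid.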

Move all language-specific patterns, stopwords, and keyword signals from entity_detector.py and dialect.py into mempalace/languages/ package. Each language (en.py, ru.py) is a self-contained module exporting a standard set of constants. Adding a new language = one file + one import.
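The "one file + one import" layout could look roughly like the sketch below. The two `SimpleNamespace` objects are hypothetical stand-ins for en.py and ru.py, the word lists are placeholders, and the `LANGUAGES` registry and `merged` helper are assumptions for illustration rather than the fork's actual API; only the constant names come from the commit messages.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for mempalace/languages/en.py and ru.py.
# Each "module" exports the same set of constant names.
en = SimpleNamespace(
    ENTITY_STOPWORDS={"the", "and"},
    DECISION_WORDS={"decided", "chose"},
)
ru = SimpleNamespace(
    ENTITY_STOPWORDS={"это", "как"},
    DECISION_WORDS={"решил", "выбрал"},
)

# Registering a new language is one more entry here (the "one import").
LANGUAGES = {"en": en, "ru": ru}


def merged(constant_name, codes=("en", "ru")):
    """Union one constant across the selected language modules."""
    result = set()
    for code in codes:
        result |= getattr(LANGUAGES[code], constant_name)
    return result


print(sorted(merged("DECISION_WORDS")))
```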

- PERSON_VERB_PATTERNS: 13 -> 33 patterns (perception, communication, thought, reflexive verbs, formal address)
- ENTITY_STOPWORDS: 80 -> 150+ words (nouns, connectors, adjectives, profanity that appears capitalized at sentence starts)
- PRONOUN_PATTERNS: 9 -> 17 (added prepositional forms: ней, нём, них, него, неё, нему, ним + ё/е variant handling)
- EMOTION_SIGNALS: 21 -> 40 (pride, shame, exhaustion, inspiration, doubt, despair, peace + profanity-based emotions)
- FLAG_SIGNALS: 31 -> 52 (more DECISION/ORIGIN/CORE/PIVOT/TECHNICAL)
- TOPIC_STOPWORDS: 60 -> 90+ (verbs, adjectives, profanity)
- DECISION_WORDS: 16 -> 25
- PROJECT_VERB_PATTERNS: 9 -> 17

Fixed рассмеялс[яь] regex to рассмеял[аи]?с[яь] (was missing feminine/plural forms). Added ё/е variant handling throughout.
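The effect of the рассмеялс[яь] fix can be checked in isolation. The sketch below uses the two patterns exactly as quoted in the commit message; the list of test forms and the helper name `matching_forms` are assumptions added for this example.

```python
import re

# Patterns as quoted in the commit message.
OLD = re.compile(r"рассмеялс[яь]")
NEW = re.compile(r"рассмеял[аи]?с[яь]")

# Masculine, feminine, and plural past-tense forms of the same verb.
FORMS = ["рассмеялся", "рассмеялась", "рассмеялись"]


def matching_forms(pattern):
    """Return the forms that the pattern matches in full."""
    return [form for form in FORMS if pattern.fullmatch(form)]


print(matching_forms(OLD))
print(matching_forms(NEW))
```

The old pattern accepts only the masculine form, because it has no slot for the а or и that precedes the reflexive suffix in the feminine and plural forms.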

- Pin regex==2026.4.4 (supply chain safety)
- Fix misleading "language-neutral" comment on DIALOGUE_PATTERNS
- Fix weak assertion in test_key_sentence_russian (always passed)
- Add 5 new tests: profanity emotions, profanity stopwords, new person verb patterns, new flag signals, prepositional pronouns
- Total: 33 tests, all passing

Replace newline-delimited JSON with proper Content-Length header framing as required by MCP specification for stdio transport. Without this, Claude Agent SDK cannot communicate with the server.

All read-only tools (status, list_wings, list_rooms, get_taxonomy, search, check_duplicate, graph_stats, traverse, find_tunnels, diary_read) failed on a fresh palace because get_collection() throws when the ChromaDB collection doesn't exist yet. Changed _get_collection() default from create=False to create=True in mcp_server.py, and switched searcher.py and palace_graph.py to use get_or_create_collection. An empty collection is valid — read operations return empty results, which is the correct behavior.

MCP SDK v1.28+ switched from Content-Length framing to newline-delimited JSON for stdio transport. The server now auto-detects the transport format: if the first line starts with '{', it reads newline-delimited JSON; otherwise it falls back to Content-Length framing for backwards compatibility.
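The detection rule described in this commit can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from mcp_server.py: the function name `read_message` is hypothetical, and it decides per message, whereas the commit text suggests the server decides from the first line it receives.

```python
import io
import json


def read_message(stream):
    """Read one JSON message from a binary stream; return None at EOF."""
    first = stream.readline()
    if not first:
        return None

    # Newline-delimited JSON: the line itself is the message.
    if first.lstrip().startswith(b"{"):
        return json.loads(first)

    # Otherwise assume Content-Length framing: header lines, a blank
    # line, then exactly `length` bytes of JSON body.
    length = None
    line = first
    while line.strip():
        name, _, value = line.decode("ascii").partition(":")
        if name.strip().lower() == "content-length":
            length = int(value.strip())
        line = stream.readline()
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(stream.read(length))


body = b'{"jsonrpc":"2.0","id":1}'
framed = b"Content-Length: %d\r\n\r\n" % len(body) + body
print(read_message(io.BytesIO(framed)))
print(read_message(io.BytesIO(body + b"\n")))
```

Both calls decode to the same dictionary, which is the point of the fallback: a client using either framing is served by the same read loop.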

KnowledgeGraph SQLite was hardcoded to ~/.mempalace/knowledge_graph.sqlite3 which is ephemeral in container environments. Move it to palace_path (same directory as ChromaDB) so it benefits from persistent storage mounts. Swap init order: MempalaceConfig() before KnowledgeGraph() so palace_path is available at KG construction time.
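A minimal sketch of the path change, assuming the database keeps its knowledge_graph.sqlite3 file name inside palace_path. The helper `open_knowledge_graph` is hypothetical and stands in for the KnowledgeGraph constructor; a temporary directory plays the role of the persistent mount.

```python
import os
import sqlite3
import tempfile


def open_knowledge_graph(palace_path):
    """Open the KG database next to the palace data (hypothetical helper)."""
    os.makedirs(palace_path, exist_ok=True)
    # Assumed: same file name as before, now under palace_path
    # instead of the hardcoded ~/.mempalace directory.
    db_path = os.path.join(palace_path, "knowledge_graph.sqlite3")
    return sqlite3.connect(db_path), db_path


# The palace path must be known before the KG is opened, which is why
# the commit builds the config first.
palace_path = os.path.join(tempfile.mkdtemp(), "palace")
connection, db_path = open_knowledge_graph(palace_path)
print(os.path.dirname(db_path) == palace_path)
connection.close()
```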

Merge 198 upstream commits including:
- fix: add->upsert in convo_miner.py (HNSW bloat prevention)
- fix: MCP null args hang, repair infinite recursion, OOM on large files
- feat: WAL (write-ahead log) for write audit trail
- feat: ChromaDB client caching (singleton pattern)
- feat: repair.py and dedup.py modules
- feat: mempalace migrate (ChromaDB version recovery)
- security: input sanitization, shell injection hardening
- fix: query sanitizer (prompt contamination)
- docs: honest AAAK stats, benchmark corrections

Our custom changes preserved on top:
- KG path: always store in palace_path (not just with --palace arg)
- _get_collection(create=True): fresh palaces work without errors
- Russian language module (languages/ru.py)
- Unicode/i18n support (regex-based)

Dropped our custom transport layer: upstream's newline-delimited JSON matches MCP SDK 1.28+ (our current version). Content-Length fallback was dead code.

- Update 5 tests that expected errors on empty/missing palace to match our get_or_create_collection behavior (returns empty results, not errors)
- Fix mock targets from get_collection to get_or_create_collection
- Remove chromadb upper bound (<0.7) to allow chromadb 1.x, which is required for Python 3.14 compatibility

The sanitize_name regex only allows [a-zA-Z0-9_ .'-] which is too restrictive for KG triple objects that contain commas, currency symbols, parentheses, and other natural-language characters. Use sanitize_content (length + null-byte check) instead, matching the semantic intent of the object field as free-form text.
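The difference between the two validators can be sketched as below. The character class is the one quoted in the commit message; everything else is an assumption for illustration, including the `MAX_CONTENT_LENGTH` value, the error messages, and the exact function bodies, which are not copied from the project.

```python
import re

# Character class as quoted in the commit message.
NAME_RE = re.compile(r"[a-zA-Z0-9_ .'-]+")

# Assumed limit; the real value is not stated in the PR.
MAX_CONTENT_LENGTH = 10_000


def sanitize_name(value):
    """Strict validator: identifiers only."""
    if not NAME_RE.fullmatch(value):
        raise ValueError("name contains disallowed characters")
    return value


def sanitize_content(value, max_length=MAX_CONTENT_LENGTH):
    """Lenient validator: free-form text, bounded length, no null bytes."""
    if "\x00" in value:
        raise ValueError("content contains a null byte")
    if len(value) > max_length:
        raise ValueError("content is too long")
    return value


triple_object = "paid $1,200 (net) in March"
print(sanitize_content(triple_object))
try:
    sanitize_name(triple_object)
except ValueError as error:
    print("sanitize_name rejected it:", error)
```

The example object contains a currency symbol, a comma, and parentheses, all of which fall outside the name character class while being ordinary in free-form text.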

Upstream brings: i18n entity detection (replaces our custom languages module), chromadb >=1.5.4 pin, HNSW quarantine recovery, plugin specs (source adapter, storage backend), PID file guards, sweeper, exporter, fact-checker, and VitePress documentation site. Conflicts resolved by taking upstream for all 7 files — our custom changes (sanitize_content fix, chromadb pin, regex dependency, languages module) are superseded by upstream's implementations. Removed orphaned mempalace/languages/ module (replaced by mempalace/i18n/ with JSON-based per-language patterns).

- test_unicode.py: pass languages=("ru",) to extract_candidates and score_entity (upstream defaults to English-only), remove tests for fork-only features (profanity stopwords, Russian emotion/flag keywords in Dialect) that are superseded by upstream's i18n system
- test_mcp_server.py: align empty palace test with upstream behavior (create collection before calling status, check total_drawers/wings instead of rooms)

Ordinath added 13 commits April 7, 2026 11:13

fix: use Content-Length framing for MCP stdio transport

9e39405


Ordinath requested review from bensig, igorls and milla-jovovich as code owners April 22, 2026 21:03

Ordinath wants to merge 13 commits into MemPalace:main from Ordinath:merge/upstream-v3.3.2
