Replace Python's `re` module with `regex` for Unicode category support
(\p{Lu}, \p{Ll}, \p{L}), making entity detection and AAAK dialect
compression work with any script (Cyrillic, Greek, Arabic, CJK, etc.)
without maintaining explicit character ranges.
Changes:
- entity_detector.py: Unicode-aware candidate extraction regex,
  Russian person/project verb patterns, pronouns, stopwords
- dialect.py: Unicode-aware topic extraction and entity detection,
  Russian emotion signals, flag signals, stop words, decision words
- Add `regex>=2024.0.0` dependency (drop-in `re` replacement)
- Add 19 tests covering Cyrillic entity detection, mixed-language
  text, emotion/flag detection, topic extraction, stopword filtering
Adding a new language requires only appending keyword blocks —
regex patterns are universal via Unicode categories.
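
To illustrate why Unicode categories remove the need for per-script character ranges, here is a minimal stdlib-only sketch. It is not the code from this commit: the commit uses the third-party `regex` module with `\p{Lu}` / `\p{Ll}`, while this stand-in asks `unicodedata.category` the same question. The helper names are hypothetical.

```python
import unicodedata


def is_capitalized_word(word: str) -> bool:
    """True if the word is one uppercase letter (Lu) followed by lowercase letters (Ll)."""
    if len(word) < 2:
        return False
    first, rest = word[0], word[1:]
    return (unicodedata.category(first) == "Lu"
            and all(unicodedata.category(ch) == "Ll" for ch in rest))


def extract_candidates(text: str) -> list[str]:
    """Return capitalized words in any script, with surrounding punctuation stripped."""
    words = (w.strip(".,!?;:\"'()") for w in text.split())
    return [w for w in words if is_capitalized_word(w)]


candidates = extract_candidates("Вчера Мария встретила Ивана, and Alice met Bob.")
# "Вчера" (yesterday) is only capitalized because it starts the sentence,
# which is the kind of false positive a per-language stopword list removes.
```

The same check works for Latin and Cyrillic without listing either alphabet, which is the property the commit message relies on.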
Move all language-specific patterns, stopwords, and keyword signals from entity_detector.py and dialect.py into mempalace/languages/ package. Each language (en.py, ru.py) is a self-contained module exporting a standard set of constants. Adding a new language = one file + one import.
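
A rough sketch of the "one file + one import" layout described above. The `SimpleNamespace` objects stand in for the real `en.py` and `ru.py` modules, the word lists are illustrative rather than the actual contents, and `LANGUAGES` and `merged` are hypothetical names.

```python
from types import SimpleNamespace

# Stand-ins for mempalace/languages/en.py and ru.py. Each "module" exports
# the same set of constant names; the values here are made up for the example.
en = SimpleNamespace(
    ENTITY_STOPWORDS={"the", "this", "yesterday"},
    DECISION_WORDS={"decided", "chose"},
)
ru = SimpleNamespace(
    ENTITY_STOPWORDS={"вчера", "это"},
    DECISION_WORDS={"решил", "выбрал"},
)

# Registering a new language is one extra entry here (the "one import").
LANGUAGES = {"en": en, "ru": ru}


def merged(constant: str, codes=("en", "ru")) -> set[str]:
    """Union one exported constant across the selected language modules."""
    out: set[str] = set()
    for code in codes:
        out |= getattr(LANGUAGES[code], constant)
    return out
```

Callers such as the entity detector can then ask for `merged("ENTITY_STOPWORDS")` without knowing which languages are installed.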
- PERSON_VERB_PATTERNS: 13 -> 33 patterns (perception, communication, thought, reflexive verbs, formal address)
- ENTITY_STOPWORDS: 80 -> 150+ words (nouns, connectors, adjectives, profanity that appears capitalized at sentence starts)
- PRONOUN_PATTERNS: 9 -> 17 (added prepositional forms: ней, нём, них, него, неё, нему, ним + ё/е variant handling)
- EMOTION_SIGNALS: 21 -> 40 (pride, shame, exhaustion, inspiration, doubt, despair, peace + profanity-based emotions)
- FLAG_SIGNALS: 31 -> 52 (more DECISION/ORIGIN/CORE/PIVOT/TECHNICAL)
- TOPIC_STOPWORDS: 60 -> 90+ (verbs, adjectives, profanity)
- DECISION_WORDS: 16 -> 25
- PROJECT_VERB_PATTERNS: 9 -> 17

Fixed рассмеялс[яь] regex to рассмеял[аи]?с[яь] (was missing feminine/plural forms). Added ё/е variant handling throughout.
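
The effect of the рассмеялс[яь] fix can be checked directly. This sketch uses the stdlib `re` module (the patterns contain no Unicode categories, so `regex` is not needed here) and tries both patterns against the masculine, feminine, and plural past-tense forms.

```python
import re

OLD = re.compile(r"рассмеялс[яь]")       # pattern before the fix
NEW = re.compile(r"рассмеял[аи]?с[яь]")  # pattern after the fix

# masculine, feminine, plural ("laughed")
forms = ["рассмеялся", "рассмеялась", "рассмеялись"]

old_hits = [f for f in forms if OLD.search(f)]
new_hits = [f for f in forms if NEW.search(f)]

print(old_hits)  # ['рассмеялся']
print(new_hits)  # ['рассмеялся', 'рассмеялась', 'рассмеялись']
```

The optional `[аи]?` group is what admits the gender and number vowel between the stem and the reflexive ending.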
- Pin regex==2026.4.4 (supply chain safety)
- Fix misleading "language-neutral" comment on DIALOGUE_PATTERNS
- Fix weak assertion in test_key_sentence_russian (always passed)
- Add 5 new tests: profanity emotions, profanity stopwords, new person verb patterns, new flag signals, prepositional pronouns
- Total: 33 tests, all passing
Replace newline-delimited JSON with proper Content-Length header framing as required by MCP specification for stdio transport. Without this, Claude Agent SDK cannot communicate with the server.
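
For reference, a minimal sketch of Content-Length header framing as this commit describes it: each JSON body is preceded by a `Content-Length: N` header and a blank line. The function names are hypothetical and this is not the server's actual code.

```python
import io
import json


def write_framed(stream, message: dict) -> None:
    """Write one message as 'Content-Length: N', a blank line, then N body bytes."""
    body = json.dumps(message).encode("utf-8")
    stream.write(f"Content-Length: {len(body)}\r\n\r\n".encode("ascii"))
    stream.write(body)


def read_framed(stream) -> dict:
    """Read headers up to the blank line, then exactly Content-Length body bytes."""
    length = None
    while True:
        line = stream.readline()
        if not line:
            raise EOFError("stream closed before headers ended")
        line = line.strip()
        if not line:  # blank line terminates the header block
            break
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(stream.read(length).decode("utf-8"))


# Round trip through an in-memory stream standing in for stdin/stdout.
buf = io.BytesIO()
write_framed(buf, {"jsonrpc": "2.0", "id": 1, "method": "ping"})
write_framed(buf, {"jsonrpc": "2.0", "id": 2, "method": "ping"})
buf.seek(0)
first, second = read_framed(buf), read_framed(buf)
```

Because the length counts bytes, not characters, the body is encoded before it is measured.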
All read-only tools (status, list_wings, list_rooms, get_taxonomy, search, check_duplicate, graph_stats, traverse, find_tunnels, diary_read) failed on a fresh palace because get_collection() throws when the ChromaDB collection doesn't exist yet. Changed _get_collection() default from create=False to create=True in mcp_server.py, and switched searcher.py and palace_graph.py to use get_or_create_collection. An empty collection is valid — read operations return empty results, which is the correct behavior.
MCP SDK v1.28+ switched from Content-Length framing to newline-
delimited JSON for stdio transport. The server now auto-detects the
transport format: if the first line starts with '{', it reads
newline-delimited JSON; otherwise it falls back to Content-Length
framing for backwards compatibility.
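
The detection rule above can be sketched as follows. This is an illustration under assumptions, not the fork's implementation: `read_message` is a hypothetical name, and for simplicity it inspects the first line of every message rather than deciding once per connection.

```python
import io
import json


def read_message(stream):
    """Read one message, choosing the framing from its first line."""
    first = stream.readline()
    if not first:
        return None  # end of input
    if first.lstrip().startswith(b"{"):
        # Newline-delimited JSON: the whole message is on this line.
        return json.loads(first)
    # Otherwise treat the line as the start of a Content-Length header block.
    length = None
    line = first
    while line.strip():
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
        line = stream.readline()
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(stream.read(length))


body = b'{"id": 2}'
ndjson = io.BytesIO(b'{"id": 1}\n')
framed = io.BytesIO(b"Content-Length: %d\r\n\r\n" % len(body) + body)
```

Both `read_message(ndjson)` and `read_message(framed)` return a parsed dict, so the rest of the server does not need to know which framing the client chose.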
KnowledgeGraph SQLite was hardcoded to ~/.mempalace/knowledge_graph.sqlite3 which is ephemeral in container environments. Move it to palace_path (same directory as ChromaDB) so it benefits from persistent storage mounts. Swap init order: MempalaceConfig() before KnowledgeGraph() so palace_path is available at KG construction time.
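
A small sketch of the path change, assuming the database keeps the file name `knowledge_graph.sqlite3` from the old hardcoded location. `kg_db_path` and `open_kg` are hypothetical helpers, not functions from the codebase.

```python
import os
import sqlite3

KG_FILENAME = "knowledge_graph.sqlite3"


def kg_db_path(palace_path: str) -> str:
    """Place the knowledge graph next to the ChromaDB data instead of under ~/.mempalace."""
    return os.path.join(palace_path, KG_FILENAME)


def open_kg(palace_path: str) -> sqlite3.Connection:
    """Open (and create if needed) the KG database inside palace_path."""
    os.makedirs(palace_path, exist_ok=True)
    return sqlite3.connect(kg_db_path(palace_path))
```

This is also why the init order matters: the config has to be loaded first so that `palace_path` is known before the connection is opened.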
Merge 198 upstream commits including:
- fix: add->upsert in convo_miner.py (HNSW bloat prevention)
- fix: MCP null args hang, repair infinite recursion, OOM on large files
- feat: WAL (write-ahead log) for write audit trail
- feat: ChromaDB client caching (singleton pattern)
- feat: repair.py and dedup.py modules
- feat: mempalace migrate (ChromaDB version recovery)
- security: input sanitization, shell injection hardening
- fix: query sanitizer (prompt contamination)
- docs: honest AAAK stats, benchmark corrections

Our custom changes preserved on top:
- KG path: always store in palace_path (not just with --palace arg)
- _get_collection(create=True): fresh palaces work without errors
- Russian language module (languages/ru.py)
- Unicode/i18n support (regex-based)

Dropped our custom transport layer: upstream's newline-delimited JSON matches MCP SDK 1.28+ (our current version). Content-Length fallback was dead code.
- Update 5 tests that expected errors on empty/missing palace to match our get_or_create_collection behavior (returns empty results, not errors)
- Fix mock targets from get_collection to get_or_create_collection
- Remove chromadb upper bound (<0.7) to allow chromadb 1.x which is required for Python 3.14 compatibility
The sanitize_name regex only allows [a-zA-Z0-9_ .'-] which is too restrictive for KG triple objects that contain commas, currency symbols, parentheses, and other natural-language characters. Use sanitize_content (length + null-byte check) instead, matching the semantic intent of the object field as free-form text.
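
The difference between the two checks can be sketched like this. The character class is the one quoted above; the function bodies and the length limit are assumptions made for illustration, not the project's actual implementations.

```python
import re

NAME_RE = re.compile(r"^[a-zA-Z0-9_ .'-]+$")
MAX_CONTENT_LENGTH = 10_000  # assumed limit, not taken from the codebase


def sanitize_name(value: str) -> str:
    """Strict check suited to identifiers such as wing or room names."""
    if not NAME_RE.match(value):
        raise ValueError(f"invalid name: {value!r}")
    return value


def sanitize_content(value: str, max_length: int = MAX_CONTENT_LENGTH) -> str:
    """Permissive check for free-form text: only length and null bytes."""
    if "\x00" in value:
        raise ValueError("content contains a null byte")
    if len(value) > max_length:
        raise ValueError("content too long")
    return value


triple_object = "$1,200 (paid quarterly)"
```

Here `sanitize_content(triple_object)` passes, while `sanitize_name(triple_object)` raises `ValueError` because of the `$`, the comma, and the parentheses.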
Upstream brings: i18n entity detection (replaces our custom languages module), chromadb >=1.5.4 pin, HNSW quarantine recovery, plugin specs (source adapter, storage backend), PID file guards, sweeper, exporter, fact-checker, and VitePress documentation site. Conflicts resolved by taking upstream for all 7 files — our custom changes (sanitize_content fix, chromadb pin, regex dependency, languages module) are superseded by upstream's implementations. Removed orphaned mempalace/languages/ module (replaced by mempalace/i18n/ with JSON-based per-language patterns).
- test_unicode.py: pass languages=("ru",) to extract_candidates and
  score_entity (upstream defaults to English-only), remove tests for
  fork-only features (profanity stopwords, Russian emotion/flag keywords
  in Dialect) that are superseded by upstream's i18n system
- test_mcp_server.py: align empty palace test with upstream behavior
  (create collection before calling status, check total_drawers/wings
  instead of rooms)
Summary
Merges upstream milla-jovovich/mempalace main (v3.3.2) into fork. 286 commits, covering releases v3.2.0 through v3.3.2.
- `mempalace/languages/` module (replaced by upstream's `mempalace/i18n/` JSON-based per-language patterns with full Russian support)
- [...] `languages=("ru",)` to entity detection functions)
- [...] `>=0.5.0` to `>=1.5.4,<2`

Key upstream additions:
Test plan