merge: upstream v3.3.2 #1116

Open
Ordinath wants to merge 13 commits into MemPalace:main from Ordinath:merge/upstream-v3.3.2

@Ordinath

Summary

Merges upstream milla-jovovich/mempalace main (v3.3.2) into fork. 286 commits, covering releases v3.2.0 through v3.3.2.

  • Resolved 7 merge conflicts by taking upstream — our custom fixes (sanitize_content, chromadb pin, regex dep, languages module) are all superseded by upstream implementations
  • Removed orphaned mempalace/languages/ module (replaced by upstream's mempalace/i18n/ JSON-based per-language patterns with full Russian support)
  • Adapted fork tests for upstream's i18n API (pass languages=("ru",) to entity detection functions)
  • Updated chromadb dependency from >=0.5.0 to >=1.5.4,<2

Key upstream additions:

  • i18n entity detection (13 languages including Russian)
  • HNSW quarantine recovery
  • PID file guards
  • Plugin specs (source adapter, storage backend)
  • Sweeper, exporter, fact-checker modules
  • VitePress documentation site

Test plan

  • Full test suite passes (1052 passed, 0 failed)
  • Palace data integrity verified (search + KG queries work against live data)
  • Editable install with chromadb 1.5.8 succeeds

Ordinath added 13 commits April 7, 2026 11:13
Replace Python's `re` module with `regex` for Unicode category support
(\p{Lu}, \p{Ll}, \p{L}), making entity detection and AAAK dialect
compression work with any script (Cyrillic, Greek, Arabic, CJK, etc.)
without maintaining explicit character ranges.

Changes:
- entity_detector.py: Unicode-aware candidate extraction regex,
  Russian person/project verb patterns, pronouns, stopwords
- dialect.py: Unicode-aware topic extraction and entity detection,
  Russian emotion signals, flag signals, stop words, decision words
- Add `regex>=2024.0.0` dependency (drop-in `re` replacement)
- Add 19 tests covering Cyrillic entity detection, mixed-language
  text, emotion/flag detection, topic extraction, stopword filtering

Adding a new language requires only appending keyword blocks —
regex patterns are universal via Unicode categories.
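As a rough illustration of what the Unicode categories buy, here is a stdlib-only sketch. It does not use the `regex` package; it approximates `\p{Lu}\p{Ll}+` with `unicodedata.category`. The helper names are hypothetical and are not taken from entity_detector.py.

```python
import re
import unicodedata

# ASCII-only pattern of the kind that explicit character ranges produce.
ASCII_CANDIDATE = re.compile(r"\b[A-Z][a-z]+\b")


def is_capitalized_word(token: str) -> bool:
    # Approximates \p{Lu}\p{Ll}+ : one uppercase letter followed by
    # one or more lowercase letters, in any script.
    if len(token) < 2:
        return False
    first, rest = token[0], token[1:]
    return unicodedata.category(first) == "Lu" and all(
        unicodedata.category(ch) == "Ll" for ch in rest
    )


def extract_candidates_sketch(text: str) -> list[str]:
    # \w is Unicode-aware for str patterns in Python 3, so this tokenizes
    # Cyrillic and Latin words alike.
    return [tok for tok in re.findall(r"\w+", text) if is_capitalized_word(tok)]


text = "Вчера Мария и Alice обсуждали проект"
ascii_hits = ASCII_CANDIDATE.findall(text)
unicode_hits = extract_candidates_sketch(text)
print(ascii_hits)
print(unicode_hits)
```

The ASCII pattern only sees `Alice`, while the category-based check also picks up the capitalized Cyrillic words. Sentence-initial words such as `Вчера` still get through, which is presumably why the stopword lists matter.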
Move all language-specific patterns, stopwords, and keyword signals
from entity_detector.py and dialect.py into mempalace/languages/ package.
Each language (en.py, ru.py) is a self-contained module exporting a
standard set of constants. Adding a new language = one file + one import.
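A minimal sketch of the "one file + one import" layout, assuming each language module exposes the same constant names. `SimpleNamespace` objects stand in for the real `en.py` and `ru.py` modules, the word lists are shortened placeholders, and `LANGUAGES` and `merged` are hypothetical names.

```python
from types import SimpleNamespace

# Stand-ins for mempalace/languages/en.py and ru.py; the real modules
# export many more constants and much longer lists.
en = SimpleNamespace(
    ENTITY_STOPWORDS={"the", "this"},
    DECISION_WORDS={"decided", "chose"},
)
ru = SimpleNamespace(
    ENTITY_STOPWORDS={"это", "этот"},
    DECISION_WORDS={"решил", "выбрал"},
)

# Registering a new language is one more entry here (hypothetical registry).
LANGUAGES = {"en": en, "ru": ru}


def merged(constant: str) -> set[str]:
    # Union of one constant across every registered language.
    result: set[str] = set()
    for module in LANGUAGES.values():
        result |= getattr(module, constant)
    return result


print(sorted(merged("DECISION_WORDS")))
```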
- PERSON_VERB_PATTERNS: 13 -> 33 patterns (perception, communication,
  thought, reflexive verbs, formal address)
- ENTITY_STOPWORDS: 80 -> 150+ words (nouns, connectors, adjectives,
  profanity that appears capitalized at sentence starts)
- PRONOUN_PATTERNS: 9 -> 17 (added prepositional forms: ней, нём, них,
  него, неё, нему, ним + ё/е variant handling)
- EMOTION_SIGNALS: 21 -> 40 (pride, shame, exhaustion, inspiration,
  doubt, despair, peace + profanity-based emotions)
- FLAG_SIGNALS: 31 -> 52 (more DECISION/ORIGIN/CORE/PIVOT/TECHNICAL)
- TOPIC_STOPWORDS: 60 -> 90+ (verbs, adjectives, profanity)
- DECISION_WORDS: 16 -> 25
- PROJECT_VERB_PATTERNS: 9 -> 17

Fixed рассмеялс[яь] regex to рассмеял[аи]?с[яь] (was missing
feminine/plural forms). Added ё/е variant handling throughout.
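The regex fix can be checked with the stdlib `re` module, since both patterns use plain character classes. The `н[её]м` pattern at the end is only an assumed example of what ё/е variant handling might look like; it is not copied from the module.

```python
import re

OLD = re.compile(r"рассмеялс[яь]")
NEW = re.compile(r"рассмеял[аи]?с[яь]")

forms = {
    "masculine": "рассмеялся",
    "feminine": "рассмеялась",
    "plural": "рассмеялись",
}

# Which grammatical forms each pattern recognizes.
old_hits = {name: bool(OLD.search(word)) for name, word in forms.items()}
new_hits = {name: bool(NEW.search(word)) for name, word in forms.items()}
print(old_hits)
print(new_hits)

# Assumed illustration of ё/е handling: accept both spellings of "нём".
PREPOSITIONAL = re.compile(r"\bн[её]м\b")
yo_hits = [bool(PREPOSITIONAL.search(s)) for s in ("о нём", "о нем")]
print(yo_hits)
```

The old pattern only matches the masculine form because it requires `с` immediately after `рассмеял`; the optional `[аи]` admits the feminine and plural endings.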
- Pin regex==2026.4.4 (supply chain safety)
- Fix misleading "language-neutral" comment on DIALOGUE_PATTERNS
- Fix weak assertion in test_key_sentence_russian (always passed)
- Add 5 new tests: profanity emotions, profanity stopwords,
  new person verb patterns, new flag signals, prepositional pronouns
- Total: 33 tests, all passing

Replace newline-delimited JSON with proper Content-Length header
framing as required by MCP specification for stdio transport.
Without this, Claude Agent SDK cannot communicate with the server.
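For reference, the two framings differ only in how a message boundary is marked. The sketch below shows both encoders side by side, using header framing in the style of the Language Server Protocol; the function names are hypothetical and this is not the server's actual code.

```python
import json


def frame_with_content_length(payload: dict) -> bytes:
    # Header framing: the length counts bytes of the UTF-8 body,
    # not characters, and a blank line separates header from body.
    body = json.dumps(payload).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def frame_as_ndjson(payload: dict) -> bytes:
    # Newline-delimited JSON: one compact JSON document per line.
    return json.dumps(payload).encode("utf-8") + b"\n"


request = {"jsonrpc": "2.0", "id": 1, "method": "ping"}
framed = frame_with_content_length(request)
line = frame_as_ndjson(request)
print(framed)
print(line)
```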
All read-only tools (status, list_wings, list_rooms, get_taxonomy,
search, check_duplicate, graph_stats, traverse, find_tunnels,
diary_read) failed on a fresh palace because get_collection() throws
when the ChromaDB collection doesn't exist yet.

Changed _get_collection() default from create=False to create=True
in mcp_server.py, and switched searcher.py and palace_graph.py to
use get_or_create_collection. An empty collection is valid — read
operations return empty results, which is the correct behavior.
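The failure mode and the fix can be sketched without ChromaDB installed. `FakeClient` below is a hypothetical in-memory stand-in that mimics only the two calls involved; the collection name and `count_drawers` are also assumptions, and the real exception type raised by `get_collection` depends on the chromadb version.

```python
class FakeClient:
    """In-memory stand-in for a ChromaDB client (illustration only)."""

    def __init__(self):
        self._collections = {}

    def get_collection(self, name):
        # Mirrors the reported behavior: a missing collection is an error.
        if name not in self._collections:
            raise ValueError(f"Collection {name} does not exist")
        return self._collections[name]

    def get_or_create_collection(self, name):
        # Creates an empty collection on first access.
        return self._collections.setdefault(name, [])


def count_drawers(client, name="mempalace_drawers", create=True):
    # create=True corresponds to the new default described above.
    if create:
        collection = client.get_or_create_collection(name)
    else:
        collection = client.get_collection(name)
    return len(collection)


client = FakeClient()
try:
    count_drawers(client, create=False)
except ValueError as exc:
    print("fresh palace, create=False:", exc)
print("fresh palace, create=True:", count_drawers(client, create=True))
```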
MCP SDK v1.28+ switched from Content-Length framing to newline-
delimited JSON for stdio transport. The server now auto-detects the
transport format: if the first line starts with '{', it reads
newline-delimited JSON; otherwise it falls back to Content-Length
framing for backwards compatibility.
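A hedged sketch of the auto-detection idea follows. It checks the leading byte of each message rather than only the first line of the stream, which is a simplification of what the commit describes, and `read_message` is a hypothetical name.

```python
import io
import json
from typing import BinaryIO, Optional


def read_message(stream: BinaryIO) -> Optional[dict]:
    line = stream.readline()
    if not line:
        return None  # end of stream
    if line.lstrip().startswith(b"{"):
        # Newline-delimited JSON: the line itself is the message.
        return json.loads(line)
    # Otherwise treat the line as the start of a header block that
    # ends with a blank line, then read exactly Content-Length bytes.
    content_length = None
    while line.strip():
        name, _, value = line.decode("ascii").partition(":")
        if name.strip().lower() == "content-length":
            content_length = int(value.strip())
        line = stream.readline()
    if content_length is None:
        raise ValueError("expected a Content-Length header")
    return json.loads(stream.read(content_length))


body = b'{"id": 2}'
framed = b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n\r\n" + body
print(read_message(io.BytesIO(b'{"id": 1}\n')))
print(read_message(io.BytesIO(framed)))
```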
KnowledgeGraph SQLite was hardcoded to ~/.mempalace/knowledge_graph.sqlite3
which is ephemeral in container environments. Move it to palace_path
(same directory as ChromaDB) so it benefits from persistent storage mounts.

Swap init order: MempalaceConfig() before KnowledgeGraph() so palace_path
is available at KG construction time.
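The ordering dependency can be sketched as follows. `ConfigSketch`, `KnowledgeGraphSketch`, `open_kg`, and the table schema are hypothetical stand-ins; only the file name `knowledge_graph.sqlite3` and the idea of deriving the path from `palace_path` come from the description above.

```python
import sqlite3
import tempfile
from pathlib import Path


class ConfigSketch:
    # Stand-in for MempalaceConfig: only carries palace_path here.
    def __init__(self, palace_path):
        self.palace_path = Path(palace_path)


class KnowledgeGraphSketch:
    # Stand-in for KnowledgeGraph: opens SQLite at the given path.
    def __init__(self, db_path: Path):
        db_path.parent.mkdir(parents=True, exist_ok=True)
        self.db_path = db_path
        self.conn = sqlite3.connect(db_path)
        # Hypothetical schema, just so the file has content.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS triples (subject TEXT, predicate TEXT, object TEXT)"
        )
        self.conn.commit()


def open_kg(config: ConfigSketch) -> KnowledgeGraphSketch:
    # The config must exist first so the KG can live next to the palace data.
    return KnowledgeGraphSketch(config.palace_path / "knowledge_graph.sqlite3")


with tempfile.TemporaryDirectory() as tmp:
    kg = open_kg(ConfigSketch(Path(tmp) / "palace"))
    print(kg.db_path.name, kg.db_path.exists())
    kg.conn.close()
```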
Merge 198 upstream commits including:
- fix: add->upsert in convo_miner.py (HNSW bloat prevention)
- fix: MCP null args hang, repair infinite recursion, OOM on large files
- feat: WAL (write-ahead log) for write audit trail
- feat: ChromaDB client caching (singleton pattern)
- feat: repair.py and dedup.py modules
- feat: mempalace migrate (ChromaDB version recovery)
- security: input sanitization, shell injection hardening
- fix: query sanitizer (prompt contamination)
- docs: honest AAAK stats, benchmark corrections

Our custom changes preserved on top:
- KG path: always store in palace_path (not just with --palace arg)
- _get_collection(create=True): fresh palaces work without errors
- Russian language module (languages/ru.py)
- Unicode/i18n support (regex-based)

Dropped our custom transport layer — upstream's newline-delimited JSON
matches MCP SDK 1.28+ (our current version). Content-Length fallback
was dead code.

- Update 5 tests that expected errors on empty/missing palace to match
  our get_or_create_collection behavior (returns empty results, not errors)
- Fix mock targets from get_collection to get_or_create_collection
- Remove chromadb upper bound (<0.7) to allow chromadb 1.x which is
  required for Python 3.14 compatibility

The sanitize_name regex only allows [a-zA-Z0-9_ .'-] which is too
restrictive for KG triple objects that contain commas, currency symbols,
parentheses, and other natural-language characters. Use sanitize_content
(length + null-byte check) instead, matching the semantic intent of the
object field as free-form text.
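The difference between the two validators can be sketched like this. The character class is the one quoted above; the function names carry a `_sketch` suffix because the bodies, the length limit, and the error messages are assumptions rather than the project's code.

```python
import re

# Character class quoted in the commit message.
NAME_RE = re.compile(r"[a-zA-Z0-9_ .'-]+")
MAX_CONTENT_LENGTH = 10_000  # hypothetical limit


def sanitize_name_sketch(value: str) -> str:
    # Strict allow-list: suitable for identifiers, too narrow for prose.
    if not NAME_RE.fullmatch(value):
        raise ValueError("name contains disallowed characters")
    return value


def sanitize_content_sketch(value: str, max_length: int = MAX_CONTENT_LENGTH) -> str:
    # Free-form text: only reject oversized input and null bytes.
    if len(value) > max_length:
        raise ValueError("content too long")
    if "\x00" in value:
        raise ValueError("content contains a null byte")
    return value


triple_object = "raised $2,000 (seed round)"
try:
    sanitize_name_sketch(triple_object)
except ValueError as exc:
    print("sanitize_name:", exc)
print("sanitize_content:", sanitize_content_sketch(triple_object))
```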
Upstream brings: i18n entity detection (replaces our custom languages module),
chromadb >=1.5.4 pin, HNSW quarantine recovery, plugin specs (source adapter,
storage backend), PID file guards, sweeper, exporter, fact-checker, and
VitePress documentation site.

Conflicts resolved by taking upstream for all 7 files — our custom changes
(sanitize_content fix, chromadb pin, regex dependency, languages module) are
superseded by upstream's implementations. Removed orphaned mempalace/languages/
module (replaced by mempalace/i18n/ with JSON-based per-language patterns).

- test_unicode.py: pass languages=("ru",) to extract_candidates and
  score_entity (upstream defaults to English-only), remove tests for
  fork-only features (profanity stopwords, Russian emotion/flag keywords
  in Dialect) that are superseded by upstream's i18n system
- test_mcp_server.py: align empty palace test with upstream behavior
  (create collection before calling status, check total_drawers/wings
  instead of rooms)