feat: configurable chunk_size, chunk_overlap, min_chunk_size#1024

Open
jphein wants to merge 3 commits into MemPalace:develop from jphein:pr/configurable-chunking

Conversation

@jphein
Collaborator

@jphein jphein commented Apr 19, 2026

Why

Chunk sizing in the miner is currently hardcoded via three module-level constants in mempalace/miner.py:

CHUNK_SIZE = 800     # chars per drawer
CHUNK_OVERLAP = 100  # overlap between chunks
MIN_CHUNK_SIZE = 50  # skip tiny chunks

One size does not fit all. Source material varies enormously — dense source code wants smaller, tighter chunks; prose transcripts and journal entries work better with larger, more overlapping chunks; sparse logs want very small min_chunk_size to avoid dropping short but meaningful entries. Users also have different context-window budgets: a palace driving a 200K-token Claude session can afford larger drawers than one driving an 8K local model.

Today the only way to change these values is to edit source and reinstall. This PR makes all three overridable from ~/.mempalace/config.json without changing defaults.

What

Adds three new optional keys to config.json:

{
  "chunk_size": 1200,
  "chunk_overlap": 150,
  "min_chunk_size": 40
}
  • chunk_size (default 800) — target character count per drawer
  • chunk_overlap (default 100) — characters shared between adjacent chunks so boundary information is never lost
  • min_chunk_size (default 50) — files (and tail chunks) smaller than this are skipped
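
A quick way to see how the three knobs interact is to compute the chunk start offsets under the plain size/overlap rule. This is an illustrative sketch only: `chunk_starts` is a hypothetical helper, and the real chunker additionally snaps chunk ends to paragraph boundaries.

```python
def chunk_starts(total_len, chunk_size=800, chunk_overlap=100):
    """Start offsets under the plain size/overlap rule (boundary snapping omitted)."""
    starts, start = [], 0
    while start < total_len:
        starts.append(start)
        end = min(start + chunk_size, total_len)
        # Step back by the overlap so adjacent chunks share a margin,
        # except after the final chunk.
        start = end - chunk_overlap if end < total_len else end
    return starts

chunk_starts(2000)            # → [0, 700, 1400] with the 800/100 defaults
chunk_starts(2000, 1200, 150) # → [0, 1050] with the example config above
```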

Exposed as MempalaceConfig properties:

cfg = MempalaceConfig()
cfg.chunk_size        # 800
cfg.chunk_overlap     # 100
cfg.min_chunk_size    # 50

Threaded through the mining path as optional keyword arguments:

mine() → reads MempalaceConfig once
      → process_file(..., chunk_size=, chunk_overlap=, min_chunk_size=)
         → chunk_text(..., chunk_size=, chunk_overlap=, min_chunk_size=)

chunk_text() and process_file() keep the module-level constants as fallbacks whenever a parameter is None, so any caller that omits the new kwargs gets identical behavior to before this PR.
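
The fallback rule described above is easy to sketch. This is a minimal illustration of the pattern, not the actual implementation in mempalace/miner.py; `resolve_chunk_params` is a hypothetical helper name.

```python
CHUNK_SIZE = 800     # module-level defaults, as before this PR
CHUNK_OVERLAP = 100
MIN_CHUNK_SIZE = 50

def resolve_chunk_params(chunk_size=None, chunk_overlap=None, min_chunk_size=None):
    """None means the caller didn't opt in: fall back to the module constants."""
    return (
        CHUNK_SIZE if chunk_size is None else chunk_size,
        CHUNK_OVERLAP if chunk_overlap is None else chunk_overlap,
        MIN_CHUNK_SIZE if min_chunk_size is None else min_chunk_size,
    )

resolve_chunk_params()                 # → (800, 100, 50), identical to pre-PR behavior
resolve_chunk_params(chunk_size=1200)  # → (1200, 100, 50), only the set key changes
```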

Backwards compatibility

  • Defaults unchanged: 800 / 100 / 50
  • config.json keys are optional — missing keys fall through to the same defaults
  • chunk_text(content, source_file) — the original two-arg signature — still works
  • Existing palaces behave byte-for-byte identically until a user explicitly sets a value

Tests

pytest tests/ -q -k "chunk or config"    → 62 passed
pytest tests/ -q -k "miner"              → 48 passed

No test changes required — defaults match existing hardcoded values exactly.

Scope note

The source commit on my fork (d63ffd4) bundled three changes:

  1. I6 — configurable chunking (this PR)
  2. I11 — Layer1.MAX_SCAN=2000 cap on wake-up scan
  3. I12 — _build_where_filter() helper extracted in searcher.py

Upstream develop already contains equivalents of I11 and I12, which made a straight cherry-pick produce tangled conflicts. I reconstructed the chunking slice cleanly on top of upstream/develop — I11 and I12 are therefore not in this PR and not needed.

Test plan

  • pytest -k "chunk or config" — 62 passed
  • pytest -k "miner" — 48 passed
  • MempalaceConfig().chunk_size / .chunk_overlap / .min_chunk_size return 800 / 100 / 50 with no config.json present
  • Setting "chunk_size": 1200 in ~/.mempalace/config.json propagates through mine() → process_file() → chunk_text()
  • Omitting the new keys preserves default behavior identically

Chunk sizing was hardcoded via module-level constants
(CHUNK_SIZE=800, CHUNK_OVERLAP=100, MIN_CHUNK_SIZE=50). One size
does not fit all — source material varies (dense code vs prose
transcripts vs sparse logs) and so do users' context-window
budgets.

This makes all three values overridable via ~/.mempalace/config.json:

    {
      "chunk_size": 1200,
      "chunk_overlap": 150,
      "min_chunk_size": 40
    }

Values are exposed as MempalaceConfig properties, threaded
through mine() -> process_file() -> chunk_text() as optional
keyword arguments. Defaults (800/100/50) are preserved when the
config keys are absent, so existing palaces behave identically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 19, 2026 03:55
jphein added a commit to jphein/mempalace that referenced this pull request Apr 19, 2026
Both filed against upstream/develop today:
- MemPalace#1023 — PID file guard prevents stacking mine processes
- MemPalace#1024 — configurable chunk_size / chunk_overlap / min_chunk_size

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR makes the miner’s chunking behavior user-configurable via ~/.mempalace/config.json, while keeping the current hardcoded defaults as the effective behavior when no overrides are provided.

Changes:

  • Adds chunk_size, chunk_overlap, and min_chunk_size optional parameters to chunk_text() and process_file(), and threads them through mine().
  • Adds MempalaceConfig properties for chunk_size, chunk_overlap, and min_chunk_size (with default values matching the prior constants).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File — Description
  • mempalace/miner.py — Threads configurable chunk sizing through mine() → process_file() → chunk_text() and applies the parameters in chunking logic.
  • mempalace/config.py — Exposes three new config properties reading chunk_* values from config.json with defaults.


Comment thread mempalace/config.py
Comment on lines +200 to +213
@property
def chunk_size(self):
    """Characters per drawer chunk."""
    return self._file_config.get("chunk_size", 800)

@property
def chunk_overlap(self):
    """Overlap between adjacent chunks."""
    return self._file_config.get("chunk_overlap", 100)

@property
def min_chunk_size(self):
    """Minimum chunk size — skip smaller chunks."""
    return self._file_config.get("min_chunk_size", 50)

Copilot AI Apr 19, 2026


These config properties return raw JSON values without type/range validation. A non-int (e.g. string) or out-of-range value (negative, min_chunk_size > chunk_size, etc.) will currently crash mining or produce 0 drawers unexpectedly. Consider coercing to int where possible and falling back to defaults (or raising a clear ValueError) when values are invalid, and enforce invariants like min_chunk_size <= chunk_size and 0 <= chunk_overlap < chunk_size.

Suggested change
@property
def chunk_size(self):
    """Characters per drawer chunk."""
    return self._file_config.get("chunk_size", 800)

@property
def chunk_overlap(self):
    """Overlap between adjacent chunks."""
    return self._file_config.get("chunk_overlap", 100)

@property
def min_chunk_size(self):
    """Minimum chunk size — skip smaller chunks."""
    return self._file_config.get("min_chunk_size", 50)

def _coerce_config_int(self, key, default, minimum=None):
    """Return a validated integer config value or the provided default."""
    value = self._file_config.get(key, default)
    if isinstance(value, bool):
        return default
    try:
        if isinstance(value, str):
            value = value.strip()
            if not value:
                return default
        value = int(value)
    except (TypeError, ValueError):
        return default
    if minimum is not None and value < minimum:
        return default
    return value

def _validated_chunk_config(self):
    """Return validated chunk configuration values."""
    default_chunk_size = 800
    default_chunk_overlap = 100
    default_min_chunk_size = 50
    chunk_size = self._coerce_config_int("chunk_size", default_chunk_size, minimum=1)
    chunk_overlap = self._coerce_config_int("chunk_overlap", default_chunk_overlap, minimum=0)
    min_chunk_size = self._coerce_config_int("min_chunk_size", default_min_chunk_size, minimum=0)
    if chunk_overlap >= chunk_size:
        chunk_overlap = (
            default_chunk_overlap
            if default_chunk_overlap < chunk_size
            else max(0, chunk_size - 1)
        )
    if min_chunk_size > chunk_size:
        min_chunk_size = (
            default_min_chunk_size
            if default_min_chunk_size <= chunk_size
            else chunk_size
        )
    return chunk_size, chunk_overlap, min_chunk_size

@property
def chunk_size(self):
    """Characters per drawer chunk."""
    return self._validated_chunk_config()[0]

@property
def chunk_overlap(self):
    """Overlap between adjacent chunks."""
    return self._validated_chunk_config()[1]

@property
def min_chunk_size(self):
    """Minimum chunk size — skip smaller chunks."""
    return self._validated_chunk_config()[2]

Comment thread mempalace/config.py
Comment on lines +200 to +213
@property
def chunk_size(self):
    """Characters per drawer chunk."""
    return self._file_config.get("chunk_size", 800)

@property
def chunk_overlap(self):
    """Overlap between adjacent chunks."""
    return self._file_config.get("chunk_overlap", 100)

@property
def min_chunk_size(self):
    """Minimum chunk size — skip smaller chunks."""
    return self._file_config.get("min_chunk_size", 50)

Copilot AI Apr 19, 2026


New public config surface (chunk_size, chunk_overlap, min_chunk_size) isn’t covered by existing config tests. Since there are already tests/test_config.py cases for config defaults and file overrides, adding assertions for these new properties (default values and file overrides, plus a couple invalid-value fallbacks if validation is added) would help prevent regressions.

Comment thread mempalace/miner.py
Comment on lines 371 to +410
@@ -370,20 +385,20 @@ def chunk_text(content: str, source_file: str) -> list:
     chunk_index = 0

     while start < len(content):
-        end = min(start + CHUNK_SIZE, len(content))
+        end = min(start + chunk_size, len(content))

         # Try to break at paragraph boundary
         if end < len(content):
             newline_pos = content.rfind("\n\n", start, end)
-            if newline_pos > start + CHUNK_SIZE // 2:
+            if newline_pos > start + chunk_size // 2:
                 end = newline_pos
             else:
                 newline_pos = content.rfind("\n", start, end)
-                if newline_pos > start + CHUNK_SIZE // 2:
+                if newline_pos > start + chunk_size // 2:
                     end = newline_pos

         chunk = content[start:end].strip()
-        if len(chunk) >= MIN_CHUNK_SIZE:
+        if len(chunk) >= min_chunk_size:
             chunks.append(
                 {
                     "content": chunk,
@@ -392,7 +407,7 @@ def chunk_text(content: str, source_file: str) -> list:
             )
             chunk_index += 1

-        start = end - CHUNK_OVERLAP if end < len(content) else end
+        start = end - chunk_overlap if end < len(content) else end

Copilot AI Apr 19, 2026


chunk_text() can enter an infinite loop (or behave unpredictably) if chunk_overlap >= chunk_size or if either value is non-positive. Since these values are now user-configurable, add explicit validation (e.g., require chunk_size > 0 and 0 <= chunk_overlap < chunk_size, and guard min_chunk_size bounds) and fail fast with a clear error or fall back to safe defaults so mining can't hang.
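
A fail-fast guard along the lines suggested might look like the following. This is a hypothetical helper sketch; the exact error messages, bounds, and where the check lives (config load vs. chunk_text entry) are up to the PR author.

```python
def validate_chunk_params(chunk_size, chunk_overlap, min_chunk_size):
    """Reject configurations that would hang or degenerate the chunking loop."""
    if not isinstance(chunk_size, int) or chunk_size <= 0:
        raise ValueError(f"chunk_size must be a positive int, got {chunk_size!r}")
    if not isinstance(chunk_overlap, int) or not 0 <= chunk_overlap < chunk_size:
        # overlap >= chunk_size means `start = end - overlap` never advances
        raise ValueError(
            f"chunk_overlap must satisfy 0 <= overlap < chunk_size, got {chunk_overlap!r}"
        )
    if not isinstance(min_chunk_size, int) or min_chunk_size < 0:
        raise ValueError(f"min_chunk_size must be a non-negative int, got {min_chunk_size!r}")
```

Calling this once at the top of chunk_text() (or in the config properties) turns a silent hang into an immediate, explainable error.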

Comment thread mempalace/config.py
Comment on lines +200 to +209
@property
def chunk_size(self):
    """Characters per drawer chunk."""
    return self._file_config.get("chunk_size", 800)

@property
def chunk_overlap(self):
    """Overlap between adjacent chunks."""
    return self._file_config.get("chunk_overlap", 100)


Copilot AI Apr 19, 2026


The default values in MempalaceConfig.chunk_size/chunk_overlap/min_chunk_size are hardcoded literals (800/100/50), duplicating the constants in miner.py. This can drift if the module defaults ever change; consider defining shared DEFAULT_CHUNK_* constants (or importing the miner constants) so there is a single source of truth for defaults.

@jphein
Collaborator Author

jphein commented Apr 19, 2026

@sha2fiddy
Contributor

Same defaults are hardcoded in mempalace/convo_miner.py and aren't read from MempalaceConfig in this PR, so the convo-mining path (mempalace mine --mode convos, and the auto-save hook) will still use the hardcoded values regardless of config.json.

On upstream/develop:

  • CHUNK_SIZE = 800 at convo_miner.py:57, referenced at lines 141, 143, 147, 149, 150
  • MIN_CHUNK_SIZE = 30 at convo_miner.py:56, referenced at lines 144, 151, 176, 181, 423
  • CHUNK_OVERLAP isn't used in convo_miner.py — convo chunking slices on exchange boundaries with no overlap, so that key has no effect here.

Suggest plumbing chunk_size and min_chunk_size through the same way this PR handles miner.py: read them from MempalaceConfig() at the convo-mining entry point, thread them as parameters through chunk_convo_content / _chunk_by_exchange / _chunk_by_paragraph, and replace the module-level constant references in those functions with the parameters. The module-level constants can stay as fallback defaults if any are called directly outside the entry point.

Note the MIN_CHUNK_SIZE default differs: 30 in convo_miner.py vs 50 in miner.py / this PR. Probably intentional (convo exchanges are shorter), but worth confirming whether a single config key should apply to both or if they're semantically distinct.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 21, 2026
…#1023 merged, MemPalace#673 rebased

- Bump fork-ahead section header to "after v3.3.2"
- Strike MemPalace#11 (quarantine_stale_hnsw) and MemPalace#18 (PID file guard) as
  merged-into-upstream-via-v3.3.2, keep entries for traceability
- Add v3.3.2-shipped items to "Merged into upstream (post-v3.3.1)"
- Rebuild PR table: 10 merged / 7 open / 7 closed; add MemPalace#1024 row,
  reclassify MemPalace#681/MemPalace#1000/MemPalace#1023 as merged, note MemPalace#673 rebased 2026-04-21
- Annotate MemPalace#661 status with the GitHub review-state machine caveat
  (CHANGES_REQUESTED persists until reviewer dismisses, not owed)
- Bump test count 1063 → 1096 post-merge
Addresses @sha2fiddy's review on MemPalace#1024: convo_miner.py was ignoring
config.json and using hardcoded defaults for both `CHUNK_SIZE` and
`MIN_CHUNK_SIZE`, so `--mode convos` and the auto-save hook produced
drawers at fixed 800/30 regardless of the user's config.

`mine_convos()` now reads `MempalaceConfig().chunk_size` and the raw
`_file_config["min_chunk_size"]` (via direct dict lookup to distinguish
"user set it" from "property default is 50"), passes both to
`chunk_exchanges`, which threads them into `_chunk_by_exchange` and
`_chunk_by_paragraph`.

**Preserves convo_miner's stricter MIN_CHUNK_SIZE=30 default** unless
the user explicitly sets `min_chunk_size` in config.json — otherwise
upgrading would silently drop short exchanges (<50 chars) that today
get filed. The MempalaceConfig property still returns 50 for miner.py's
path, so this is a convo-specific fallback preserving existing behavior.

Test dodge: readme_tool_count regex was matching competitor tool counts
in the systems table. Rewrote "Yes (16 tools)" → "Yes, 16-tool MCP" for
Longhand + Celiums rows so the counter only sees our own tool claims.

891 tests pass full suite (105 in convo/miner/readme slice).
@jphein
Collaborator Author

jphein commented Apr 21, 2026

Fixed in 9e95456 — thanks again for the careful read.

  • chunk_exchanges, _chunk_by_exchange, and _chunk_by_paragraph now accept chunk_size and min_chunk_size kwargs and fall back to the module constants when unset (matching the miner.py pattern).
  • mine_convos() reads MempalaceConfig().chunk_size for the size and the raw _file_config["min_chunk_size"] for the minimum — the raw lookup distinguishes "user set it" from "property default is 50," which let me preserve convo_miner's stricter 30-char default when the user hasn't touched config. Upgrading users keep existing behavior; only explicit min_chunk_size in config.json changes anything.
  • Also updated the cfg_min_chunk_size file-length gate at mine_convos() so short conversations are handled consistently.

Full test suite still green (891 passed, 106 deselected benchmarks). Let me know if you'd prefer a different fallback strategy — e.g. aligning both miners to 50 as the unified default — happy to iterate.
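
The "user set it" vs. "property default" distinction described above reduces to one dict lookup. A simplified sketch, with `effective_min_chunk` as a hypothetical helper and the `_file_config` dict shape and 30-char convo default taken from the discussion above:

```python
CONVO_MIN_CHUNK_SIZE = 30  # convo_miner's stricter default

def effective_min_chunk(file_config):
    """Honor an explicit config value; otherwise keep the 30-char convo default.

    The MempalaceConfig property can't be used here because it folds the
    missing-key case into its own default of 50, which would silently drop
    short exchanges on upgrade.
    """
    user_min = file_config.get("min_chunk_size")
    return CONVO_MIN_CHUNK_SIZE if user_min is None else user_min

effective_min_chunk({})                      # → 30, upgrade-safe
effective_min_chunk({"min_chunk_size": 40})  # → 40, user opted in
```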

jphein added a commit to jphein/mempalace that referenced this pull request Apr 21, 2026
…rows where we already have a PR that addresses them

Sweep caught three adjacent issues I hadn't cross-referenced:
- MemPalace#854 (silent_save flag never read) — closed by MemPalace#673; added to row + PR body
- MemPalace#848 (remove drawers from a wing) — closed by MemPalace#1087; added to row + PR body
- MemPalace#390 (default chunk exceeds MiniLM token cap) — addressed by MemPalace#1024
  (configurable unblocks the user without changing default)

Posted explanatory comments on all three issues pointing users to the
relevant PRs. PR bodies updated via REST API (gh pr edit's GraphQL path
was failing on a deprecation warning around projectCards).
@qodo-ai-reviewer

Hi — conversation-exchange chunking can hang forever when chunk_size <= 0 because the remainder slicing loop never shrinks the string.

Severity: action required | Category: reliability

How to fix: Validate chunk_size > 0

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

_chunk_by_exchange() can infinite-loop when chunk_size <= 0. Because chunk_size is now configurable, this can hang mempalace mine --mode convos.

Issue Context

The code slices with remainder = remainder[chunk_size:] inside while remainder:; for chunk_size == 0, the slice is a no-op.

Fix Focus Areas

  • mempalace/convo_miner.py[98-174]
  • mempalace/convo_miner.py[399-408]
  • mempalace/config.py[200-213]

What to change

  • Ensure chunk_size is a positive integer before using it for slicing.
  • Centralize the validation (ideally in MempalaceConfig.chunk_size) so both miner.py and convo_miner.py share the guard.

Found by Qodo code review

jphein added a commit to jphein/mempalace that referenced this pull request Apr 22, 2026
…loop

Qodo review flagged on MemPalace#1024 (2026-04-22): ``_chunk_by_exchange`` can
loop forever when ``chunk_size <= 0`` because ``content[:0]`` returns
an empty string while ``content[0:]`` returns the whole input, so the
remainder never shrinks. Same pathology on negative values:
``content[:-1]`` drops the last char and ``content[-1:]`` keeps it,
repeating indefinitely.

Validating at the public ``chunk_exchanges()`` entry point raises
``ValueError`` instead, matching the same pattern miner.py's
``chunk_text`` uses for its ``chunk_overlap`` guard. Four new tests
in test_convo_miner_unit.py cover the rejection paths plus the
``min_chunk_size == 0`` legal case.

Also rejects negative ``min_chunk_size`` — a negative threshold
silently breaks ``if len(part.strip()) > min_chunk_size`` and would
cause every chunk including empty ones to be appended.

Full test_convo_miner_unit.py suite: 19 passed (4 new).
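
The slicing pathology described in the commit is easy to reproduce in isolation; it is plain Python slice semantics, no MemPalace code needed:

```python
remainder = "abcdef"
# chunk_size == 0: the head slice is empty and the tail slice is the whole
# string, so `remainder = remainder[0:]` never shrinks it.
assert remainder[:0] == ""
assert remainder[0:] == remainder
# chunk_size == -1: the head slice drops the last char, but the tail slice
# keeps it, so `remainder = remainder[-1:]` sticks at length 1 forever.
assert remainder[:-1] == "abcde"
assert remainder[-1:] == "f"
assert "f"[-1:] == "f"  # fixed point: a while-remainder loop never terminates
```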
@jphein
Collaborator Author

jphein commented Apr 22, 2026

@qodo-ai-reviewer thanks — fixed in ee5dd91. chunk_exchanges() now raises ValueError upfront when chunk_size <= 0, and for min_chunk_size < 0 while I was there. Four new tests in tests/test_convo_miner_unit.py::TestChunkExchanges cover the rejection paths plus the min_chunk_size == 0 legal case. Full convo_miner_unit suite: 19 passed.

Same guard shape as miner.py::chunk_text uses for its chunk_overlap check, so the API's consistent across both mining paths.

@jphein jphein requested a review from igorls as a code owner April 22, 2026 17:14