What feature would you like to see?
Status: Draft
Target: Codex CLI + IDE extension (shared session format)
0. Summary
Replace lossy "conversation summarization" compaction with a deterministic, host-generated checkpoint.
Key idea:
- The Codex CLI already writes a structured per-session event log:
$CODEX_HOME/sessions/YYYY/MM/DD/rollout-*.jsonl.
- We add a tiny derived projection: checkpoint_v1.json.
- Compaction becomes a local operation: reset context + inject view_v1(checkpoint, caps).
No LLM is required to produce the checkpoint.
1. Motivation / Problem
After compaction (manual or auto), Codex often:
- re-reads the same files,
- re-derives already known facts,
- loses awareness of recent edits or task pointer.
Users report that auto-compaction can reset the model's working state, leaving only a lossy "memento" summary instead of tool-call history and concrete actions.
This wastes tokens/time and degrades UX.
Related reports:
- "memento summary instead of full tool call history" and the model forgetting edits mid-task: Auto compaction causes GPT-5-Codex to lose the plot. It forgets it is mid-task, forgets it has edited files and stops. #5957
- compaction fails or does not help: /compact does not work. #4813, /compact doesn't correctly optimize the context (v0.63.0) #7232
- compaction loop / hangs: Context compaction loop leaves ~5% context and stalls session (eventually 0% / interrupted) #8365, Codex agent is stuck in compaction loop #8481, Context compaction stalls and the session hangs after large tool outputs (codex-cli 0.74.0) #8402
- resume loses task intent; request for explicit task pointer + resume checkpoint: Bug report: Session resume after rate limit loses task intent and continues on wrong context #8310
2. Goals
- G1. Make compaction a "state checkpoint", not a "narrative summary".
- G2. Compaction must succeed even when context is full (no extra model call required).
- G3. Deterministic + testable: given the same inputs, checkpoint and view bytes are identical.
- G4. Dramatically reduce redundant file re-reads after compaction when artifacts are unchanged.
- G5. No silent wrongness: stale derived facts must become SUSPECT automatically.
- G6. Keep it bounded and cheap: stable caps, stable formatting.
3. Non-Goals
- N1. Cross-session long-term memory / personal preferences.
- N2. Storing "agent behavior policy" persistently (prompt-injection risk).
- N3. Semantic search / retrieval ranking inside the checkpoint (can be an optional later layer).
- N4. A new logging system: we reuse rollout JSONL.
4. Architecture (high-level)
Existing:
rollout-*.jsonl (already produced by CLI)
New:
checkpoint_v1.json = deterministic projection of rollout + small validated semantic updates
Used at runtime:
view_v1(checkpoint, caps) -> injected as a "SESSION_CHECKPOINT v1" message after compaction/resume
ASCII diagram:
rollout.jsonl (events) ──► reduce() ──► checkpoint_v1.json ──► view_v1(caps) ──► LLM context
^ already exists ^ new, tiny ^ stable text
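The reduce() step above can be sketched as a pure fold over rollout events. This is an illustrative shape only: the Event variants and Checkpoint fields here are simplified placeholders, not the final schema.

```typescript
// Simplified rollout event variants (illustrative, not the real rollout schema).
type Event =
  | { type: "user_message"; seq: number; text: string }
  | { type: "artifact_observed"; seq: number; uri: string; hash?: string };

// Simplified checkpoint: just the last-applied seq, task pointer, and artifacts.
type Checkpoint = {
  seq: number;
  task: string | null;
  artifacts: Record<string, { uri: string; hash?: string; lastObservedSeq: number }>;
};

// Pure fold: same events in, same checkpoint out. No LLM call anywhere.
function reduce(events: Event[]): Checkpoint {
  const cp: Checkpoint = { seq: 0, task: null, artifacts: {} };
  for (const ev of events) {
    cp.seq = ev.seq;
    if (ev.type === "user_message") {
      cp.task = ev.text; // task pointer = last user message
    } else {
      cp.artifacts[ev.uri] = { uri: ev.uri, hash: ev.hash, lastObservedSeq: ev.seq };
    }
  }
  return cp;
}
```

Because reduce() is deterministic, the checkpoint can always be rebuilt from the rollout if the derived file is lost.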
5. Data Model (Checkpoint v1)
We keep a single JSON file as the "single source of truth" after compaction.
5.1 Types
type FactStatus = "VALID" | "SUSPECT"
type Artifact = {
uri: string // e.g. "src/auth.py"
kind: "file" | "tool_output" | "command"
hash?: string // if available
lastObservedSeq: number
}
type EvidenceRef = {
source: "user" | "file" | "tool_output"
ref: string // message id, file path, tool call id
hash?: string
}
type FactRecord = {
value: string
evidence: EvidenceRef
dependsOn: Array<{ uri: string, hash?: string }>
status: FactStatus
}
type DecisionRecord = {
decisionId: string
topic?: string
decision: string
rationale: string
supersedes?: string
evidence: EvidenceRef
}
type Plan = {
steps: Array<{ id: string, text: string }>
done: Record<string, boolean>
evidence?: EvidenceRef
}
type CheckpointV1 = {
schemaVersion: 1
seq: number // last applied seq
task: { text: string, evidence: EvidenceRef } | null
plan: Plan
decisions: DecisionRecord[]
artifacts: Record<string, Artifact> // key = uri
facts: Record<string, FactRecord>
recentArtifacts: string[] // URIs, stable bounded list
}
5.2 Boundedness
Hard limits (example defaults):
- maxFactsTotal = 64
- maxDecisionsTotal = 32
- maxPlanStepsTotal = 32
- maxRecentArtifacts = 16
- maxValueChars = 160 (truncate deterministically)
Eviction is deterministic:
- facts: evict oldest by lastTouchedSeq, tie-break by key (lexicographic)
- decisions: evict oldest by seq
- recentArtifacts: keep most recent unique URIs, bounded
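A minimal sketch of the deterministic fact-eviction rule. The FactEntry shape (a key plus a lastTouchedSeq) is an assumption for illustration; the comparator avoids localeCompare so ordering is identical across platforms.

```typescript
// Hypothetical fact entry carrying the eviction bookkeeping field.
type FactEntry = { key: string; lastTouchedSeq: number };

// Keep the most recently touched entries up to maxFactsTotal;
// ties on lastTouchedSeq are broken by lexicographic key so the
// result is byte-for-byte reproducible (no locale-dependent compare).
function evictFacts(facts: FactEntry[], maxFactsTotal: number): FactEntry[] {
  return [...facts]
    .sort(
      (a, b) =>
        b.lastTouchedSeq - a.lastTouchedSeq ||
        (a.key < b.key ? -1 : a.key > b.key ? 1 : 0),
    )
    .slice(0, maxFactsTotal);
}
```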
6. Event Sources (no new logging layer)
We reuse existing rollout JSONL (already written by CLI).
We derive:
- artifact observations from tool calls (host truth)
- task pointer from last user message
Optionally, we accept validated semantic updates from the model (facts/plan/decisions) via a strict JSON tool.
6.1 Host-owned updates (always)
From tool calls / file ops:
- artifact observed: {uri, kind, hash?, seq}
- update recentArtifacts
Critical rule: hashes are computed/recorded by the host, not by the model.
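A sketch of a host-side observation helper, assuming Node's built-in crypto module and SHA-256 content hashing (the actual hashing strategy is Open Question Q2). The point is that the hash comes from the host reading bytes, never from model output.

```typescript
import { createHash } from "node:crypto";

type Artifact = {
  uri: string;
  kind: "file" | "tool_output" | "command";
  hash?: string;
  lastObservedSeq: number;
};

// Record a file observation: the host hashes the bytes it actually read,
// so the model cannot forge or drift the recorded hash.
function observeFile(
  artifacts: Record<string, Artifact>,
  uri: string,
  contents: Buffer,
  seq: number,
): void {
  const hash = createHash("sha256").update(contents).digest("hex");
  artifacts[uri] = { uri, kind: "file", hash, lastObservedSeq: seq };
}
```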
6.2 Model-proposed semantic updates (optional MVP, but recommended)
Provide a tool:
memory.apply(payload: { kind: "plan"|"decision"|"fact", ... })
The reducer validates:
- must include evidence that references an existing artifact/tool output id or user message id
- must include dependsOn for facts (URIs), when applicable
- cannot write "behavior policies" (see Security)
If invalid: reject tool call (no state update).
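The validation rules above can be sketched as follows. The ApplyPayload shape and the knownRefs set (ids of artifacts/tool outputs/user messages already in the log) are illustrative assumptions.

```typescript
// Hypothetical memory.apply payload; fields simplified for illustration.
type ApplyPayload = {
  kind: "plan" | "decision" | "fact";
  evidence?: { source: "user" | "file" | "tool_output"; ref: string };
  dependsOn?: Array<{ uri: string }>;
  value?: string;
};

// Validate a model-proposed update; an invalid payload is rejected
// outright and produces no state change.
function validateApply(
  payload: ApplyPayload,
  knownRefs: Set<string>,
): { ok: true } | { ok: false; reason: string } {
  if (!payload.evidence) return { ok: false, reason: "missing evidence" };
  if (!knownRefs.has(payload.evidence.ref))
    return { ok: false, reason: "evidence ref not found in log" };
  if (payload.kind === "fact" && !payload.dependsOn?.length)
    return { ok: false, reason: "fact without dependsOn" };
  return { ok: true };
}
```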
7. Staleness Derivation (no invalidate event)
Fact.status is derived:
A fact is VALID iff, for every dep in fact.dependsOn:
- checkpoint.artifacts[dep.uri].hash exists,
- and it matches dep.hash (when dep.hash exists).
Otherwise SUSPECT.
If the current artifact hash is unknown → SUSPECT.
This guarantees no silent wrongness.
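The derivation rule can be expressed directly (types trimmed to the fields the rule reads; this is a sketch, not the final implementation):

```typescript
type Artifact = { uri: string; hash?: string };
type FactRecord = { value: string; dependsOn: Array<{ uri: string; hash?: string }> };

// A fact is VALID iff every dependency's current artifact hash is known
// and matches the recorded dep.hash (when one was recorded).
// Unknown current hash => SUSPECT, never silently VALID.
function deriveFactStatus(
  fact: FactRecord,
  artifacts: Record<string, Artifact>,
): "VALID" | "SUSPECT" {
  for (const dep of fact.dependsOn) {
    const current = artifacts[dep.uri]?.hash;
    if (current === undefined) return "SUSPECT";
    if (dep.hash !== undefined && dep.hash !== current) return "SUSPECT";
  }
  return "VALID";
}
```

Because status is derived on read, no explicit "invalidate" event is ever needed in the log.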
8. view_v1(checkpoint, caps) — deterministic, jitter-free
view_v1 must be:
- pure
- byte-for-byte stable for the same (checkpoint, caps)
- no token counting, no heuristics, no randomness
8.1 ViewCaps
type ViewCaps = {
maxOpenPlanSteps: number
maxDonePlanSteps: number
maxDecisions: number
maxFactsValid: number
maxFactsSuspect: number
maxRecentArtifacts: number
maxValueChars: number
}
8.2 Output format (stable)
[SESSION_CHECKPOINT v1]
[TASK]
- ...
[PLAN]
- [ ] ... (id=...)
- [x] ... (id=...)
[RECENT_ARTIFACTS]
- file: src/auth.py (hash=...)
- cmd: pytest -q
- file: tests/test_auth.py (hash=...)
[DECISIONS]
- ... — ... (id=... supersedes=... evidence=source:ref)
[FACTS_VALID]
- key: value (evidence=source:ref deps=n)
[FACTS_SUSPECT]
- key: value (why=SUSPECT dep=uri)
Selection ordering rules are fixed (facts: lexicographic by key; plan: stable step order; decisions: latest non-superseded).
Truncation is deterministic:
- if value.length > maxValueChars, keep the first (maxValueChars - 1) characters and append "…"
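The truncation rule as code (treating maxValueChars as a count of UTF-16 code units, which is an assumption this sketch makes):

```typescript
// Deterministic truncation: output never exceeds maxValueChars, and the
// same input always yields the same bytes (no token counting, no heuristics).
function truncateValue(value: string, maxValueChars: number): string {
  if (value.length <= maxValueChars) return value;
  return value.slice(0, maxValueChars - 1) + "…";
}
```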
9. Compaction Behavior Change
Current compaction tries to summarize the conversation.
New behavior:
- Persist checkpoint_v1.json (projection).
- Start a fresh model context and inject:
  - standard base instructions / AGENTS.md (unchanged)
  - a fixed short "how to use checkpoint" instruction
  - view_v1(checkpoint, caps)
No LLM call is required to generate the checkpoint.
This also makes /compact usable even at 100% context usage.
10. Resume Behavior
On resume (from rollout JSONL):
- load checkpoint_v1.json if present (or rebuild it from the rollout)
- inject view_v1 similarly
- bias toward completing open plan steps
11. Security / Integrity
- S1. Actor constraints:
- Only "user/system" can set task pointer.
- Only host can set artifact hashes/observations.
- S2. Facts/decisions are NOT "behavior policies":
- Reducer rejects updates that attempt to store normative "always do X" agent policies.
- S3. Evidence gating:
- facts derived from file/tool output must reference stable anchors (path/tool-call id).
- S4. Avoid persistent injection:
- tool outputs are never treated as instructions; they are only evidence.
12. Metrics / Acceptance Criteria
- A1. After compaction, unchanged artifacts do not get re-read purely to "remember what we learned".
- A2. Compaction always reduces context usage to a predictable bounded range.
- A3. Staleness flips facts to SUSPECT automatically on hash mismatch/unknown hash.
- A4. view_v1 is deterministic and unit-tested (golden tests).
- A5. Checkpoint is bounded on disk and stable across platforms.
13. Implementation Plan (incremental)
Phase 0 (1 PR):
- Write the checkpoint_v1.json projection:
  - task pointer from last user message
  - artifacts + recentArtifacts from tool calls / file ops
- Implement view_v1 + injection on compaction/resume
- No model semantic updates yet
Phase 1:
- Add the memory.apply tool for plan/decision/fact updates with strict validation
- Add staleness derivation + SUSPECT rendering
Phase 2:
- Drift guardrails (optional): if next action edits unrelated file, ask confirmation
- UX: /checkpoint show (optional)
14. Open Questions
- Q1. Where to store checkpoint_v1.json (same folder as the rollout? recommended).
- Q2. Hashing strategy: cheap vs strong (git blob hash, content hash, or mtime+size fallback).
- Q3. Default caps per model/tool-output limits.
Related Issues
- Support Task-Scoped Context with Session-Level Code Container to Reduce Redundant Token Usage #6102 — task-scoped context containers
- Control over auto-compaction parameters #4106 — control over auto-compaction parameters
- Proposal: Implement intelligent context selection for improved code generation #668 — intelligent context selection (closed, but related)
- Auto compaction causes GPT-5-Codex to lose the plot. It forgets it is mid-task, forgets it has edited files and stops. #5957 — memento summary loses tool call history
- /compact does not work. #4813, /compact doesn't correctly optimize the context (v0.63.0) #7232 — compaction fails or doesn't help
- Context compaction loop leaves ~5% context and stalls session (eventually 0% / interrupted) #8365, Codex agent is stuck in compaction loop #8481, Context compaction stalls and the session hangs after large tool outputs (codex-cli 0.74.0) #8402 — compaction loop/hangs
- Bug report: Session resume after rate limit loses task intent and continues on wrong context #8310 — resume loses task intent
Additional information
I'm willing to implement Phase 0 (checkpoint_v1.json projection + view_v1 + inject on compaction/resume) as a single PR. Can deliver working code with golden tests in a few days. Happy to iterate based on maintainer feedback.