[pull] main from openai:main#58
pull[bot] wants to merge 3510 commits into
## Summary
- Restore yielded output when an observation receiver disappears before delivery.
- Preserve pending-frontier output and tool IDs across failed delivery.
- Add dropped-observer coverage for yield and pending observations.
## Why
Canceling a wait must not consume output or a pending frontier that the caller never received.
## Impact
A later observation can recover undelivered incremental output without duplication.
## Validation
- Stack-tip validation: `just test -p codex-code-mode -p codex-code-mode-protocol` (70 passed).
- Parent branch: `cconger/code-mode-runtime-compact-03e-shutdown-hierarchy`.
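The restore-on-failed-delivery behavior can be illustrated with a minimal sketch. It assumes a standard-library `mpsc` channel stands in for the observation receiver; `OutputBuffer`, `pending`, and `deliver` are hypothetical names for illustration and are not taken from the `codex-code-mode` crate.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical stand-in for a cell that buffers incremental output.
struct OutputBuffer {
    pending: Vec<String>,
}

impl OutputBuffer {
    /// Try to hand all pending output to an observer.
    /// If the observer's receiver has been dropped, put the output back
    /// so that a later observation can still recover it.
    fn deliver(&mut self, observer: &Sender<Vec<String>>) -> bool {
        let batch = std::mem::take(&mut self.pending);
        match observer.send(batch) {
            Ok(()) => true,
            // `SendError` hands back ownership of the value that was not sent.
            Err(err) => {
                self.pending = err.0;
                false
            }
        }
    }
}
```

Because the failed send returns the batch itself, nothing is cloned and nothing is duplicated on the next successful delivery.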
## Summary
- Retain the first pre-observation `yield_control()` boundary when a cell completes before observation.
- Deliver the preserved yield before the buffered completion.
- Keep later unattached yields as no-ops.
## Why
Create followed by the initial wait must preserve the former execute response boundary even when the script runs to completion first.
## Impact
The first wait observes the same initial yield boundary as before create and observe were decoupled.
## Validation
- Focused initial-yield signature regression passed.
- Stack-tip validation: `just test -p codex-code-mode -p codex-code-mode-protocol` (70 passed).
- Parent branch: `cconger/code-mode-runtime-compact-03e2-observation-delivery`.
## Summary
- add a default-on `auto_compaction` feature flag as an internal escape hatch
- skip pre-turn, model-switch/hash, and mid-turn automatic compaction when the flag is disabled
- preserve manual `/compact` behavior and surface the existing context-window error when the provider runs out of room
- add integration coverage for disabled pre-turn and mid-turn compaction
## Motivation
Long-running SPO optimization rollouts need the option to preserve their full context and fail on context exhaustion instead of entering another compaction window.
This deliberately uses the existing feature-flag mechanism rather than adding a dedicated public config or app-server API. Disable it with:
```sh
codex --disable auto_compaction
```
## Testing
- `just test -p codex-features`: 51 passed
- `just test -p codex-core auto_compaction_feature_disabled`: 2 passed
- `just fix -p codex-core -p codex-features`
- `just write-config-schema`
- `just test -p codex-core`: the new compaction tests passed; the overall local run had 54 unrelated environment failures, primarily missing first-party test binaries and shell-snapshot timeouts
Responses API safety buffering metadata currently stops at the transport boundary, so app-server clients cannot render the in-progress safety review state.
This change:
- decodes and deduplicates `safety_buffering` metadata from Responses API SSE and WebSocket events without suppressing the original response event
- emits a typed core event containing the requested model plus backend use cases and reasons
- forwards that event as `turn/safetyBuffering/updated` through app-server v2 and updates generated protocol schemas
- keeps the side-channel event out of persisted rollouts and turn timing

This supports the Codex Apps buffering UX and depends on the Responses API backend work in openai/openai#1044569 and openai/openai#1044571.
Validation:
- focused `codex-core` safety-buffering integration test passes
- `cargo check -p codex-core -p codex-app-server -p codex-app-server-protocol`
- `just fix -p codex-api -p codex-protocol -p codex-core -p codex-app-server-protocol -p codex-app-server -p codex-rollout -p codex-rollout-trace -p codex-otel`
- `just fmt`
- broad package test run: 4,430/4,492 passed; 62 unrelated local-environment/concurrency failures involved unavailable test binaries, MCP subprocess setup, and app-server timeouts
## Summary
- read the `AutoCompaction` feature flag through `TurnContext::config`
- fix both the mid-turn and pre-sampling compaction checks
## Why
#28260 was validated against an older base where `TurnContext` exposed a direct `features` field. It was then merged after that field had moved under `config`, leaving the merge result unable to compile with `E0609` on `turn_context.features`.
This restores compilation for Bazel, SDK, and argument-comment-lint jobs that build `codex-core`. Behavior is unchanged: disabling `auto_compaction` still skips automatic compaction.
## Validation
- `just fmt`
- `CODEX_HOME=/private/tmp/codex-fix-auto-compaction-test-home just test -p codex-core auto_compaction_feature_disabled`: 4 passed
- `just test -p codex-core`: `codex-core` compiled; 2,722 passed and 89 unrelated local-environment failures remained because the sandbox could not write the default Codex SQLite/proxy paths and some first-party test binaries were unavailable
## Summary
A cold-resumed subagent kept its durable thread ID but could receive a
new session ID, splitting one agent tree across multiple sessions after
a restart.
Persist the root session ID in every rollout `SessionMeta`, carry it
through thread creation, and restore it before initializing the resumed
`Session` and `AgentControl`.
## Behavior
For a nested agent tree:
```text
root session R
parent thread P
child thread C
```
The child rollout stores:
```text
session_id: R
parent_thread_id: P
id: C
```
After a cold resume, the child still belongs to root session `R` while
its immediate parent remains `P`. The integration coverage uses distinct
values for all three IDs so it catches restoring the session from
`parent_thread_id`.
## Legacy rollouts
Previous rollouts have `id` but no `session_id`. `SessionMetaLine`
deserialization treats a missing `session_id` as `id`, keeping those
files readable, listable, and resumable. When a legacy subagent is
resumed through its root, that synthesized child ID no longer overrides
the inherited root-scoped `AgentControl`. New rollouts always persist
the explicit root session ID.
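The legacy fallback can be sketched as follows. This is a simplified illustration, not the real `SessionMetaLine` deserializer: `RawSessionMeta` and `resolve_session_id` are hypothetical names, and the actual code presumably applies the default during deserialization rather than in a separate function.

```rust
/// Hypothetical, simplified view of the IDs a rollout stores.
#[derive(Debug, Clone, PartialEq)]
struct RawSessionMeta {
    id: String,
    parent_thread_id: Option<String>,
    /// Absent in legacy rollouts.
    session_id: Option<String>,
}

/// Resolve the root session ID: the explicit value when present,
/// otherwise the thread's own `id` (the legacy behavior).
/// `parent_thread_id` is deliberately never consulted.
fn resolve_session_id(meta: &RawSessionMeta) -> &str {
    meta.session_id.as_deref().unwrap_or(meta.id.as_str())
}
```

Using distinct values for all three IDs, as the integration coverage does, makes it obvious if the resolver ever falls back to the parent thread instead of the root session.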
## Why
Multi-agent delegation policy was split across `multiAgentMode`, `features.multi_agent_mode`, and `usage_hint_enabled`. These controls could disagree: a requested mode could be downgraded by the feature flag, and disabling usage hints also disabled mode instructions.
Some clients also need multi-agent tools without adding delegation-policy text to model context. The previous two-mode API could not express that directly.
## What changed
`multiAgentMode` is now the only live delegation-policy control:

| Mode | Behavior |
| --- | --- |
| `none` | Keep multi-agent tools available without adding mode instructions. |
| `explicitRequestOnly` | Only delegate after an explicit user request. |
| `proactive` | Delegate when parallel work materially improves speed or quality. |

- new threads default to `explicitRequestOnly`; omitting the mode on later turns keeps the current value
- thread start, resume, fork, and settings responses always report the concrete current mode instead of `null`
- mode selection remains sticky across turns and resume
- usage-hint text no longer controls whether mode instructions apply
- `features.multi_agent_mode` and `usage_hint_enabled` remain accepted as ignored compatibility settings so existing configs continue to load
- app-server documentation and generated schemas describe the three-mode API
## Tests
- `just test -p codex-core multi_agent_mode`
- `just test -p codex-core multi_agent_v2_config_from_feature_table`
- `just test -p codex-core spawn_agent_description`
- `just test -p codex-features`
- `just test -p codex-app-server-protocol`
- `just test -p codex-app-server multi_agent_mode`
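The default and sticky-mode rules described above can be sketched in a few lines. The enum mirrors the three wire values, but the type and function names (`MultiAgentMode`, `initial_mode`, `next_mode`) are illustrative assumptions and not the actual `codex-core` API.

```rust
/// Illustrative mirror of the three `multiAgentMode` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MultiAgentMode {
    None,
    ExplicitRequestOnly,
    Proactive,
}

/// A new thread uses the requested mode, or `ExplicitRequestOnly` by default.
fn initial_mode(requested: Option<MultiAgentMode>) -> MultiAgentMode {
    requested.unwrap_or(MultiAgentMode::ExplicitRequestOnly)
}

/// A later turn that omits the mode keeps the current value (sticky selection).
fn next_mode(current: MultiAgentMode, requested: Option<MultiAgentMode>) -> MultiAgentMode {
    requested.unwrap_or(current)
}
```

Because both functions return a concrete `MultiAgentMode` and never an `Option`, responses built from them can always report the current mode instead of `null`.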
## Why
PR #29108 lets the orchestrator send sandbox intent with `process/start` without wrapping the command for its own operating system. This PR completes that boundary by making the executor interpret and enforce the intent using its own filesystem paths and sandbox implementation.
For example, a macOS TUI targeting a Linux devbox sends `/bin/bash -lc pwd`. The Linux executor turns that into its own `codex-linux-sandbox ... /bin/bash -lc pwd` launch.
## What changes
- Keep `process/start` unchanged when no sandbox intent is present.
- Convert sandbox `PathUri` values into native paths on the executor.
- Bind symbolic `:workspace_roots` permissions to the executor's native sandbox cwd.
- Select the sandbox implementation on the executor and wrap the original command immediately before spawning it.
- Reject sandbox-required execution before spawning when the executor cannot enforce the intent.
- Pass exec-server runtime paths into process creation so Linux can locate `codex-linux-sandbox`.

The boundary is therefore:
```text
orchestrator                         executor
original argv + sandbox intent  ->   select and enforce local sandbox
```
This PR intentionally treats a denied remote command as an ordinary command failure. Draft follow-up #29424 carries a semantic `sandboxDenied` result back to unified exec for the existing approval and retry flow.
## Platform scope
Linux and macOS use their existing direct-spawn sandbox transforms.
Windows sandboxed remote process launch is intentionally unsupported in this PR. The current Windows direct-spawn wrapper does not correctly preserve arbitrary argv, TTY behavior, or pass the full child environment out of band. The executor rejects the request instead of running it incorrectly or unsandboxed.
## Known follow-ups
- The transported permission profile can still contain orchestrator-materialized helper or explicit paths. A `TODO(jif)` marks where the executor boundary should receive pre-host-materialization permission intent.
- The sandbox wrapper currently replaces a requested custom inner `arg0`. A `TODO(jif)` marks where this must be preserved or rejected explicitly.
- Draft PR #29424 contains the deferred sandbox-denial classification and approval/retry behavior.
## Rollout assumption
This executor-sandbox stack is unreleased and its client and executor are expected to move together. This PR does not add mixed-version negotiation with older exec servers.
## Summary
- Add backend-client types and fetch support for active workspace messages.
- Add the app-server v2 `account/workspaceMessages/read` method, generated schemas, and README documentation.
- Delegate workspace-message eligibility to the Codex backend feature gate; map a backend 404 to `featureEnabled: false`.
## Testing
- `just write-app-server-schema`
- `just test -p codex-backend-client`
- `just test -p codex-app-server-protocol`
- `just test -p codex-app-server workspace_messages`
- `just fix -p codex-backend-client -p codex-app-server-protocol -p codex-app-server`
- `just fmt`
## Stack
- Base PR for #28232, which adds the TUI status-line integration.
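The 404 mapping can be sketched roughly as below. Only the rule that a backend 404 becomes `featureEnabled: false` comes from the description; the response shape, the treatment of status 200, and the names `WorkspaceMessagesResponse` and `map_backend_result` are assumptions made for illustration.

```rust
/// Hypothetical shape of the app-server response.
#[derive(Debug, PartialEq)]
struct WorkspaceMessagesResponse {
    feature_enabled: bool,
    messages: Vec<String>,
}

/// Map a backend HTTP status to the app-server result.
/// Assumption: 200 means the feature gate is open; 404 means it is closed.
fn map_backend_result(
    status: u16,
    messages: Vec<String>,
) -> Result<WorkspaceMessagesResponse, String> {
    match status {
        200 => Ok(WorkspaceMessagesResponse { feature_enabled: true, messages }),
        // A closed feature gate is a normal answer, not an error.
        404 => Ok(WorkspaceMessagesResponse { feature_enabled: false, messages: Vec::new() }),
        other => Err(format!("unexpected backend status {other}")),
    }
}
```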
## Why
Every successful Responses WebSocket event currently produces three local log records: the full payload at TRACE, an OpenTelemetry log event, and an OpenTelemetry trace event. On busy threads these records fill the 1,000-row log partition in seconds and cause continuous SQLite insert-and-prune churn.
Related to https://openai.slack.com/archives/C095U48JNL9/p1782128972644209
## What changed
- Stop logging each successful Responses WebSocket payload at TRACE.
- Stop emitting `codex.websocket_event` as OpenTelemetry log and trace events.
- Keep WebSocket event counters, duration metrics, response timing metrics, parsing, and error handling.
## Why
Nonblocking environment snapshots allow a turn to reach the model while a remote environment is still starting. The initial context can describe that environment as still loading, but nothing currently refreshes the model-visible environment context when startup finishes during the same turn.
This adds the first request-scoped reconciliation slice on top of #28683. It is gated by `DeferredExecutor` and intentionally updates only model-visible environment context; tools and other environment-derived state will migrate separately.
## What
- Add a minimal `StepContext` containing the environment snapshot captured before each sampling request.
- Render attached environments with their resolved shell and starting environments with `still loading`.
- Track the latest environment state recorded in model history and append a bounded update only when it changes.
- Seed that baseline from full initial context so ready-at-start environments are not duplicated.
- Clear the in-memory baseline when history is rewritten so replacement history can be refreshed safely.
## Testing
- `just test -p codex-core deferred_executor`
- `just test -p codex-core environment_context_baseline_deduplicates_until_history_is_replaced`

The integration coverage verifies that a pending environment reaches the first request, the ready state reaches the next request, later requests do not duplicate it, and ready-at-start environments remain single-injected.
<details>
<summary>Live verification</summary>

- Connected to a real remote executor with startup deliberately delayed and forced three sampling requests in one turn.
- Inspected the raw model inputs: request 1 showed the remote environment as `still loading`, request 2 appended its ready shell and cwd, and request 3 contained no duplicate ready update.
- With the feature disabled, startup waited for the delayed executor and the first request contained only the ready environment.
- With a synchronously ready environment and the feature enabled, the first request contained one environment context with no duplicate.
- Executed `pwd` and read a marker file through the remote process runner; the command exited successfully and returned the remote cwd and marker contents.
</details>
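The baseline tracking described in this PR can be approximated by the sketch below. It is a simplified model under stated assumptions: `EnvState`, `EnvBaseline`, and the method names are hypothetical, and the real `StepContext` carries a full environment snapshot rather than a single enum value.

```rust
/// Hypothetical model-visible environment state.
#[derive(Debug, Clone, PartialEq)]
enum EnvState {
    StillLoading,
    Ready { shell: String, cwd: String },
}

/// Tracks the latest environment state already recorded in model history.
#[derive(Default)]
struct EnvBaseline {
    last_recorded: Option<EnvState>,
}

impl EnvBaseline {
    /// Seed from the full initial context so a ready-at-start
    /// environment is not injected a second time.
    fn seeded(initial: EnvState) -> Self {
        Self { last_recorded: Some(initial) }
    }

    /// Called before each sampling request. Returns the update to append,
    /// or `None` when the snapshot matches what history already contains.
    fn reconcile(&mut self, snapshot: &EnvState) -> Option<EnvState> {
        if self.last_recorded.as_ref() == Some(snapshot) {
            return None;
        }
        self.last_recorded = Some(snapshot.clone());
        Some(snapshot.clone())
    }

    /// Called when history is rewritten, so the next request refreshes it.
    fn clear(&mut self) {
        self.last_recorded = None;
    }
}
```

With this shape, the three-request live verification corresponds to `reconcile` returning the loading state, then the ready state, then `None`.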
## Description
Restore `thread_source` in `x-codex-turn-metadata`.
#27122 inadvertently removed `thread_source` from `x-codex-turn-metadata`; I didn't realize it was a top-level thread app-server API field rather than a value passed in `responsesapi_client_metadata`. This also reserves the key so `responsesapi_client_metadata` cannot override it.
## Why
The local SQLite log sink currently enables TRACE for every target. This persists high-volume dependency logs bridged through `target=log` and duplicates OpenTelemetry mirror events in `codex_otel.log_only` and `codex_otel.trace_safe`.
These records rapidly consume the per-partition log budget and cause unnecessary SQLite insert-and-prune churn.
## What changed
- Keep TRACE persistence for other targets.
- Exclude bridged `target=log` events from the SQLite sink.
- Exclude the two `codex_otel` mirror targets from the SQLite sink.
- Share the same filter between app-server and TUI.

Remote OpenTelemetry export and metrics are unchanged.
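The target-based exclusion can be pictured as a small predicate. This is only a sketch: the three target names come from the description, but exact-match comparison, the constant name, and `persist_to_sqlite` are assumptions, and the real filter is presumably expressed through the logging framework's own filter types.

```rust
/// Targets the description says are excluded from the SQLite sink.
const EXCLUDED_TARGETS: [&str; 3] = [
    "log",                   // dependency logs bridged from the `log` crate
    "codex_otel.log_only",   // OpenTelemetry mirror events
    "codex_otel.trace_safe", // OpenTelemetry mirror events
];

/// Returns true when an event with this target should still be persisted.
/// Assumption: targets are compared by exact match, not by prefix.
fn persist_to_sqlite(target: &str) -> bool {
    !EXCLUDED_TARGETS.contains(&target)
}
```

Keeping the predicate in one shared place is what allows the app-server and the TUI to apply the same filter.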
## What
- make Fjord's centralized response-item image preparation unconditional for new and resumed history
- have local user images and `view_image` outputs always defer decoding and resizing to that path
- retain `resize_all_images` as an ignored, removed compatibility key for released clients
- delete the flag-off producer paths and obsolete policy-specific tests
## Why
Centralized preparation is now the intended image path. Keeping the runtime feature checks also kept two image-processing implementations alive and allowed client config to select the legacy behavior.
This is a clean replacement for #28975, rebuilt from the latest `main`.
## How
`prepare_response_items` now runs whenever items enter history and whenever persisted history is reconstructed. Producers emit deferred image data, so malformed images become the existing model-visible placeholder instead of failing the session at the producer.
## Test plan
- `just fmt`
- `just fix -p codex-core -p codex-features`
- `just test -p codex-features`: 52 passed
- focused affected `codex-core` set: 20 passed
- `just test -p codex-core handle_accepts_explicit_high_detail`: 1 passed
- full `just test -p codex-core` attempt: 2,723 passed; 88 unrelated environment failures from read-only `~/.codex` SQLite state and unavailable integration helper binaries
The custom Windows argument-comment-lint job was temporarily moved to `windows-2022` in #28940 after hermetic LLVM source extraction failed on the newer runner. This takes the upstream extraction fix so the job can return to the intended custom runner.
This upgrades `llvm` to `0.7.9` and `rules_cc` to `0.2.18`, refreshes the module lock, rebases the remaining Windows and custom libc++ patches, drops the obsolete symlink-extraction workaround, and restores the `windows-x64` runner configuration.
Validation:
- Verified all LLVM patches apply cleanly against the `0.7.9` source.
- Built `@llvm-project//compiler-rt:clang_rt.builtins.static`.
This PR moves construction of `PluginTelemetryMetadata` from loader and
model helpers into `PluginsManager`, which already owns installed plugin
state and will eventually perform remote identity enrichment. The
metadata type remains in `codex-plugin`, and serialized analytics events
remain unchanged.
## Before
```mermaid
flowchart LR
subgraph Events["Analytics event paths"]
direction TB
Lifecycle["Local install / uninstall"]
Config["Enable / disable"]
Remote["Remote install"]
Used["Plugin used"]
end
subgraph Construction["Metadata construction"]
direction TB
Loader["Loader telemetry helpers"]
Summary["PluginCapabilitySummary::telemetry_metadata"]
Override["Caller adds remote_plugin_id"]
end
Metadata["PluginTelemetryMetadata"]
Lifecycle --> Loader
Config --> Loader
Remote --> Loader
Loader -->|"local events"| Metadata
Loader -->|"remote install"| Override
Override --> Metadata
Used --> Summary
Summary --> Metadata
```
Telemetry metadata was constructed through loader helpers, a
capability-summary method, and a remote-install call-site override.
## After
```mermaid
flowchart LR
subgraph Events["Analytics event paths"]
direction TB
Lifecycle["Local install / uninstall"]
Config["Enable / disable"]
Remote["Remote install"]
Used["Plugin used"]
end
Manager["PluginsManager — single construction owner"]
Metadata["PluginTelemetryMetadata"]
Lifecycle --> Manager
Config --> Manager
Remote -->|"authoritative remote ID"| Manager
Used -->|"capability summary"| Manager
Manager --> Metadata
```
Every analytics path delegates metadata construction to
`PluginsManager`. Remote install still supplies its authoritative
backend ID explicitly.
## What Changes
- Make loader code return a focused plugin capability summary instead of
constructing analytics metadata.
- Centralize immutable plugin telemetry metadata construction in
`PluginsManager`.
- Route local install/uninstall, remote install, enable/disable, and
plugin-used emitters through the manager.
- Preserve the current serialized analytics contract exactly.
Normal metadata still has no remote override. Remote install continues
to provide its authoritative backend ID explicitly, so the existing
serializer continues reporting that ID through `plugin_id`.
Snapshot-based enrichment is intentionally deferred to the final PR.
## Testing
- `just test -p codex-core-plugins` (238 tests passed)
- `just test -p codex-plugin` (3 tests passed)
- Scoped Clippy/compile checks passed for `codex-plugin`,
`codex-core-plugins`, `codex-app-server`, and `codex-core`.
## Split Overview
```text
main
├── #27093 Debug analytics capture (merged)
├── #27099 Non-mutating plugin smoke (merged)
├── #27100 Remote install/uninstall smoke (merged)
└── #27102 Plugin telemetry metadata refactor ← you are here
└── #27669 Persist remote plugin identity
After #27102 and #27669 merge:
└── Final PR: add explicit local and remote IDs to plugin analytics
```
Review order and dependencies:
1. [#27093 Add debug-only analytics event
capture](#27093) (merged)
2. [#27099 Add a plugin analytics smoke
workflow](#27099) (merged)
3. [#27100 Add a remote plugin analytics mutation smoke
workflow](#27100) (merged)
4. This metadata refactor, independent and based on `main`
5. [#27669 Persist remote plugin
identity](#27669), stacked on this
PR
6. Final remote-ID behavior PR, created after the prerequisites merge
The original [#26281](#26281)
remains open as the aggregate reference until the final replacement PR
is published.
## Summary
[#26701](#26701) added remote plugin identity support, [#26702](#26702) added remote-section fetching and state, and [#28768](#28768) extracted the catalog rendering module. This PR builds the product-facing `/plugins` catalog on that foundation so remote records appear as OpenAI Curated, Workspace, and Shared with me sections rather than backend marketplace implementation details.
Plugin details remain read-only for sharing metadata. This PR does not add share-authoring actions or change the app-server protocol.
## Changes
- Renders OpenAI Curated, Workspace, and Shared with me sections with loading, empty, and error states.
- Preserves section selection and stable tab ordering as remote sections transition between fallback and populated states.
- Shows OpenAI Curated loading only when the explicit vertical fallback request was issued.
- Centralizes remote marketplace identity matching around the existing marketplace constants.
- Uses product labels for remote marketplaces and identifies the personal marketplace as Local by its path.
- Shows read-only source, authentication, version, and sharing metadata in plugin detail views.
- Applies narrow display deduplication for local and remote records sharing a remote plugin ID:
  - installed records take precedence;
  - local mapped sources are preferred for details only when their installed state matches the selected record.
- Returns from detail and confirmation views through the current plugin cache so newly loaded remote sections are not overwritten by an older captured response.
- Keeps admin-disabled plugins view-only and labels default-installed plugins as Available by default.
## Tests
New tests:
- `plugins_popup_admin_disabled_available_plugin_has_view_only_hint`
- `plugins_popup_remote_section_fallback_states_snapshot`
- `plugins_popup_installed_remote_row_keeps_remote_detail_when_local_share_is_uninstalled`

Updated existing plugin catalog tests and snapshots for product labels, detail metadata, personal-marketplace labeling, and stable tab ordering.
Verification:
- `cargo clippy -p codex-tui --all-targets -- -D warnings`
## Follow-ups
- Local/remote duplicate normalization should eventually move into app-server. This PR intentionally keeps the compatibility behavior narrow and display-only.
- PR5 will sanitize sensitive components before displaying Git source URLs.
## Why
#29113 moved remote sandbox setup and enforcement to the exec server. That gives the executor ownership of the platform-specific work: a Linux executor chooses and runs a Linux sandbox even when the Codex orchestrator is running on macOS or Windows.
It also means the orchestrator no longer knows which concrete sandbox the executor selected. When that sandbox blocks a remote command, the orchestrator currently sees only a failed process and can treat the denial as an ordinary command failure. The existing sandbox approval and retry path is then skipped.
This PR lets the executor report one portable fact:
> This command probably failed because the executor sandbox blocked it.

The executor keeps its concrete sandbox type private. The protocol sends only the semantic result.
## Example
Suppose a local macOS Codex session asks a Linux devbox to write outside the allowed workspace.
Before this PR:
```text
Linux sandbox blocks the write
-> remote process exits with "Permission denied"
-> local orchestrator sees an ordinary command failure
-> the normal sandbox approval and retry path can be skipped
```
With this PR:
```text
Linux sandbox blocks the write
-> executor reports sandboxDenied: true
-> unified exec returns UnifiedExecError::SandboxDenied
-> the existing approval prompt is shown
-> an approved retry runs through the existing unsandboxed retry path
```
## What changes
### The executor remembers its selected sandbox
The prepared remote process now retains the executor-selected `SandboxType`. This value never crosses the executor boundary. Commands started without a sandbox retain `SandboxType::None` and are never reported as sandbox denials.
### The executor uses the existing denial heuristic
The existing local denial heuristic moves from `codex-core` into the shared `codex-sandboxing` crate. When a sandboxed remote process exits, the executor:
1. waits the same short output grace period used by local unified exec;
2. reads the output currently available in the existing retained output buffer;
3. runs the existing heuristic using the exit code and common denial messages;
4. stores the yes/no result before publishing the process exit.

This deliberately matches the old local unified-exec behavior. It does not add a new streaming classifier, another output buffer, or stronger output-retention guarantees.
### The protocol reports a portable boolean
`process/read` gains `sandboxDenied`:
```json
{
  "exited": true,
  "exitCode": 1,
  "closed": false,
  "sandboxDenied": true
}
```
The field defaults to `false` when an older executor omits it. The response does not expose the executor sandbox implementation or executor-native paths.
### Unified exec uses the existing error path
The exec-server client carries `sandboxDenied` into the unified process state. If it is true, unified exec returns the existing `SandboxDenied` error instead of trying to classify remote output using an orchestrator-side sandbox type.
Remote process exit remains visible as soon as the process exits. This PR does not wait for stdout or stderr to close and does not change the existing process lifecycle.
## Scope
This PR is intentionally limited to matching the existing local unified-exec behavior for the initial command execution path.
It does not add:
- incremental denial tracking across the full output stream;
- new denial handling for commands completed later through `write_stdin`;
- new guarantees for preserving the semantic flag during the narrow reconnect-recovery race.

Those can be considered separately if the same behavior is added for local execution.
## Test coverage
One remote end-to-end integration test covers the complete intended flow:
```text
remote read-only sandbox
-> denied write
-> executor reports the denial
-> Codex requests approval
-> user approves
-> retry succeeds on the remote executor
```
Existing lifecycle coverage continues to verify that remote process exit is reported before late output streams close.
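The executor-side classification and the orchestrator-side default can be sketched together. This is not the heuristic from `codex-sandboxing`: the marker strings, `SandboxKind`, and both function names are assumptions chosen to illustrate the shape of an exit-code-plus-message check and the "missing field means `false`" rule.

```rust
/// Hypothetical reduction of the executor-private sandbox type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SandboxKind {
    None,
    Enforced,
}

/// Illustrative denial messages; the real list lives in `codex-sandboxing`.
const DENIAL_MARKERS: [&str; 2] = ["permission denied", "operation not permitted"];

/// Executor side: decide the yes/no result from the exit code and retained output.
fn probably_sandbox_denied(kind: SandboxKind, exit_code: i32, output: &str) -> bool {
    // Unsandboxed commands and successful exits are never reported as denials.
    if kind == SandboxKind::None || exit_code == 0 {
        return false;
    }
    let lowered = output.to_lowercase();
    DENIAL_MARKERS.iter().any(|marker| lowered.contains(*marker))
}

/// Orchestrator side: an older executor may omit the field entirely.
fn sandbox_denied_or_default(field: Option<bool>) -> bool {
    field.unwrap_or(false)
}
```

Only the boolean crosses the protocol boundary; `SandboxKind` stays on the executor, which matches the privacy property described above.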
…28968)
## Description
This PR cuts Codex over from generic `ResponseItem.metadata` (introduced here: #28355) to `ResponseItem.internal_chat_message_metadata_passthrough`, which is the blessed path and has strongly-typed keys.
For now we have to drop this MAv2 usage of `metadata`: #28561 until we figure out where that should live.
## Summary
- use generated image data URLs in the Python SDK examples and notebook
- document HTTP and HTTPS image URLs as deprecated and recommend `LocalImageInput`
- replace the remote-URL integration test with data-URL coverage

`ImageInput` remains available for data URLs. The SDK does not duplicate app-server URL validation.
## Testing
- `uv run --frozen --no-sync ruff check --output-format=full .`
- `uv run --frozen --no-sync ruff format --check .`
- full Python SDK test suite with an isolated writable `CODEX_SQLITE_HOME` (119 passed, 38 skipped)
## Why
The reset flow introduced in #28154 still describes earned reset credits as "rate-limit resets" and uses generic reset-scope copy. It can also retain a stale available-credit count after redemption or an account change, leaving the reset action enabled after the last credit is used.
This follow-up updates terminology only within that reset feature. Existing rate-limit wording elsewhere in the CLI and TUI is unchanged.
## What changed
- Rename reset-specific `/usage` menu items, startup hints, and reset dialogs to "usage limit reset."
- Describe monthly resets for Free, Go, and accounts that report a monthly usage window; otherwise describe the current 5-hour and weekly limits.
- Recheck a cached zero balance when `/usage` is reopened, and refresh the balance after redemption so the final reset immediately disables the action.
- Correlate async refresh results before updating snapshots and clear account-derived reset state, warnings, prompts, and status surfaces when the account changes.
## Validation
- `just test -p codex-tui chatwidget::tests::usage`: 29 passed.
- `just test -p codex-tui chatwidget::tests::status_command_tests`: 7 passed.
- Account-boundary prompt and plan-mode prompt regression tests passed.
- `cargo insta pending-snapshots` from `codex-rs/tui`: no pending snapshots.

<img width="814" height="318" alt="image" src="https://github.com/user-attachments/assets/2a460e96-458b-4805-8d9f-c759382d21a4" />

view for monthly
<img width="905" height="243" alt="image" src="https://github.com/user-attachments/assets/179f88e3-08fb-4af5-8dc6-ce6a944ed681" />
…ed (#27982)
## Why
The first auto-review currently creates its Guardian child session on demand, adding avoidable latency before the review can begin.
Creating the ordinary Guardian child during parent-session initialization lets that child use the existing session startup WebSocket prewarm before the first escalation. This does not introduce a Guardian-specific prewarm mechanism.
## What changed
- initialize the existing Guardian review-session manager owned by `Session` when a thread starts with auto-review enabled and an approval policy that routes to Guardian
- use the standard Guardian child-session construction and the existing session startup WebSocket prewarm
- preserve the existing reuse-key invalidation and lazy creation fallback when startup initialization fails or the effective review configuration changes
- add an integration test that verifies normal root-session startup emits a Guardian `generate=false` prewarm request
## Benchmark
I compared release builds against main. Each prompt first ran a non-escalated `sleep 3`, then requested an escalated marker command.

| binary | count | avg Guardian duration | median Guardian duration | avg Guardian TTFT |
|---|---:|---:|---:|---:|
| origin-main | 10 | 4008.7 ms | 3949.5 ms | 3746.5 ms |
| session-fix | 10 | 2865.0 ms | 2594.0 ms | 2492.7 ms |

Average Guardian duration fell by 28.5% and average Guardian TTFT fell by 33.5%. These measurements cover Guardian review latency; they do not measure parent thread-start latency.
## Why
`compile_scoped_filesystem_pattern()` accepted a `_policy_cwd` parameter even though scoped glob compilation no longer uses the policy working directory. Keeping that unused argument forced the surrounding permissions compilation path to keep forwarding `policy_cwd` through call sites that did not need it, making the API look more dependent on cwd resolution than it is.
## What changed
Removed the unused cwd parameter from `compile_scoped_filesystem_pattern()` and the callers that only forwarded it: `compile_filesystem_permission()`, `compile_permission_profile()`, and `compile_permission_profile_selection()`.
Workspace root resolution still keeps `policy_cwd`, because that path still resolves relative roots against the active policy cwd.
Relevant code: [`codex-rs/core/src/config/permissions.rs`](https://github.com/openai/codex/blob/b8b9816102e064dae4488ec130cf560f63c1ab78/codex-rs/core/src/config/permissions.rs#L346).
## Verification
- `just test -p codex-core config::permissions`
- `just test -p codex-core` was also run after building `test_stdio_server`; it passed the touched permissions coverage but still reported unrelated existing failures in `cli_stream` and shell snapshot tests.
## Summary
Stacked on #26706.
Adds the shared auth/system-proxy contract that later platform resolver PRs plug into. This PR moves Codex-owned auth and startup HTTP clients through a common route-aware boundary, but does not yet add Windows or macOS system proxy resolution. The default path remains unchanged when `respect_system_proxy` is absent or disabled.
## Implementation
- Adds `codex-client/src/outbound_proxy.rs` with the shared route-selection model:
  - `OutboundProxyConfig`;
  - `ClientRouteClass`;
  - `RouteFailureClass`;
  - `build_reqwest_client_for_route`.
- Preserves the existing reqwest/default-client behavior when no route config is supplied.
- Uses the fixed MVP routing policy when route config is supplied: platform system/PAC/WPAD discovery, then explicit env proxy variables, then direct connection.
- Keeps platform-specific system discovery behind the shared client boundary. This PR provides the contract and fallback behavior; later resolver PRs plug in Windows and macOS discovery.
- Adds `login::AuthRouteConfig` so auth call sites depend on a small policy type instead of platform resolver details.
- Maps the resolved `Config.respect_system_proxy` boolean into `AuthRouteConfig` for auth-owned clients.
- Wires the route config through browser login, device-code login, access-token login, login status, logout/revoke, token refresh, API-key exchange, app-server account login, TUI/app startup, cloud-config bootstrap, cloud tasks, plugin auth, and exec startup config loading.
## End-user behavior
- No behavior changes by default.
- When `respect_system_proxy = true`, auth-owned clients opt into the shared route-aware client path.
- On platforms without a resolver implementation in this PR, system discovery is unavailable and the route-aware path falls back to explicit env proxy handling, then direct connection.
- Custom CA handling remains separate from proxy route selection and still runs through the shared client builder.
- No proxy URLs, PAC contents, or resolved platform details are exposed through the public config surface introduced here.
## Tests
Adds or updates coverage for:
- preserving default auth-client fallback behavior when no route config is provided;
- injected environment-proxy fallback without mutating process environment;
- existing login-server E2E flows using explicit `auth_route_config: None` to guard unchanged default behavior;
- updated auth manager, login, logout, cloud-config, startup, and plugin-auth call sites passing route config explicitly.
# Summary

Codex required every ChatGPT account to have an email address. A service-account personal access token can return valid account metadata without one, so PAT login failed while decoding the metadata response.

This change makes email optional in the account metadata type that owns it and preserves that absence through authentication, provider account state, the app-server API, generated clients, and TUI bootstrap. Existing accounts with email addresses keep the same behavior.

## Behavior-changing call sites

| Call site | Behavior after this change |
| --- | --- |
| `login/src/auth/personal_access_token.rs` | PAT metadata accepts a missing or null email and retains `None`. |
| `agent-identity/src/lib.rs` | Agent Identity JWT claims accept an omitted email. |
| `login/src/auth/storage.rs` and `login/src/auth/agent_identity.rs` | Stored and managed Agent Identity records carry `Option<String>`. Deserialization maps the legacy empty-string sentinel to `None`. |
| `login/src/auth/manager.rs` | `get_account_email` returns the stored option, and managed identity bootstrap no longer converts `None` to an empty string. |
| `model-provider/src/provider.rs` and `protocol/src/account.rs` | A ChatGPT provider account requires a plan type but may carry no email. |
| `app-server-protocol/src/protocol/v2/account.rs` | `account/read` keeps the `email` field on the wire and returns `null` when the account has no email. Generated TypeScript and JSON schemas describe a required, nullable field. |
| `sdk/python/src/openai_codex/generated/v2_all.py` | The generated Python `ChatgptAccount` model accepts `None` for email. |
| `tui/src/app_server_session.rs` | Email-less ChatGPT accounts bootstrap normally, keep external feedback routing, omit account-email telemetry, and display the plan in account status. |

## Design decisions

- Missing email remains `None` at every layer. The code never uses an empty string as a substitute.
- The app-server response includes `"email": null` instead of omitting the field. Clients retain a stable response shape.
- Plan type remains required for provider account state. This change relaxes only the email assumption.

## Testing

Tests: affected test targets compile, scoped Clippy and formatting pass, a focused TUI snapshot covers plan-only account status, real before/after PAT login smoke covers metadata without email, app-server smoke covers `account/read` with `email: null`, and a regression smoke covers an existing email-bearing PAT. Unit tests run in CI.

## Evidence

Visual smoke evidence will be attached here.
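The email handling described above can be illustrated with a minimal std-only sketch. The function names here are hypothetical and are not taken from the Codex sources; the real code uses serde-based types, which this sketch deliberately avoids.

```rust
/// Map the legacy empty-string sentinel to `None`; real addresses pass through.
/// Hypothetical helper name, for illustration only.
fn normalize_legacy_email(raw: Option<String>) -> Option<String> {
    raw.filter(|email| !email.is_empty())
}

/// Render the `email` member of a JSON object by hand. The key is always
/// emitted so the response shape stays stable. Sketch only: no escaping of
/// special characters is performed.
fn email_json_member(email: &Option<String>) -> String {
    match email {
        Some(address) => format!("\"email\": \"{}\"", address),
        None => "\"email\": null".to_string(),
    }
}
```

The point of the sketch is that absence is represented once, as `None`, and only translated to `null` at the wire boundary.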
## Summary
Instead of:
reminder_interval_tokens = 65_536
allow users to configure explicit remaining-token reminder thresholds:
reminder_at_remaining_tokens = [65_536, 32_768, 16_384, 8_192, 4_096,
2_048, 1_024, 512]
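As a rough illustration of how explicit thresholds might be evaluated, the sketch below returns the thresholds crossed between two budget readings. The function name and the crossing rule (previous value above the threshold, current value at or below it) are assumptions, not the behavior confirmed by this PR.

```rust
/// Return the thresholds crossed when the remaining budget drops from
/// `previous_remaining` to `current_remaining`, largest first.
/// Assumption: a threshold is crossed when the previous value was above it
/// and the current value is at or below it.
fn crossed_thresholds(
    thresholds: &[u64],
    previous_remaining: u64,
    current_remaining: u64,
) -> Vec<u64> {
    let mut crossed: Vec<u64> = thresholds
        .iter()
        .copied()
        .filter(|&t| previous_remaining > t && current_remaining <= t)
        .collect();
    // Report the largest threshold first, matching the configured order.
    crossed.sort_unstable_by(|a, b| b.cmp(a));
    crossed
}
```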
## Validation
- CARGO_INCREMENTAL=0 just test -p codex-core rollout_budget: 9 passed
- just fix -p codex-core
- just fmt
## Why `permissionProfile/list` currently advertises every built-in and configured profile even when effective enterprise requirements prevent selecting it. That forces each client to reconstruct policy from lower-level requirement fields, which is easy to miss and difficult to keep consistent. The catalog should remain complete so clients can explain that an option was disabled by an administrator, while also reporting whether each profile is selectable. ## What - Add an `allowed` field to each permission profile summary. - Build a shared catalog from the effective config and current requirements, including `allowed_sandbox_modes`, `allowed_permissions`, and filesystem restrictions. - Use the shared catalog in app-server and the TUI so disallowed profiles remain visible but cannot be selected. - Use the canonical `:danger-full-access` profile ID in the TUI. - Update the app-server schemas, API documentation, behavioral tests, and TUI snapshots. ## Scope This PR targets `main` directly and is independent of #24852. It preserves the current behavior where built-in profiles are constrained by sandbox-mode requirements and `allowed_permissions` applies to configured profiles. ## Testing - `just test -p codex-core permission_profile_catalog_marks_profiles_disallowed_by_requirements` - `just test -p codex-app-server permission_profile_list` - `just test -p codex-app-server-protocol` - `just test -p codex-tui profile_permissions` - `just fix -p codex-core` - `just fix -p codex-app-server-protocol` - `just fix -p codex-app-server` - `just fix -p codex-tui` - `just fmt` --------- Co-authored-by: Codex <noreply@openai.com> Co-authored-by: Joey Trasatti <joey.trasatti@openai.com>
## Summary - initialize `selected_capability_roots` in the new `attach_in_memory_thread_store` test helper - restore `codex-core` test compilation on `main` ## Root cause [#30144](#30144) added the helper from commit `0c3d0742`, whose parent was `c38b2e9b`. That branch was based before [#29856](#29856) added `selected_capability_roots` as a required field on `CreateThreadParams`. The PR's Rust and Bazel workflows both passed against the stale branch head `0c3d0742`. When #30144 was squashed onto newer `main`, its initializer was integrated alongside the required field from #29856, producing `E0063` in `core/src/session/tests.rs`. Because those workflows tested the branch head rather than the integrated merge result, they did not see the version-skew failure before merge. ## Impact Any job that compiles the `codex-core` library tests fails, which turned the main-branch `rust-ci-full` and `Bazel` workflows red across platforms and blocks unrelated focused core tests. This change only completes the test initializer; it does not alter production behavior or workflow configuration. ## Validation - `just fmt` - `just test -p codex-core turn_complete_flushes_terminal_event_after_delivery` (1 passed, 2909 skipped) - `git diff --check`
## Why
MCP runtime reuse was keyed by every ready selected-capability
environment, even when an environment contributed no MCP servers or
connectors.
For example:
1. a global stdio MCP is running;
2. a selected remote environment contains only a skill;
3. that environment becomes ready;
4. the MCP and connector projection stays exactly the same;
5. Codex nevertheless rebuilds the MCP manager and restarts the global
stdio process.
That restart can interrupt active calls and discard process-local state
even though nothing about MCP changed.
## What changes
When selected-environment availability changes, Codex now resolves the
candidate MCP and connector projection before deciding whether to
replace the runtime:
- if the winning MCP servers or their ownership change, rebuild as
before;
- if the selected connector snapshot changes, rebuild as before;
- if an enabled MCP is explicitly bound to an environment whose
availability changed, rebuild as before;
- otherwise, keep the exact live manager and processes, and update only
the availability input remembered by the snapshot.
```text
ready selected environments: [] -> [skills-env]
resolved MCP servers: {global_probe} -> {global_probe}
resolved connectors: {} -> {}
result: reuse manager; keep the same process
```
The comparison uses the resolved winning servers and their sources, so
plugin/config ownership remains part of the runtime identity.
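The rebuild-or-reuse rules above can be sketched as a pure comparison. All type and field names below are hypothetical stand-ins, not the actual Codex types; the sketch only assumes that servers are compared together with their owning source.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical snapshot of the inputs a runtime was built from.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
struct Projection {
    /// Winning server name -> owning source (for example "config" or a plugin ID).
    servers: BTreeMap<String, String>,
    /// Selected connector IDs.
    connectors: BTreeSet<String>,
    /// Enabled server name -> environment ID it is explicitly bound to.
    bound_environments: BTreeMap<String, String>,
}

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Rebuild,
    Reuse,
}

/// Decide whether an availability change requires replacing the runtime.
fn decide(
    current: &Projection,
    candidate: &Projection,
    changed_environments: &BTreeSet<String>,
) -> Decision {
    let servers_changed = current.servers != candidate.servers;
    let connectors_changed = current.connectors != candidate.connectors;
    let bound_env_changed = candidate
        .bound_environments
        .values()
        .any(|env| changed_environments.contains(env));
    if servers_changed || connectors_changed || bound_env_changed {
        Decision::Rebuild
    } else {
        Decision::Reuse
    }
}
```

With a projection containing only `global_probe` and a changed environment `skills-env` that binds no MCP server, this sketch returns `Reuse`, which corresponds to the example in the text block above.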
## Existing stack coverage
The integration PR directly below this one already covers both rebuild
boundaries: a selected MCP becomes callable and a selected connector
tool becomes model-visible when their environment becomes available. It
also verifies that an unchanged selected MCP runtime keeps its process.
This PR does not add another remote-attachment integration scenario for
the no-change optimization. `environment/add` returns before readiness,
and app-server does not currently expose a deterministic readiness
signal for an environment that contributes only skills. Keeping a
fixed-delay test would add flake risk; adding a new readiness API would
be outside this fix.
## Scope and assumptions
- This does not change skill discovery, World State rendering, or plugin
metadata caching.
- This does not add file watching or hot reload behavior.
- This does not change disconnect/reconnect handling.
- Selected environment IDs and their capability contents retain the
stack's existing stability assumption.
- Delayed `required = true` executor MCP behavior remains out of scope.
## Why
The selected-capability integration test already covers initial
attachment and cold resume, but it resumes while the selected executor
is still reachable.
That leaves an important World State transition untested: a thread
remembers its selected capability root, resumes while that environment
is unavailable, and later sees the same stable environment return.
## What this tests
This extends the existing end-to-end scenario:
```text
selected executor available
↓
app-server stops and the executor goes away
↓
thread resumes with the executor unavailable
↓
skills, selected MCP tools, and connector attribution are absent
↓
the same environment ID is attached again
↓
skills, MCP tools, and connector attribution return
```
The test also checks that the unavailable snapshot explicitly tells the
model that no selected-environment skills are currently available. After
reattachment, it invokes the selected skill again and verifies that a
new executor-owned MCP process starts.
## Scope
This is test-only. It keeps the existing assumption that an environment
ID refers to stable capability contents. It does not add package-file
invalidation or live transport reconnect behavior.
## Summary - stop publicly re-exporting the internally used `SKILLS_INTRO_WITH_ALIASES` constant - keep the constant and all skills rendering behavior unchanged - preserve every integration helper, API, fixture, assertion, and module used by tests ## Scope guardrails This revision keeps all remote/network-facing functionality and every line introduced by `jif <jif@openai.com>`. Following the test-preservation audit, it also restores the in-process RMCP test transport, the original `codex-mcp` fixture, `PluginLoadOutcome::effective_skill_roots` and its assertions, the `EffectiveSkillRoots` API family, the test-only apps renderer, and the TUI dead-code annotation. Those files now match the PR base exactly. No test imports or directly references the remaining public skills export being narrowed. ## Validation - repository-wide test-reference audit: no test-used code remains deleted or narrowed - deleted-line `git blame` audit: zero Jif-authored deletions - `cargo test -p codex-core-plugins -p codex-mcp -p codex-rmcp-client --lib`: 467 passed - `cargo test -p codex-core --lib apps::render`: 2 passed - `cargo test -p codex-core-skills --lib render::tests`: 19 passed - `cargo check -p codex-core-skills --all-targets`: passed - `just fix -p codex-core-skills`: passed - `just fmt`: passed - `git diff --check`: passed The full local `codex-core-skills` suite passed 106/108 tests; two loader tests detected an ambient repository skills root outside the package and failed their isolation assertions. The scoped renderer suite and all-target compile pass, and CI runs in an isolated environment. Final code delta: 1 insertion, 2 deletions across 2 files.
## Summary - Allow a top-level `description` string in `hooks.json`. - Continue rejecting unknown top-level keys and root-level hook events; events must remain under `hooks`. ## Testing - `just test -p codex-config`
## Description

This PR adds a new `historyMode = "legacy" | "paginated"` to `Thread`. This will be stored in `SessionMeta` in the JSONL rollout file and as a new column in the SQLite thread_metadata table, and exposed on `thread/start` and on the `Thread` object in app-server.

## What changed

- Added canonical `ThreadHistoryMode` with `legacy` and `paginated`, defaulting old and new SessionMeta to `legacy`.
- Carried `history_mode` through core session config, ThreadStore stored metadata, local/in-memory stores, rollout metadata extraction, and the existing SQLite `threads` table.
- Added experimental `historyMode` to app-server v2 `Thread` and `thread/start`.
- Made paginated stored threads metadata-discoverable but unsupported for legacy full-history reads, `load_history`, live resume, and create paths.
- Regenerated app-server schema fixtures and added protocol/state/thread-store/app-server coverage for persistence and fail-closed behavior.

## Compatibility floor

Because users may be running various versions of Codex binaries on the same machine (TUI, Codex App, etc.), we will need to establish a compatibility floor for upcoming paginated threads, which will change how thread storage reads and writes work.

The overall plan here:

```
Release N:
- Add historyMode to SessionMeta / Thread / SQLite metadata.
- Teach binaries to understand paginated threads.
- If a binary sees `historyMode="paginated"` but does not support the paginated contract, it refuses to resume/mutate the thread.
- Default remains `"legacy"`.

Release N+1:
- First-party clients start opting into paginated threads where appropriate.
- Internal dogfood / staged rollout.
- Measure old-client usage and paginated-thread unsupported errors.

Release N+2:
- Only after Release N+ is overwhelmingly deployed, make paginated the default.
- Accept that a small tail of N-1-or-older binaries may not understand paginated threads.
```

The important behavior change is fail-closed handling for a binary that encounters a persisted `paginated` thread before it knows how to fully support paginated history.

In app-server, if a thread is `paginated`, we will:

- allow metadata-only discovery paths like `thread/list` and `thread/read(includeTurns=false)`, so clients can still see the thread and inspect its `historyMode`
- reject legacy full-history/live-thread paths like `thread/read(includeTurns=true)` and `thread/resume` with an unsupported JSON-RPC error
- avoid silently treating an unknown or future `historyMode` as `legacy`

Under the hood, the ThreadStore layer also rejects legacy operations that would need to load or replay the full thread history for a paginated thread. That gives us the behavior we want for Release N: future paginated threads are visible, but this binary fails closed instead of trying to operate on them as if they were legacy threads.
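The fail-closed rule can be sketched as a small gate in front of thread operations. The enum variants, function names, and error strings below are illustrative assumptions; only the `legacy` and `paginated` values and the allowed and rejected paths come from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ThreadHistoryMode {
    Legacy,
    Paginated,
}

/// Hypothetical operation kinds mirroring the paths named in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ThreadOperation {
    List,
    ReadMetadata,
    ReadWithTurns,
    Resume,
}

/// Parse a persisted value. A missing field defaults to `legacy`, but an
/// unknown value is an error and is never silently treated as `legacy`.
fn parse_history_mode(raw: Option<&str>) -> Result<ThreadHistoryMode, String> {
    match raw {
        None | Some("legacy") => Ok(ThreadHistoryMode::Legacy),
        Some("paginated") => Ok(ThreadHistoryMode::Paginated),
        Some(other) => Err(format!("unsupported historyMode: {}", other)),
    }
}

/// Allow metadata-only paths for paginated threads; reject full-history paths.
fn check_operation(mode: ThreadHistoryMode, op: ThreadOperation) -> Result<(), String> {
    match (mode, op) {
        (ThreadHistoryMode::Paginated, ThreadOperation::ReadWithTurns)
        | (ThreadHistoryMode::Paginated, ThreadOperation::Resume) => {
            Err(format!("{:?} is unsupported for paginated threads", op))
        }
        _ => Ok(()),
    }
}
```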
Introduced by a merge race around thread.history_mode.
## Why Admins need persistent defaults for the model, reasoning effort, and service tier shown when the Desktop App creates a new thread. These are initialization defaults rather than runtime constraints: the App should use them to initialize its draft while still allowing a user to make an explicit selection. The app-server therefore needs to expose the managed values before thread creation without changing `thread/start` behavior for other clients. ## What changed - Parse `model`, `model_reasoning_effort`, and `service_tier` from `[models.new_thread]` in `requirements.toml`. - Compose the `models` requirements through the existing requirements-layer precedence rules. - Expose the resolved values through `configRequirements/read` as `requirements.models.newThread`. - Add the corresponding app-server protocol types and regenerate the JSON and TypeScript schema fixtures. - Document the new `configRequirements/read` fields in the app-server README. ## Scope This PR is data plumbing only. It does not apply these values during `thread/start` and does not change thread creation for existing app-server clients, resumed or forked sessions, internal or subagent sessions, `codex exec`, or the TUI. A companion Desktop App change owns draft initialization, sends the effective settings for ordinary and prewarmed starts, and preserves explicit user changes. ## Validation - Requirements deserialization coverage for `[models.new_thread]` - Requirements-layer precedence coverage - App-server API mapping coverage - `configRequirements/read` integration coverage - Regenerated app-server JSON and TypeScript schema fixtures
## Why
Environment skill discovery needs two independent pieces of information:
- plugin namespaces from `plugin.json` files; and
- skill metadata from each `SKILL.md` file.
Today these happen in sequence. Codex waits for every plugin namespace
lookup to finish before it starts reading any skill files. On a remote
executor, that creates an avoidable network-latency barrier.
```text
before: walk -> namespace lookups -> skill reads -> build catalog
after: walk -> namespace lookups ─┐
-> skill reads ───────┴-> build catalog
```
## What changes
- Read and parse skill files without waiting for plugin namespace
discovery.
- Resolve root and nested plugin namespaces concurrently.
- Join both results only when constructing the final qualified skill
names.
- Keep the existing 64-skill concurrency bound, output ordering,
warnings, metadata behavior, and namespace rules.
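The fork-and-join shape described above can be sketched with scoped threads from the standard library. This is not the Codex implementation: the function name, the closure-based lookups, and the `namespace:skill` separator are assumptions, and the 64-skill concurrency bound is omitted for brevity.

```rust
use std::collections::HashMap;
use std::thread;

/// Build qualified skill names from `(plugin_dir, skill_file)` pairs.
/// Namespace lookups and skill reads run on separate threads and are
/// joined only when the final names are constructed.
fn build_catalog<N, S>(skills: &[(&str, &str)], lookup_namespace: N, read_skill: S) -> Vec<String>
where
    N: Fn(&str) -> String + Sync,
    S: Fn(&str) -> String + Sync,
{
    thread::scope(|scope| {
        // Branch 1: resolve plugin namespaces.
        let namespaces = scope.spawn(|| {
            skills
                .iter()
                .map(|&(plugin, _)| (plugin.to_string(), lookup_namespace(plugin)))
                .collect::<HashMap<String, String>>()
        });
        // Branch 2: read skill metadata without waiting for branch 1.
        let names = scope.spawn(|| {
            skills
                .iter()
                .map(|&(_, file)| read_skill(file))
                .collect::<Vec<String>>()
        });
        let namespaces = namespaces.join().expect("namespace lookup panicked");
        let names = names.join().expect("skill read panicked");
        // Join point: combine both results in the original input order.
        skills
            .iter()
            .zip(names)
            .map(|(&(plugin, _), name)| format!("{}:{}", namespaces[plugin], name))
            .collect()
    })
}
```

Output ordering follows the input slice, so the join step does not depend on which branch finishes first.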
## Testing
The regression test makes plugin manifest lookup wait until a `SKILL.md`
read has started. The old serialized pipeline would time out; the new
pipeline completes and still returns the correctly namespaced skill.
`just test -p codex-core-skills` passes all 111 tests.
## Out of scope
This does not add an exec-server endpoint, batch filesystem calls, or
reduce the number of files transferred. A frontmatter-only read or
server-side skill catalog can remain a separate follow-up if benchmarks
show that transferred bytes are the next bottleneck.
Update the MAv2 prompt to include agents.md and skills more explicitly. This should mimic #27919.
## Why #29683 exposes managed defaults for new-thread model settings through `configRequirements/read` without applying them server-wide. The TUI is an app-server client, so it should explicitly consume those defaults when it creates a fresh thread. This lets plain `codex` start on the managed model while preserving the existing ability to change model settings within the thread. ## What changed - Read `requirements.models.newThread` during TUI app-server bootstrap. - Apply the managed model, reasoning effort, and service tier to the initial fresh thread and subsequent `/new` or `/clear` threads. - Keep explicit launch overrides above the managed defaults. - Normalize the managed `fast` service tier to the `priority` request value. - Leave resumed and forked threads unchanged. The application logic lives in a small TUI-only module; app-server `thread/start` behavior remains unchanged for other clients. ## User experience - Plain `codex` starts with the managed new-thread settings. - A user can still change settings with `/model` or the existing service-tier controls. - Starting another fresh thread reapplies the managed defaults. - Explicit launch choices such as `codex -m <model>` continue to win. ## Validation - `just test -p codex-tui managed_new_thread_defaults` - `just fix -p codex-tui` Depends on #29683.
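The precedence described above (explicit launch overrides first, then managed defaults) can be sketched as a field-by-field merge. The struct and function names are hypothetical, and applying the `fast` to `priority` normalization only to the managed value is an assumption made for this sketch.

```rust
#[derive(Debug, Clone, Default, PartialEq, Eq)]
struct ThreadSettings {
    model: Option<String>,
    reasoning_effort: Option<String>,
    service_tier: Option<String>,
}

/// Map the managed `fast` tier to the `priority` request value.
fn normalize_service_tier(tier: &str) -> &str {
    if tier == "fast" {
        "priority"
    } else {
        tier
    }
}

/// Resolve settings for a fresh thread: launch overrides win over managed defaults.
fn resolve_new_thread(launch: &ThreadSettings, managed: &ThreadSettings) -> ThreadSettings {
    ThreadSettings {
        model: launch.model.clone().or_else(|| managed.model.clone()),
        reasoning_effort: launch
            .reasoning_effort
            .clone()
            .or_else(|| managed.reasoning_effort.clone()),
        service_tier: launch.service_tier.clone().or_else(|| {
            managed
                .service_tier
                .as_deref()
                .map(|tier| normalize_service_tier(tier).to_string())
        }),
    }
}
```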
## Description This PR makes `thread.history_mode` immutable after the thread's canonical first `SessionMeta` has been written. Later same-thread `SessionMeta` lines are compatibility metadata writes, not a new thread definition. Without this, an older binary could append a `SessionMeta` that omits `history_mode`; when a newer binary replays it, serde defaults that missing field to `legacy` and SQLite could downgrade a paginated thread. ## Why `history_mode` is the persisted thread storage contract. Paginated-thread fail-closed behavior and SQLite memory filtering depend on it staying aligned with canonical rollout metadata, especially when multiple Codex binary versions can touch the same local rollout. ## What changed - Stop generic rollout metadata replay from overwriting `history_mode` from later `SessionMeta` items. - Remove `history_mode` from `ThreadMetadataPatch`, so mutable metadata sync and app-server metadata updates cannot rewrite it. - When local metadata sync has to recreate a missing SQLite row, recover `history_mode` from the rollout's canonical first `SessionMeta` instead of from a mutable patch. - Keep the in-memory thread store using the created thread's canonical `history_mode` instead of metadata patches. - Fill the one remaining core test `CreateThreadParams` initializer with the new `history_mode` field; Bazel CI caught this after the parent history-mode PR landed. ## Validation - `just fmt` - `just test -p codex-thread-store` - `just test -p codex-state session_meta_does_not_set_model_or_reasoning_effort`
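The downgrade hazard and its fix can be shown with a minimal replay sketch. The types below are simplified stand-ins (an `Option` models a `SessionMeta` line that omits the field), not the real rollout structures.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HistoryMode {
    Legacy,
    Paginated,
}

/// Simplified stand-in for a `SessionMeta` line. `None` models a line written
/// by an older binary that omits `history_mode`.
struct SessionMeta {
    history_mode: Option<HistoryMode>,
}

/// Take the mode from the canonical first `SessionMeta` only. Later lines are
/// ignored, so a missing field on a later line cannot downgrade the thread.
fn canonical_history_mode(metas: &[SessionMeta]) -> Option<HistoryMode> {
    metas
        .first()
        .map(|meta| meta.history_mode.unwrap_or(HistoryMode::Legacy))
}
```

A naive replay that let the last line win would read the second line's missing field as `legacy`; reading only the first line avoids that.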
## Description This adds stable optional `turnId` support to `thread/fork`. When supplied, the fork copies persisted history through that terminal turn, inclusive, and drops later turns from the new thread. Omitting or passing `null` preserves the existing full-history fork behavior, including the interruption marker when the stored source history ends mid-turn. ## Why We're deprecating `thread/rollback` and this will help certain UX use cases work around it by using `thread/fork` + `turn_id` instead.
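The inclusive truncation can be sketched as follows. The `Turn` type and function name are hypothetical, and returning an error for an unknown turn ID is an assumption; the sketch also does not check that the selected turn is terminal.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct Turn {
    id: String,
}

/// Copy history for a fork. With `None`, copy everything; with `Some(id)`,
/// copy through that turn inclusive and drop later turns.
fn fork_history(source: &[Turn], turn_id: Option<&str>) -> Result<Vec<Turn>, String> {
    match turn_id {
        None => Ok(source.to_vec()),
        Some(id) => {
            let index = source
                .iter()
                .position(|turn| turn.id == id)
                .ok_or_else(|| format!("unknown turn id: {}", id))?;
            Ok(source[..=index].to_vec())
        }
    }
}
```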
## Why I use the `$code-review` skill a lot and it'd be nice to add my own additional review criteria in `$CODEX_HOME/skills/code-review-*`. ## What Removes phrasing about "code-review-* skills in this repository" which in practice seems like enough to get Codex to consult my user-level code review skills in addition to the repo-level ones.
## Summary - add Sol (`openai.gpt-5.6-sol`), Terra (`openai.gpt-5.6-terra`), and Luna (`openai.gpt-5.6-luna`) to the Amazon Bedrock static model catalog - derive all three entries from the bundled GPT-5.5 metadata and add the Bedrock-only `max` reasoning effort - keep the new entries below the current GPT-5.5 and GPT-5.4 models at priorities 2, 3, and 4, preserving GPT-5.5 as the default - add deep-equality coverage for inherited model configuration, catalog ordering, context windows, and service-tier behavior
### Summary Release live thread persistence when a session ends because its submission channel closes. This prevents a later same-process resume from failing with `thread ... already has a live local writer`. ### Details The issue is in the `codex-core` session teardown path used by Codex hosts, rather than in Managed Agents API or exec-server itself. Explicit shutdown already closes the `LiveThread`, which releases the process-scoped writer held by `LocalThreadStore`. The submission-channel-close fallback ran runtime and extension teardown but skipped that persistence shutdown, leaving the thread ID registered as having a live writer. This change: - closes the `LiveThread` on the channel-close fallback path; - preserves the existing teardown order used by explicit shutdowns; - extends the lifecycle regression test to assert that the thread store receives `shutdown_thread`. Context: [original report](https://openai.slack.com/archives/C0B4NBHQGTV/p1782136364948039), [recent occurrence 1](https://openai.slack.com/archives/C0B4NBHQGTV/p1782434817895839?thread_ts=1782136364.948039&cid=C0B4NBHQGTV), [recent occurrence 2](https://openai.slack.com/archives/C0B4NBHQGTV/p1782335107474429?thread_ts=1782136364.948039&cid=C0B4NBHQGTV) ### Testing - `just test -p codex-core submission_loop_channel_close_runs_full_thread_teardown` - `just test -p codex-core --lib` (1,989 passed; 3 skipped) - `just fix -p codex-core` - `just fmt` - Native code review: no findings I also attempted `just test -p codex-core`. The new regression passed; 79 unrelated integration tests failed in the local harness, primarily because helper binaries such as `test_stdio_server` were unavailable, plus local proxy/shell timing failures.
## Summary - classify authentication-required RMCP startup failures, including errors nested inside `ClientInitializeError::TransportError` - let `codex-mcp` consume that classification so the existing `reauthenticationRequired` startup failure reason is emitted - add a regression test that performs real startup with an expired persisted OAuth token and no refresh token ## Why Follow-up to #29877. RMCP stores streamable HTTP initialization failures inside a dynamic transport error whose payload is not exposed through the standard Rust error source chain. The original `anyhow::Error::chain()` check therefore missed the nested `AuthError::AuthorizationRequired` seen during real MCP startup and emitted `failureReason: null`. The transport-specific inspection now lives in `codex-rmcp-client`, while `codex-mcp` consumes only the domain-level authentication-required result. This classifier does not distinguish first-time login from reauthentication; the existing auth-state logic remains responsible for that distinction. ## User impact When stored MCP OAuth credentials are expired and cannot be refreshed, app clients now receive `failureReason: "reauthenticationRequired"` on the failed startup update and can show the reconnect action. First-time login and unrelated startup failures remain unchanged. ## Validation - `just test -p codex-rmcp-client --test streamable_http_oauth_startup identifies_expired_unrefreshable_token_startup_error` - `just test -p codex-mcp startup_outcome_error_identifies_authentication_required` - `just test -p codex-mcp mcp_startup_failure_reason_requires_existing_oauth_and_auth_failure` - `cargo build -p codex-cli --bin codex` - local app-server probe emitted `failureReason: "reauthenticationRequired"` - manual end-to-end reconnect flow confirmed - `just fmt`
## Why
Marketplace source deserialization treated `{"source":"npm", ...}` as
unsupported. The loader logged and skipped the entry, so npm-backed
plugins never appeared in `plugin list --available` and `plugin add`
returned "plugin not found".
Codex plugins are installed from a plugin root, not from an npm
dependency tree. For npm-backed marketplace entries, Codex should fetch
the published package contents without running package scripts or
installing unrelated dependencies.
## What changed
- Add `npm` marketplace plugin sources with `package`, optional semver
`version` or version range, and optional HTTPS `registry`.
- Reject unsafe npm source fields before materialization, including
invalid package names, non-semver version selectors, plaintext or
credential-bearing registry URLs, and registry query/fragment data.
- Materialize npm plugins with `npm pack --ignore-scripts`, then unpack
the resulting tarball through the existing hardened plugin bundle
extractor.
- Enforce npm archive and extracted-size limits, require the standard
npm `package/` archive root, and verify the extracted `package.json`
name matches the requested package before installing.
- Keep plugin listings, install-source descriptions, CLI JSON/human
output, app-server v2 `PluginSource`, TUI source summaries, regenerated
schema fixtures, and app-server documentation in sync.
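The registry checks listed above can be approximated with a simplified, std-only sketch. This is not the validation used by the PR: the function name is hypothetical, the scheme match is case-sensitive, and a production implementation would use a proper URL parser.

```rust
/// Reject registry URLs that are not HTTPS, carry credentials, or include
/// query or fragment data. Simplified string-based sketch.
fn validate_registry(url: &str) -> Result<(), String> {
    // Plaintext (or any non-HTTPS) registries are rejected.
    let rest = url
        .strip_prefix("https://")
        .ok_or("registry must use https")?;
    if rest.contains('?') || rest.contains('#') {
        return Err("registry must not contain a query or fragment".to_string());
    }
    // The authority is everything before the first path separator.
    let authority = rest.split('/').next().unwrap_or("");
    if authority.is_empty() {
        return Err("registry is missing a host".to_string());
    }
    if authority.contains('@') {
        return Err("registry must not embed credentials".to_string());
    }
    Ok(())
}
```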
## Impact
Marketplaces can distribute Codex plugins from public or configured
private HTTPS npm registries using the same install flow as existing
materialized plugin sources. `npm` must be available on `PATH` when an
npm-backed plugin is installed.
Fixes #27831
## Validation
- `just write-app-server-schema`
- `just test -p codex-core-plugins -p codex-app-server-protocol -p
codex-app-server -p codex-cli`
- npm/schema/core-plugin coverage passed in the run.
- The full focused command finished with `1739 passed`, `11 failed`, and
`6 timed out`; the failures were unrelated local app-server environment
failures from `sandbox-exec: sandbox_apply: Operation not permitted`
plus one missing `test_stdio_server` helper binary.
- Installed an npm-published Codex plugin package through a throwaway
local marketplace and throwaway `CODEX_HOME` to exercise the real npm
materialization path end to end.
## Why

It's hard to change the set of required jobs when they're managed in the GitHub UI, and when each workflow is responsible for choosing its own scheduling, it's easy to end up with skew between what we enforce on PRs vs. on main.

## What

- add a `blocking-ci` caller workflow, triggered by pull requests and pushes to `main`, for Bazel, blob size, cargo-deny, Codespell, `repo-checks`, rust CI, and SDK CI
- add an `always()` terminal job named `CI required` that fails unless every called workflow succeeds
- add a `postmerge-ci` caller workflow for `rust-ci-full` and `v8-canary`, with a terminal `Postmerge CI results` job
- centralize V8 relevance detection in `v8_canary_changes.py`; unrelated PR and postmerge runs execute metadata only and skip the expensive build matrices
- leave `v8-canary` outside the blocking gate and leave the external `cla` check independent

## Rollout

A repository admin must replace the existing required GitHub Actions contexts with `CI required` in the main-branch ruleset. Retain `cla` as a separate required check. Until that change is coordinated, this PR cannot satisfy the old standalone check names. In-flight PRs will need to be rebased after this lands.
## Description
This PR adds canonical core `TurnItem` shapes for command execution,
dynamic tool calls, collab agent tool calls, and sub-agent activity, to
be stored in the rollout file soon.
It also teaches app-server protocol / `ThreadHistoryBuilder` how to
render those items, and adds the small legacy fanout helpers needed for
existing event-based consumers. No core producer or rollout persistence
behavior changes here; that will be done in a follow-up.
## Making ThreadHistoryBuilder stateless
This is the first PR in a stack to make `ThreadHistoryBuilder` stateless
enough that we can materialize app-server `ThreadItem`s from only a
given slice of `RolloutItem` history, without ever needing to replay the
whole thread from the beginning.
The persisted legacy `RolloutItem::EventMsg` records are mostly shaped
like live UI events, not like materialized `ThreadItem`s. They work if
we replay the full rollout in order, but they often do not contain
enough stable identity or complete item state to project an arbitrary
suffix on its own.
A few examples:
- `UserMessageEvent` and `AgentMessageEvent` have content, but
historically do not carry the persisted app-server item ID that should
become the SQLite primary key.
- `AgentReasoningEvent` and `AgentReasoningRawContentEvent` are
fragments. `ThreadHistoryBuilder` currently merges them into the last
reasoning item, which means a slice starting in the middle of reasoning
cannot know whether to append to an earlier item or create a new one.
- `WebSearchEndEvent`, `McpToolCallEndEvent`, collab end events, and
similar legacy events can often render a final-looking item, but they
usually rely on prior replay state to know which turn owns the item.
- Begin/end legacy events are partial views of one logical item. The
builder correlates them by `call_id` and mutates prior state to
synthesize the final `ThreadItem`.
That is the problem this direction fixes. A persisted canonical
lifecycle record looks much closer to the read model we actually want
later:
```rust
ItemCompletedEvent {
turn_id,
item: TurnItem { id, ...full snapshot... },
completed_at_ms,
}
```
Once rollout has explicit `turn_id`, stable `item.id`, and a canonical
completed item snapshot, the future SQLite projector can reduce only the
new rollout suffix and upsert the affected `thread_items` rows. It no
longer needs to synthesize `item-N`, infer item ownership from the
active turn, or replay earlier events just to reconstruct the current
item snapshot.
## What changed
- Added core `TurnItem` variants and item structs for command execution,
dynamic tool calls, collab agent tool calls, and sub-agent activity.
- Added conversions from those canonical items back into the legacy
event shapes where current consumers still need them.
- Added app-server v2 `ThreadItem` conversion for the new core item
variants.
- Taught `ThreadHistoryBuilder` and rollout persistence metrics to
recognize the new item variants.
## Follow-up
The next PR #30283 switches the live
core producers for these item families onto canonical `ItemStarted` /
`ItemCompleted` events.
## Why
Remote-control websocket reconnects and pairing requests proactively refresh their server token. When `/server/refresh` returns a transient error such as `502`, the still-valid token was discarded as a usable connection path, causing reconnect failures and repeated refresh attempts that could amplify an upstream incident.
## What Changed
- Start proactive refresh five minutes before token expiry and distinguish it from a required refresh for missing or expired tokens.
- Continue websocket and pairing operations with the existing valid token after `429`, `5xx`, or timeout failures.
- Share an in-memory `next_refresh_at` throttle across websocket and pairing callers, honoring both `Retry-After` formats and otherwise using a jittered 24–36 second delay.
- Keep required refreshes strict, preserve `404` enrollment replacement, and clear token/throttle state for `401` and `403` auth recovery.
- Preserve refresh response metadata internally and add focused wire-level and integration coverage.
## Verification
Added behavioral coverage proving that:
- a valid near-expiry token still completes websocket and pairing requests after transient refresh failures;
- `Retry-After` suppresses a subsequent refresh across websocket and pairing callers;
- request and response-body timeouts are classified as transient;
- an expired token, including one that expires during refresh, cannot proceed to websocket connection;
- auth failures clear the attempted token without overwriting a concurrently rotated token.
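The failure handling described in this change can be summarized as a small decision table. The sketch below is an assumption-laden illustration: the type and function names (`RefreshKind`, `RefreshFailure`, `Outcome`, `refresh_kind`, `on_refresh_failure`) are hypothetical, the treatment of other status codes is a guess, and the `next_refresh_at` throttle and `Retry-After` parsing are left out.

```rust
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq)]
enum RefreshKind {
    Proactive, // token still valid, but close to expiry
    Required,  // token missing or expired
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum RefreshFailure {
    Status(u16),
    Timeout,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Outcome {
    ContinueWithExistingToken,
    ReplaceEnrollment,
    ClearTokenAndReauth,
    Fail,
}

// Proactive refresh starts five minutes before expiry.
const PROACTIVE_WINDOW: Duration = Duration::from_secs(5 * 60);

/// `None` as input means the token is missing or already expired.
/// Returns `None` when no refresh is needed yet.
fn refresh_kind(time_until_expiry: Option<Duration>) -> Option<RefreshKind> {
    match time_until_expiry {
        None => Some(RefreshKind::Required),
        Some(left) if left <= PROACTIVE_WINDOW => Some(RefreshKind::Proactive),
        Some(_) => None,
    }
}

fn on_refresh_failure(kind: RefreshKind, failure: RefreshFailure) -> Outcome {
    match failure {
        // Auth failures clear token and throttle state.
        RefreshFailure::Status(401) | RefreshFailure::Status(403) => Outcome::ClearTokenAndReauth,
        // 404 keeps its existing enrollment-replacement behavior.
        RefreshFailure::Status(404) => Outcome::ReplaceEnrollment,
        // Transient failures: only a proactive refresh may fall back to the
        // still-valid token; required refreshes stay strict.
        RefreshFailure::Status(429) | RefreshFailure::Status(500..=599) | RefreshFailure::Timeout => {
            match kind {
                RefreshKind::Proactive => Outcome::ContinueWithExistingToken,
                RefreshKind::Required => Outcome::Fail,
            }
        }
        // Assumption: any other status is treated as a hard failure.
        RefreshFailure::Status(_) => Outcome::Fail,
    }
}
```

Separating "which kind of refresh is this" from "what did the server say" keeps the strict path for expired tokens visibly distinct from the lenient path for near-expiry tokens.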
## Summary
- complete unified-exec processes from the ordered event stream instead of issuing a final zero-wait `process/read`
- add optional executor sandbox-denial state to `process/exited`
- retain `process/read` as a retained-output and compatibility fallback for receiver lag, sequence gaps, and legacy servers
- recover sandbox-denial state across transport reconnection
- cover the real `TestCodex` remote-exec path without adding a public test-only event constructor
## Why
A successful one-shot tool call currently receives its output and terminal notifications, then pays another wide-area `process/read` round trip before returning. Staging traces showed that remote response wait accounted for more than 99.8% of RPC time; local serialization, queueing, and deserialization were below 0.6 ms.
## Measured impact
A direct staging A/B used the same build and route and changed only completion mode. Each arm ran three times with 30 one-shot `/usr/bin/true` calls per run. The table reports the median of the three per-run percentiles.

| Metric | Final `process/read` | Pushed events | Change |
| --- | ---: | ---: | ---: |
| End-to-end completion p50 | 159.5 ms | 118.7 ms | -40.8 ms (-25.6%) |
| End-to-end completion p95 | 182.4 ms | 131.7 ms | -50.6 ms (-27.8%) |
| Completion-wait p50 | 80.1 ms | 41.5 ms | -38.5 ms (-48.1%) |
| Final `process/read` RPC p50 | 79.9 ms | eliminated | -79.9 ms |

TCP_NODELAY was enabled in both A/B arms, so its effect cancels out. The successful, complete, in-order event path issued zero final `process/read` calls.
## Compatibility and recovery
- new servers send `sandboxDenied` on `process/exited`
- legacy servers omit it, which triggers one compatibility `process/read`
- broadcast lag or a sequence gap triggers a retained-output read
- recovery remains bounded by the server's existing 1 MiB retained-output window
- complete, in-order event streams issue no completion read
- sandbox denial is attached to the exit event before consumers can observe process completion
- server-first and client-first rollouts remain wire-compatible; server-first realizes the latency win immediately
## Integration coverage
The `TestCodex` suite exercises four distinct remote-exec contracts:
- complete pushed output/exit/close with zero reads
- direct pushed sandbox denial with zero reads
- legacy missing denial metadata with exactly one compatibility read
- count-bounded replay eviction recovered from retained output without duplication
## Validation
- `just test -p codex-core exec_command_consumes_pushed_remote_process_events`: 4 passed
- `just test -p codex-core unified_exec::process_tests::`: 4 passed
- `just test -p codex-exec-server`: 294 passed, 2 skipped
- `just test -p codex-exec-server-protocol`: 5 passed
- `just test -p codex-rmcp-client`: 89 passed, 2 skipped
- focused Bazel `//codex-rs/core:core-all-test`: passed across 16 shards
- scoped `just fix` passed for core and exec-server
- `just fmt` passed

The complete workspace suite was not rerun; focused Cargo and Bazel coverage passed for the changed behavior.
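The compatibility rules above amount to a three-way decision at process exit. The following sketch assumes a simplified model in which output events carry contiguous sequence numbers starting at 1 and the exit event takes the next number; `Completion`, `ExitEvent`, and `decide_completion` are hypothetical names, and the separate close notification is not modeled.

```rust
#[derive(Debug, PartialEq)]
enum Completion {
    /// Complete, in-order stream from a new server: no `process/read`.
    FromEvents { sandbox_denied: bool },
    /// Legacy server omitted `sandboxDenied`: one compatibility read.
    CompatibilityRead,
    /// Receiver lag or a sequence gap: recover from retained output.
    RetainedOutputRead,
}

struct ExitEvent {
    seq: u64,
    /// `None` models a legacy server that does not send the field.
    sandbox_denied: Option<bool>,
}

fn decide_completion(output_seqs: &[u64], exit: &ExitEvent, lagged: bool) -> Completion {
    // Assumption: sequence numbers start at 1 and increase by 1 per event.
    let mut expected = 1u64;
    for &seq in output_seqs {
        if seq != expected {
            return Completion::RetainedOutputRead;
        }
        expected += 1;
    }
    if lagged || exit.seq != expected {
        return Completion::RetainedOutputRead;
    }
    match exit.sandbox_denied {
        Some(denied) => Completion::FromEvents { sandbox_denied: denied },
        None => Completion::CompatibilityRead,
    }
}
```

Only the first branch avoids the extra round trip, which is why a server-first rollout realizes the latency win immediately while older servers keep working through the compatibility read.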
## Why
Remote diff-root discovery is independent of world-state construction, but it ran afterward and added filesystem metadata latency before the first model request. Overlap the independent work so thread-cold turns do not pay those waits serially.
## What
- Run `record_context_updates_and_set_reference_context_item` and `turn_diff_display_roots` with `tokio::join!`.
- Reuse the same resolved display roots when constructing `TurnDiffTracker`; no cache or behavior lifecycle changes are introduced.
## Validation
A synthetic executor-skill benchmark with artificial network delay: thread-cold model-request p50 improved from about 1.79 s to 1.58 s.
## Why
`LOG_FORMAT=json` and `RUST_LOG` are supported by app-server, but the behavior was only covered indirectly. We should verify the actual JSONL written by both user-facing entry points: `codex app-server` and the standalone `codex-app-server` binary.

The existing processor shutdown message also always said the channel closed, even though the processor can exit for several different reasons. Structured fields make that event more accurate and useful to log consumers.
## What changed
- Record the processor `exit_reason`, remaining connection count, and forced-shutdown state as structured tracing fields.
- Add a shared process-test helper that enables JSON logging, validates every stderr line as JSON, and verifies the top-level timestamp is RFC 3339.
- Cover both `codex app-server` and `codex-app-server`, asserting the stable `level`, `fields`, and `target` payload.
## Test plan
- `just test -p codex-app-server standalone_app_server_emits_json_info_events`
- `just test -p codex-cli app_server_emits_json_info_events`
## Summary
- Preserve the optional namespace on custom tool calls during response deserialization and app-server replay.
- Use the namespaced tool identifier for streaming argument handling and tool dispatch.
- Regenerate app-server protocol schemas.
- Add regression tests covering namespace serialization and routing.
## Testing
- Ran affected protocol and app-server test suites.
- Ran the full core test suite; two load-sensitive timing tests passed when rerun individually.
- Ran Clippy and formatting checks.
- Verified with a local end-to-end app-server replay that the namespace is preserved through the complete request/response flow.
## Why
Response item IDs represent stable conversation identity. `ContextManager::for_prompt` repairs an unmatched call by synthesizing an `"aborted"` output in the disposable prompt projection, but that output previously had no ID. Assigning a fresh ID on every prompt build would make retries and resumes change otherwise identical model context and reduce prompt-cache reuse.

The concrete bug is that these normalization-created outputs bypass the regular item-ID allocation path. Even with item IDs enabled, a prompt could therefore contain an identified call paired with a synthetic output whose `id` was missing. This change closes that gap by deriving the output ID from the source call's item ID. For legacy calls that have no item ID, the output remains ID-less because there is no stable source identity to derive from.

The originating call already has a stable item ID under the item-ID model introduced in #28814. A prompt-only output can therefore derive stable identity from that call without mutating canonical history or persisted rollouts. This addresses the failure exposed by #30311 while keeping normalization read-only outside its detached prompt snapshot.

UUIDv5 is intentional here because it is the standard namespaced, deterministic UUID construction. Using the output kind and source call ID as the name produces the same UUID on every projection while keeping output kinds in separate name domains. UUIDv7 would introduce randomness and time, so keeping it stable would require persisting the synthetic repair. UUIDv5 uses SHA-1 internally, but this is only an identity mapping, not an authenticity or security boundary.
## What changed
- Derive a deterministic UUIDv5 ID for each synthesized call output from the source call item ID.
- Use the Responses API prefix appropriate for function, custom-tool, tool-search, and local-shell outputs.
- Preserve the existing insertion position immediately after the unmatched call.
- Keep synthesized outputs prompt-only; no rollout, task-lifecycle, compaction, or raw-response behavior changes.
## Testing
- `just test -p codex-core for_prompt_assigns_stable_id_to_synthetic_output_without_reordering_history`
- `just test -p codex-core synthetic_call_output_id_is_stable_across_resumes`
- `just test -p codex-core normalize_adds_missing_output`
- `just test -p codex-core response_item_ids`
## Why
App-server clients that configure named execution environments need to
discover an environment's shell and working directory before selecting
it for a thread or turn. Because the environment can run on a different
operating system than app-server, its working directory is represented
as a canonical `file:` URI rather than a host-local path string. The
probe also needs a bounded response time: an exec-server that completes
initialization but never answers `environment/info` must not hold the
environment serialization queue indefinitely.
## What changed
- Add an experimental `environment/info` app-server RPC for named
environments.
- Route the probe through the managed environment connection and return
target-native shell metadata plus the default working directory as a
`PathUri`.
- Return connection and protocol failures as JSON-RPC errors.
- Bound the exec-server probe response to 30 seconds and remove
timed-out calls from the pending-request table so later environment
mutations can proceed.
- Cover successful responses, omitted working directories, unknown
environments, connection failures, and pending-call cleanup.
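The bounded wait with pending-request cleanup can be illustrated with a small standard-library sketch. It is a simplified, single-threaded model under stated assumptions: `PendingRequests` and its methods are hypothetical names, responses are plain strings, and the real exec-server client is asynchronous rather than built on `std::sync::mpsc`.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError, Sender};
use std::time::Duration;

/// Hypothetical pending-request table keyed by request ID.
struct PendingRequests {
    next_id: u64,
    senders: HashMap<u64, Sender<String>>,
}

impl PendingRequests {
    fn new() -> Self {
        PendingRequests { next_id: 1, senders: HashMap::new() }
    }

    /// Register an outgoing call and return its ID plus the response receiver.
    fn register(&mut self) -> (u64, Receiver<String>) {
        let id = self.next_id;
        self.next_id += 1;
        let (tx, rx) = channel();
        self.senders.insert(id, tx);
        (id, rx)
    }

    /// Deliver a response to a pending call; returns false if it is unknown.
    fn resolve(&mut self, id: u64, response: String) -> bool {
        match self.senders.remove(&id) {
            Some(tx) => tx.send(response).is_ok(),
            None => false,
        }
    }

    /// Wait for a response with a bound. The entry is removed on every path,
    /// so a timed-out call cannot linger in the table.
    fn wait(&mut self, id: u64, rx: &Receiver<String>, timeout: Duration) -> Result<String, String> {
        let result = rx.recv_timeout(timeout);
        self.senders.remove(&id);
        match result {
            Ok(response) => Ok(response),
            Err(RecvTimeoutError::Timeout) => {
                Err(format!("timed out waiting for response after {:?}", timeout))
            }
            Err(RecvTimeoutError::Disconnected) => Err("connection closed".to_string()),
        }
    }

    fn pending_len(&self) -> usize {
        self.senders.len()
    }
}
```

In this model a caller would pass `Duration::from_secs(30)` to `wait`; the important property is that the table is empty again after the timeout, so later work queued behind the probe is not blocked by a stale entry.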
## Protocol examples
Request:
```json
{
"id": 42,
"method": "environment/info",
"params": {
"environmentId": "remote-a"
}
}
```
Successful response:
```json
{
"id": 42,
"result": {
"shell": {
"name": "zsh",
"path": "/bin/zsh"
},
"cwd": "file:///workspace"
}
}
```
If the exec-server initializes but does not answer the probe within 30
seconds:
```json
{
"id": 42,
"error": {
"code": -32603,
"message": "failed to get info for environment `remote-a`: exec-server protocol error: timed out waiting for exec-server `environment/info` response after 30s"
}
}
```
## Testing
- App-server integration coverage for successful info (including omitted
`cwd`), unknown environments, and connection failures.
- Exec-server RPC coverage verifying a timed-out call is removed from
the pending-request table.
---------
Co-authored-by: Michael Bolin <mbolin@openai.com>
## Summary
- project effective marketplace/plugin config through the enterprise source policy so blocked installed plugins become inactive
- filter plugin list/read/discovery and CLI marketplace source/snapshot reporting using the same policy
- enforce source admission for background marketplace cache refreshes
- continue refreshing/upgrading independent marketplaces and plugins when one entry fails, returning per-entry errors
- include policy-projected plugin state in cache and refresh keys so requirement changes invalidate stale results
## Stack
This is PR 2 of 2 and is based on #29690. Review the admission model and source matcher in #29690 first; this PR contains only runtime enforcement.
## Test plan
- `just test -p codex-core-plugins` (287 tests)
- `just test -p codex-cli plugin_list_ignores_implicit_system_marketplace_roots_without_manifests`
- `cargo check -p codex-cli -p codex-app-server --tests`
## Summary
Increase the external currentTime/read request timeout from 5 seconds to 10 seconds.
## Validation
- `just fmt`
- Focused app-server test build was stopped to defer validation to CI.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)