fix: kill stuck agent processes after 5-minute timeout #218

Open
jcenters wants to merge 5 commits into TinyAGI:main from jcenters:pr/agent-timeout

@jcenters

Problem

A hung Claude (or Codex/OpenCode) process holds a message in processing status indefinitely. The queue's stale recovery resets the SQLite status back to pending every minute, but doesn't kill the actual OS process — so the message keeps getting re-queued while the original process keeps running. In messaging channels (e.g. Telegram) this shows as the bot typing forever without ever responding.

Observed: a Claude agent process ran for ~70 minutes on a single message, blocking all subsequent messages for that agent.

Fix

Add a timeoutMs parameter (default: 5 minutes) to runCommand() in packages/core/src/invoke.ts. On timeout, the spawned process is sent SIGKILL and the promise rejects with a timeout error.
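
A minimal sketch of the timeout behavior described above, assuming runCommand() wraps child_process.spawn and collects stdout. This is an illustration, not the actual diff: the stdio layout, the DEFAULT_TIMEOUT_MS constant, and the error messages are assumptions.

```typescript
import { spawn } from "node:child_process";

// Assumed constant name; the PR states a 5-minute default.
const DEFAULT_TIMEOUT_MS = 5 * 60 * 1000;

export function runCommand(
    command: string,
    args: string[],
    cwd?: string,
    envOverrides?: Record<string, string>,
    timeoutMs: number = DEFAULT_TIMEOUT_MS,
): Promise<string> {
    return new Promise<string>((resolve, reject) => {
        const child = spawn(command, args, {
            cwd,
            stdio: ["ignore", "pipe", "pipe"],
            env: { ...process.env, ...envOverrides },
        });

        let stdout = "";
        let stderr = "";
        let timedOut = false;
        child.stdout.on("data", (chunk: Buffer) => { stdout += chunk.toString(); });
        child.stderr.on("data", (chunk: Buffer) => { stderr += chunk.toString(); });

        // On timeout: mark the run as timed out, kill the process, reject.
        const timer = setTimeout(() => {
            timedOut = true;
            child.kill("SIGKILL");
            reject(new Error(`Agent process timed out after ${timeoutMs / 1000}s`));
        }, timeoutMs);

        child.on("error", (err) => {
            clearTimeout(timer);
            reject(err);
        });

        child.on("close", (code) => {
            clearTimeout(timer);
            if (timedOut) return; // the timer callback already rejected
            if (code === 0) resolve(stdout);
            else reject(new Error(stderr.trim() || `Command exited with code ${code}`));
        });
    });
}
```

Callers that omit the fifth argument get the default, which is what keeps existing call sites source-compatible.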

The existing processMessage() catch block already handles this correctly — it calls failMessage(), which resets the message to pending (or dead after 5 retries). No additional error handling needed.
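
The retry path can be pictured with a small in-memory model. Everything here is a hypothetical stand-in: the real processMessage() and failMessage() operate on the SQLite queue, and the exact point at which a message becomes dead (assumed here to be when the retry count reaches 5) is not shown in this PR.

```typescript
type Status = "pending" | "processing" | "completed" | "dead";

interface QueuedMessage {
    id: number;
    status: Status;
    retryCount: number;
}

// Assumption: "dead after 5 retries" means the fifth failure marks the message dead.
const MAX_RETRIES = 5;

// Hypothetical stand-in for failMessage(): bump the counter, then requeue or give up.
function failMessage(msg: QueuedMessage): void {
    msg.retryCount += 1;
    msg.status = msg.retryCount >= MAX_RETRIES ? "dead" : "pending";
}

// Hypothetical stand-in for processMessage(): any rejection, including a
// timeout from runCommand(), lands in the same catch block.
async function processMessage(
    msg: QueuedMessage,
    invoke: () => Promise<string>,
): Promise<string | null> {
    msg.status = "processing";
    try {
        const response = await invoke();
        msg.status = "completed";
        return response;
    } catch (err) {
        failMessage(msg);
        return null;
    }
}
```

Because a timeout surfaces as an ordinary rejection, the model needs no timeout-specific branch, which matches the claim that no additional error handling is required.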

Behavior after this fix

  • Agent stuck for > 5 min → process killed → message retried fresh
  • Retries keep timing out (5 retries) → message goes dead (visible via queue status)
  • Normal tasks under 5 min are unaffected

Notes

  • timeoutMs is an optional parameter with a backward-compatible default, so all existing runCommand() callers are unaffected
  • Applies to Claude, Codex, and OpenCode providers equally

🤖 Generated with Claude Code

runCommand() now sets a SIGKILL timer (default 5 min) on spawned
agent processes. On timeout it rejects the promise, which triggers
the existing failMessage() path in processMessage() — resetting the
message to pending for retry rather than leaving it stuck forever.

Previously a hung Claude process would hold a message in 'processing'
indefinitely, causing the "typing forever" symptom in Telegram even
after the queue recovered the stale status in SQLite.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vercel

vercel bot commented Mar 15, 2026

@jcenters is attempting to deploy a commit to the AGI Team on Vercel.

A member of the Team first needs to authorize it.

@greptile-apps

greptile-apps bot commented Mar 15, 2026

Greptile Summary

This PR adds a 5-minute SIGKILL timeout to runCommand() in invoke.ts to fix agent processes that hang indefinitely, which previously caused messages to cycle through processing → pending forever without the underlying OS process ever being terminated.

The timer/flag plumbing is correct — timedOut prevents a double rejection when close fires after the kill, and clearTimeout is properly called in both the error and close handlers.

Key issues found:

  • Orphaned sub-processes (logic): child.kill('SIGKILL') only terminates the direct claude/codex/opencode CLI process. These CLIs spawn their own child processes (shell commands, git, compilers, etc.), which survive the kill and become orphaned — potentially holding resources that interfere with the fresh retry. Using detached: true and process.kill(-child.pid, 'SIGKILL') would kill the entire process group.
  • No per-agent timeout configuration (style): The 5-minute default is now silently applied to every runCommand() caller. Agents running legitimately long tasks (large refactors, multi-file generation) will time out with no way to opt out short of modifying the source. Exposing a timeout_ms field in AgentConfig would give operators a safe escape hatch.

Confidence Score: 3/5

  • Safe to merge as a partial fix, but orphaned grandchild processes mean the core resource-leak problem is not fully resolved.
  • The timer/guard logic is sound and addresses the immediate symptom (stuck message in queue). However, killing only the direct child leaves entire subprocess trees running after timeout, which could interfere with retries or accumulate over 5 retry cycles. The hardcoded default also risks false timeouts on long-but-legitimate tasks.
  • packages/core/src/invoke.ts — specifically the kill call at line 26 and the spawn options

Important Files Changed

Filename | Overview
packages/core/src/invoke.ts | Adds a timeoutMs parameter (default 5 min) to runCommand() that SIGKILLs the spawned process on timeout and rejects the promise. The guard flag and clearTimeout usage are correct, but the kill only targets the direct child; grandchild processes spawned by the agent CLI are orphaned. Additionally, the fixed default is applied to all callers with no per-agent override.

Sequence Diagram

sequenceDiagram
    participant PM as processMessage()
    participant IA as invokeAgent()
    participant RC as runCommand()
    participant CP as ChildProcess (claude/codex/opencode)
    participant GC as Grandchild Processes

    PM->>IA: invoke agent with message
    IA->>RC: runCommand(cmd, args, cwd, env, timeoutMs=5min)
    RC->>CP: spawn(command, args)
    CP->>GC: spawns sub-processes (git, compilers, tools)
    Note over RC: setTimeout(5 min)

    alt Normal completion (< 5 min)
        CP-->>RC: close(code=0)
        RC->>RC: clearTimeout(timer)
        RC-->>IA: resolve(stdout)
        IA-->>PM: return response
    else Timeout (>= 5 min)
        RC->>RC: timedOut = true
        RC->>CP: kill('SIGKILL')
        Note over GC: ⚠️ Grandchildren NOT killed — orphaned
        CP-->>RC: close event fires
        RC->>RC: if (timedOut) return — skip double reject
        RC-->>IA: reject(Error: timed out)
        IA-->>PM: throw error
        PM->>PM: failMessage() → pending or dead
    end

Last reviewed commit: 13b0c40


const timer = setTimeout(() => {
    timedOut = true;
    child.kill('SIGKILL');

Orphaned grandchild processes not killed on timeout

child.kill('SIGKILL') signals only the direct child process (the claude/codex/opencode CLI). These CLIs spawn sub-processes of their own (e.g., shell commands, compilers, linters, git operations) to do their work. Because the child is spawned without detached: true, it shares the parent's process group, so there is no separate group that can be signaled without also hitting the parent. When the CLI is killed, its sub-processes are reparented (typically to PID 1) and continue running after the timeout. This can:

  • Leak file I/O and CPU for the duration of those sub-tasks
  • Hold file locks or SQLite connections that interfere with the fresh retry
  • On repeated timeouts (up to 5 retries), accumulate many orphaned process trees

To kill the entire process subtree, spawn with detached: true (which creates a new process group) and send SIGKILL to the negative PID:

const child = spawn(command, args, {
    cwd: cwd || SCRIPT_DIR,
    stdio: ['ignore', 'pipe', 'pipe'],
    env,
    detached: true,
});
child.unref(); // don't keep the parent alive for the child

Then in the timer callback:

const timer = setTimeout(() => {
    timedOut = true;
    try {
        process.kill(-child.pid!, 'SIGKILL'); // kills entire process group
    } catch (_) { /* already exited */ }
    reject(new Error(`Agent process timed out after ${timeoutMs / 1000}s`));
}, timeoutMs);

Note: process.kill(-pgid) works on Linux/macOS. On Windows the equivalent is taskkill /F /T /PID, which would need a conditional branch.
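
One way the conditional branch mentioned in the note could look, as a hedged sketch. killProcessTree and spawnInOwnGroup are hypothetical helper names, not part of this PR; the POSIX path assumes the child was spawned with detached: true so that its PID is also its process group ID.

```typescript
import { spawn, spawnSync } from "node:child_process";

// Hypothetical helper: terminate a child and everything it spawned.
export function killProcessTree(pid: number): void {
    if (process.platform === "win32") {
        // taskkill: /T also terminates child processes, /F forces termination.
        spawnSync("taskkill", ["/F", "/T", "/PID", String(pid)], { stdio: "ignore" });
        return;
    }
    try {
        // A negative PID addresses the whole process group.
        process.kill(-pid, "SIGKILL");
    } catch (err) {
        // ESRCH means the group is already gone; anything else is unexpected.
        if ((err as NodeJS.ErrnoException).code !== "ESRCH") throw err;
    }
}

// Hypothetical helper: start the command as the leader of a new process group.
export function spawnInOwnGroup(command: string, args: string[]) {
    return spawn(command, args, { detached: true, stdio: "ignore" });
}
```

In the timer callback, killProcessTree(child.pid) would take the place of the direct process.kill call once child.pid has been checked for undefined.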

import { ensureAgentDirectory, buildSystemPrompt } from './agent';

- export async function runCommand(command: string, args: string[], cwd?: string, envOverrides?: Record<string, string>): Promise<string> {
+ export async function runCommand(command: string, args: string[], cwd?: string, envOverrides?: Record<string, string>, timeoutMs = 5 * 60 * 1000): Promise<string> {

5-minute default silently applies to all existing callers

Every call site (runCommand('claude', ...), runCommand('codex', ...), runCommand('opencode', ...)) now has a hard 5-minute ceiling with no way to override it via agent config. Complex agentic tasks (large refactors, multi-file code generation, long-running research loops) can legitimately exceed 5 minutes, and the caller will receive a timeout error indistinguishable from a real hang.

Consider threading the timeout through AgentConfig so operators can tune it per-agent:

// In AgentConfig type
timeout_ms?: number;

// In invokeAgent
const timeoutMs = agent.timeout_ms ?? 5 * 60 * 1000;
return await runCommand('claude', claudeArgs, workingDir, envOverrides, timeoutMs);

This keeps the backward-compatible default while giving operators an escape hatch for agents that are expected to run longer tasks.
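
If the field is added, a small resolver could also reject nonsensical values before they reach setTimeout. This is a sketch only: resolveTimeoutMs is a hypothetical helper, and the AgentConfig shape below is reduced to the fields needed for illustration (name is an assumed field).

```typescript
interface AgentConfig {
    name: string;          // assumed field, used only in the error message
    timeout_ms?: number;   // field proposed in the review comment above
}

const DEFAULT_TIMEOUT_MS = 5 * 60 * 1000;

// Hypothetical helper: pick the per-agent timeout, falling back to the default.
function resolveTimeoutMs(agent: AgentConfig): number {
    const value = agent.timeout_ms;
    if (value === undefined) return DEFAULT_TIMEOUT_MS;
    if (!Number.isFinite(value) || value <= 0) {
        throw new Error(`Invalid timeout_ms for agent "${agent.name}": ${value}`);
    }
    return value;
}
```

An agent expected to run long tasks could then set, for example, timeout_ms to 1800000 (30 minutes) while every other agent keeps the default.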

jcenters and others added 3 commits March 15, 2026 18:08
- Re-add settings.json to git tracking
- Contains jobs, fitness, content agents
- Prevents accidental config loss

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- @main: Sonnet (no heartbeat)
- @content: Sonnet + heartbeat every 4h
- @fitness: Haiku
- @jobs: Haiku

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add settings.json to .gitignore to prevent accidental commits of configuration
- Create settings.example.json as a template for future developers
- Prevents exposure of agent configurations, working directories, and future sensitive data
- settings.json currently clean (no API keys/tokens) but this prevents future risks

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
child.kill('SIGKILL') only terminated the top-level claude/codex/opencode
process. Any sub-processes it spawned (git, compilers, shell commands,
MCP servers) survived the kill and accumulated as orphans across retries.

Fix: spawn with detached:true to put the child in its own process group,
then use process.kill(-pid, 'SIGKILL') on timeout to take out the entire
tree. Falls back to child.kill() if the process has already exited.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>