fix: kill stuck agent processes after 5-minute timeout by jcenters · Pull Request #218 · TinyAGI/tinyagi

jcenters · 2026-03-15T15:11:39Z

Problem

A hung Claude (or Codex/OpenCode) process holds a message in processing status indefinitely. The queue's stale recovery resets the SQLite status back to pending every minute, but doesn't kill the actual OS process — so the message keeps getting re-queued while the original process keeps running. In messaging channels (e.g. Telegram) this shows as the bot typing forever without ever responding.

Observed: a Claude agent process ran for ~70 minutes on a single message, blocking all subsequent messages for that agent.

Fix

Add a timeoutMs parameter (default: 5 minutes) to runCommand() in packages/core/src/invoke.ts. On timeout, the spawned process is sent SIGKILL and the promise rejects with a timeout error.

The existing processMessage() catch block already handles this correctly — it calls failMessage(), which resets the message to pending (or dead after 5 retries). No additional error handling needed.
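The timeout described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR's diff: `runWithTimeout` is a hypothetical stand-in for `runCommand()` that omits the `cwd` and env-override parameters, and the stdout/stderr handling is an assumption.

```typescript
import { spawn } from 'node:child_process';

// Hypothetical stand-in for runCommand(); the real function also takes cwd and env overrides.
export function runWithTimeout(
    command: string,
    args: string[],
    timeoutMs: number = 5 * 60 * 1000, // backward-compatible default: 5 minutes
): Promise<string> {
    return new Promise<string>((resolve, reject) => {
        const child = spawn(command, args, { stdio: ['ignore', 'pipe', 'pipe'] });
        let stdout = '';
        let stderr = '';
        let timedOut = false;

        child.stdout?.on('data', (chunk: Buffer) => { stdout += chunk.toString(); });
        child.stderr?.on('data', (chunk: Buffer) => { stderr += chunk.toString(); });

        // On timeout: mark the run as timed out, kill the process, reject once.
        const timer = setTimeout(() => {
            timedOut = true;
            child.kill('SIGKILL');
            reject(new Error(`Agent process timed out after ${timeoutMs / 1000}s`));
        }, timeoutMs);

        child.on('error', (err) => {
            clearTimeout(timer);
            reject(err);
        });

        child.on('close', (code) => {
            clearTimeout(timer);
            if (timedOut) return; // the timer callback already rejected
            if (code === 0) resolve(stdout);
            else reject(new Error(stderr.trim() || `Command exited with code ${code}`));
        });
    });
}
```

A caller that already wraps the call in try/catch, as processMessage() is described as doing, would see the timeout as an ordinary rejection.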

Behavior after this fix

Agent stuck for > 5 min → process killed → message retried fresh
If the retry also times out 5 times → message goes dead (visible via queue status)
Normal tasks under 5 min are unaffected
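The retry behavior listed above can be modeled as a small state transition. The sketch below is an in-memory illustration only: `failMessageSketch`, `QueuedMessage`, and the `retryCount` field are hypothetical names, and the exact boundary at which the real failMessage() marks a message dead is an assumption based on the "dead after 5 retries" wording.

```typescript
type MessageStatus = 'pending' | 'processing' | 'dead';

// Hypothetical shape; the real queue stores messages in SQLite.
interface QueuedMessage {
    id: string;
    status: MessageStatus;
    retryCount: number;
}

const MAX_RETRIES = 5; // assumption, taken from "dead after 5 retries"

// Each failure (including a timeout rejection) bumps the retry count.
// Below the limit the message goes back to pending; at the limit it goes dead.
function failMessageSketch(msg: QueuedMessage, maxRetries: number = MAX_RETRIES): QueuedMessage {
    const retryCount = msg.retryCount + 1;
    const status: MessageStatus = retryCount >= maxRetries ? 'dead' : 'pending';
    return { ...msg, status, retryCount };
}
```

Under these assumptions, a message that times out on every attempt is retried fresh each time and stops cycling after the fifth failure.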

Notes

timeoutMs is an optional parameter with a backward-compatible default, so all existing runCommand() callers are unaffected
Applies to Claude, Codex, and OpenCode providers equally

🤖 Generated with Claude Code

runCommand() now sets a SIGKILL timer (default 5 min) on spawned agent processes. On timeout it rejects the promise, which triggers the existing failMessage() path in processMessage() — resetting the message to pending for retry rather than leaving it stuck forever. Previously a hung Claude process would hold a message in 'processing' indefinitely, causing the "typing forever" symptom in Telegram even after the queue recovered the stale status in SQLite. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

greptile-apps · 2026-03-15T15:13:41Z

Greptile Summary

This PR adds a 5-minute SIGKILL timeout to runCommand() in invoke.ts to fix agent processes that hang indefinitely, which previously caused messages to cycle through processing → pending forever without the underlying OS process ever being terminated.

The timer/flag plumbing is correct — timedOut prevents a double rejection when close fires after the kill, and clearTimeout is properly called in both the error and close handlers.

Key issues found:

Orphaned sub-processes (logic): child.kill('SIGKILL') only terminates the direct claude/codex/opencode CLI process. These CLIs spawn their own child processes (shell commands, git, compilers, etc.), which survive the kill and become orphaned — potentially holding resources that interfere with the fresh retry. Using detached: true and process.kill(-child.pid, 'SIGKILL') would kill the entire process group.
No per-agent timeout configuration (style): The 5-minute default is now silently applied to every runCommand() caller. Agents running legitimately long tasks (large refactors, multi-file generation) will time out with no way to opt out short of modifying the source. Exposing a timeout_ms field in AgentConfig would give operators a safe escape hatch.

Confidence Score: 3/5

Safe to merge as a partial fix, but orphaned grandchild processes mean the core resource-leak problem is not fully resolved.
The timer/guard logic is sound and addresses the immediate symptom (stuck message in queue). However, killing only the direct child leaves entire subprocess trees running after timeout, which could interfere with retries or accumulate over 5 retry cycles. The hardcoded default also risks false timeouts on long-but-legitimate tasks.
packages/core/src/invoke.ts — specifically the kill call at line 26 and the spawn options

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/core/src/invoke.ts | Adds a `timeoutMs` parameter (default 5 min) to `runCommand()` that SIGKILLs the spawned process on timeout and rejects the promise. The guard flag and clearTimeout usage are correct, but the kill only targets the direct child, so grandchild processes spawned by the agent CLI are orphaned. Additionally, the fixed default is applied to all callers with no per-agent override. |

Sequence Diagram

sequenceDiagram
    participant PM as processMessage()
    participant IA as invokeAgent()
    participant RC as runCommand()
    participant CP as ChildProcess (claude/codex/opencode)
    participant GC as Grandchild Processes

    PM->>IA: invoke agent with message
    IA->>RC: runCommand(cmd, args, cwd, env, timeoutMs=5min)
    RC->>CP: spawn(command, args)
    CP->>GC: spawns sub-processes (git, compilers, tools)
    Note over RC: setTimeout(5 min)

    alt Normal completion (< 5 min)
        CP-->>RC: close(code=0)
        RC->>RC: clearTimeout(timer)
        RC-->>IA: resolve(stdout)
        IA-->>PM: return response
    else Timeout (>= 5 min)
        RC->>RC: timedOut = true
        RC->>CP: kill('SIGKILL')
        Note over GC: ⚠️ Grandchildren NOT killed — orphaned
        CP-->>RC: close event fires
        RC->>RC: if (timedOut) return — skip double reject
        RC-->>IA: reject(Error: timed out)
        IA-->>PM: throw error
        PM->>PM: failMessage() → pending or dead
    end

Last reviewed commit: 13b0c40

greptile-apps · 2026-03-15T15:13:44Z

packages/core/src/invoke.ts

+
+        const timer = setTimeout(() => {
+            timedOut = true;
+            child.kill('SIGKILL');


Orphaned grandchild processes not killed on timeout

child.kill('SIGKILL') only terminates the direct child process (the claude/codex/opencode CLI). These CLIs themselves spawn sub-processes (e.g., shell commands, compilers, linters, git operations) to do their work. The signal goes to a single PID, and because the child was spawned without detached: true it has no process group of its own that could be signalled as a unit. When the CLI dies, those grandchildren are reparented (typically to PID 1) and continue running after the timeout. This can:

Leak file I/O and CPU for the duration of those sub-tasks

Hold file locks or SQLite connections that interfere with the fresh retry

On repeated timeouts (up to 5 retries), accumulate many orphaned process trees

To kill the entire process subtree, spawn with detached: true (which creates a new process group) and send SIGKILL to the negative PID:

const child = spawn(command, args, {
    cwd: cwd || SCRIPT_DIR,
    stdio: ['ignore', 'pipe', 'pipe'],
    env,
    detached: true,
});
child.unref(); // don't keep the parent alive for the child

Then in the timer callback:

const timer = setTimeout(() => {
    timedOut = true;
    try {
        process.kill(-child.pid!, 'SIGKILL'); // kills entire process group
    } catch (_) {
        /* already exited */
    }
    reject(new Error(`Agent process timed out after ${timeoutMs / 1000}s`));
}, timeoutMs);

Note: process.kill(-pgid) works on Linux/macOS. On Windows the equivalent is taskkill /F /T /PID, which would need a conditional branch.
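One way the conditional branch mentioned in the note could look is sketched below. The helper names `planTreeKill` and `killProcessTree` are hypothetical, and the sketch assumes the child was spawned with detached: true so that its PID doubles as its process group ID on Linux/macOS.

```typescript
import { spawnSync } from 'node:child_process';

type KillPlan =
    | { kind: 'taskkill'; command: 'taskkill'; args: string[] }
    | { kind: 'group-signal'; target: number; signal: 'SIGKILL' };

// Pure helper: decide how to kill the tree rooted at `pid` on a given platform.
export function planTreeKill(pid: number, platform: string = process.platform): KillPlan {
    if (platform === 'win32') {
        // /F forces termination, /T includes child processes.
        return { kind: 'taskkill', command: 'taskkill', args: ['/F', '/T', '/PID', String(pid)] };
    }
    // A negative PID addresses the whole process group (POSIX).
    return { kind: 'group-signal', target: -pid, signal: 'SIGKILL' };
}

// Executes the plan; errors are swallowed because the tree may already be gone.
export function killProcessTree(pid: number): void {
    const plan = planTreeKill(pid);
    try {
        if (plan.kind === 'taskkill') {
            spawnSync(plan.command, plan.args, { stdio: 'ignore' });
        } else {
            process.kill(plan.target, plan.signal);
        }
    } catch {
        // Process or group already exited.
    }
}
```

Splitting the decision from the side effect keeps the platform logic checkable without actually killing anything.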

greptile-apps · 2026-03-15T15:13:46Z

packages/core/src/invoke.ts

 import { ensureAgentDirectory, buildSystemPrompt } from './agent';

-export async function runCommand(command: string, args: string[], cwd?: string, envOverrides?: Record<string, string>): Promise<string> {
+export async function runCommand(command: string, args: string[], cwd?: string, envOverrides?: Record<string, string>, timeoutMs = 5 * 60 * 1000): Promise<string> {


5-minute default silently applies to all existing callers

Every call site (runCommand('claude', ...), runCommand('codex', ...), runCommand('opencode', ...)) now has a hard 5-minute ceiling with no way to override it via agent config. Complex agentic tasks (large refactors, multi-file code generation, long-running research loops) can legitimately exceed 5 minutes, and the caller will receive a timeout error indistinguishable from a real hang.

Consider threading the timeout through AgentConfig so operators can tune it per-agent:

// In AgentConfig type
timeout_ms?: number;

// In invokeAgent
const timeoutMs = agent.timeout_ms ?? 5 * 60 * 1000;
return await runCommand('claude', claudeArgs, workingDir, envOverrides, timeoutMs);

This keeps the backward-compatible default while giving operators an escape hatch for agents that are expected to run longer tasks.
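If the timeout does come from operator-edited config, it may be worth guarding against bad values before they reach the timer. The following is a sketch under stated assumptions: `resolveTimeoutMs` and `AgentConfigSketch` are hypothetical names, and falling back to the default for non-positive or non-finite values is one possible policy, not something the PR specifies.

```typescript
const DEFAULT_TIMEOUT_MS = 5 * 60 * 1000; // the PR's 5-minute default

// Hypothetical subset of AgentConfig with the proposed field.
interface AgentConfigSketch {
    timeout_ms?: number;
}

// Returns the per-agent timeout when it is a usable number, else the default.
function resolveTimeoutMs(agent: AgentConfigSketch, fallback: number = DEFAULT_TIMEOUT_MS): number {
    const value = agent.timeout_ms;
    if (value === undefined) return fallback;
    // Assumed policy: ignore zero, negative, NaN, or infinite values.
    if (!Number.isFinite(value) || value <= 0) return fallback;
    return value;
}
```

For example, an agent configured with `timeout_ms: 30 * 60 * 1000` would get a 30-minute ceiling, while an agent with no setting keeps the 5-minute default.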

- Add settings.json to .gitignore to prevent accidental commits of configuration
- Create settings.example.json as a template for future developers
- Prevents exposure of agent configurations, working directories, and future sensitive data
- settings.json currently clean (no API keys/tokens) but this prevents future risks

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

child.kill('SIGKILL') only terminated the top-level claude/codex/opencode process. Any sub-processes it spawned (git, compilers, shell commands, MCP servers) survived the kill and accumulated as orphans across retries. Fix: spawn with detached:true to put the child in its own process group, then use process.kill(-pid, 'SIGKILL') on timeout to take out the entire tree. Falls back to child.kill() if the process has already exited. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jcenters and others added 3 commits March 15, 2026 18:08

Restore settings.json with agents configuration

40f89a0

- Re-add settings.json to git tracking
- Contains jobs, fitness, content agents
- Prevents accidental config loss

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Restore full agent configuration with original models and heartbeat

cecb5e0

- @main: Sonnet (no heartbeat)
- @content: Sonnet + heartbeat every 4h
- @Fitness: Haiku
- @jobs: Haiku

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions bot mentioned this pull request Mar 16, 2026

🦞 OpenClaw Ecosystem Daily Report 2026-03-16 gsscsd/big_model_radar#44

Open

jcenters wants to merge 5 commits into TinyAGI:main from jcenters:pr/agent-timeout
