fix: kill stuck agent processes after 5-minute timeout #218
jcenters wants to merge 5 commits into TinyAGI:main
runCommand() now sets a SIGKILL timer (default 5 min) on spawned agent processes. On timeout it rejects the promise, which triggers the existing failMessage() path in processMessage() — resetting the message to pending for retry rather than leaving it stuck forever. Previously a hung Claude process would hold a message in 'processing' indefinitely, causing the "typing forever" symptom in Telegram even after the queue recovered the stale status in SQLite. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jcenters is attempting to deploy a commit to the AGI Team on Vercel. A member of the Team first needs to authorize it.
Greptile Summary
This PR adds a 5-minute [...] The timer/flag plumbing is correct [...] Key issues found:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant PM as processMessage()
participant IA as invokeAgent()
participant RC as runCommand()
participant CP as ChildProcess (claude/codex/opencode)
participant GC as Grandchild Processes
PM->>IA: invoke agent with message
IA->>RC: runCommand(cmd, args, cwd, env, timeoutMs=5min)
RC->>CP: spawn(command, args)
CP->>GC: spawns sub-processes (git, compilers, tools)
Note over RC: setTimeout(5 min)
alt Normal completion (< 5 min)
CP-->>RC: close(code=0)
RC->>RC: clearTimeout(timer)
RC-->>IA: resolve(stdout)
IA-->>PM: return response
else Timeout (>= 5 min)
RC->>RC: timedOut = true
RC->>CP: kill('SIGKILL')
Note over GC: ⚠️ Grandchildren NOT killed — orphaned
CP-->>RC: close event fires
RC->>RC: if (timedOut) return — skip double reject
RC-->>IA: reject(Error: timed out)
IA-->>PM: throw error
PM->>PM: failMessage() → pending or dead
end
Last reviewed commit: 13b0c40
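The timeout branch of the diagram can be sketched as follows. This is a minimal illustration, not the PR's actual code: `runWithTimeout` is a hypothetical name, and only the `timedOut` flag, the `SIGKILL` timer, and the skip-on-close guard are taken from the review above.

```typescript
import { spawn } from 'node:child_process';

// Hypothetical helper mirroring the diagram; not the PR's runCommand().
function runWithTimeout(command: string, args: string[], timeoutMs: number): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const child = spawn(command, args);
    child.stdin.end(); // the child gets no input
    let stdout = '';
    let timedOut = false;

    child.stdout.on('data', (chunk: Buffer) => { stdout += chunk.toString(); });
    child.stderr.resume(); // drain stderr so the child cannot block on a full pipe

    const timer = setTimeout(() => {
      timedOut = true;
      child.kill('SIGKILL'); // direct child only; grandchildren survive
      reject(new Error(`Agent process timed out after ${timeoutMs / 1000}s`));
    }, timeoutMs);

    child.on('error', (err) => {
      clearTimeout(timer);
      if (!timedOut) reject(err);
    });

    child.on('close', (code) => {
      clearTimeout(timer);
      if (timedOut) return; // already rejected by the timer
      if (code === 0) resolve(stdout);
      else reject(new Error(`Command exited with code ${code}`));
    });
  });
}
```

Rejecting inside the timer rather than in the `close` handler means the caller is unblocked immediately, even if the killed process is slow to report its exit.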
packages/core/src/invoke.ts
const timer = setTimeout(() => {
  timedOut = true;
  child.kill('SIGKILL');
Orphaned grandchild processes not killed on timeout
child.kill('SIGKILL') only terminates the direct child process (the claude/codex/opencode CLI). These CLIs themselves spawn sub-processes (e.g., shell commands, compilers, linters, git operations) to do their work. Because the spawned process is in the same process group as the parent (no detached: true was set), those grandchildren are reparented to PID 1 and continue running after the timeout. This can:
- Leak file I/O and CPU for the duration of those sub-tasks
- Hold file locks or SQLite connections that interfere with the fresh retry
- On repeated timeouts (up to 5 retries), accumulate many orphaned process trees
To kill the entire process subtree, spawn with detached: true (which creates a new process group) and send SIGKILL to the negative PID:
const child = spawn(command, args, {
  cwd: cwd || SCRIPT_DIR,
  stdio: ['ignore', 'pipe', 'pipe'],
  env,
  detached: true,
});
child.unref(); // don't keep the parent alive for the child
Then in the timer callback:
const timer = setTimeout(() => {
  timedOut = true;
  try {
    process.kill(-child.pid!, 'SIGKILL'); // kills entire process group
  } catch (_) { /* already exited */ }
  reject(new Error(`Agent process timed out after ${timeoutMs / 1000}s`));
}, timeoutMs);
Note: process.kill(-pgid) works on Linux/macOS. On Windows the equivalent is taskkill /F /T /PID, which would need a conditional branch.
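The conditional branch mentioned in the note could look roughly like this. `killProcessTree` is a hypothetical helper name, and the sketch assumes the child was spawned with detached: true on Linux/macOS so that its process group ID equals its PID.

```typescript
import { spawn, ChildProcess } from 'node:child_process';

// Hypothetical helper: kill a child and everything it spawned.
// Assumes the child was started with `detached: true` on POSIX systems,
// so it leads its own process group (pgid === child.pid).
function killProcessTree(child: ChildProcess): void {
  const pid = child.pid;
  if (pid === undefined) return; // spawn failed; nothing to kill

  if (process.platform === 'win32') {
    // /T terminates the tree rooted at pid, /F forces termination.
    spawn('taskkill', ['/F', '/T', '/PID', String(pid)], { stdio: 'ignore' });
    return;
  }

  try {
    process.kill(-pid, 'SIGKILL'); // negative pid targets the whole process group
  } catch {
    child.kill('SIGKILL'); // group lookup failed; fall back to the direct child
  }
}
```

Without detached: true the negative-PID call would target a group that does not exist (or, worse, the parent's own group), so the two changes belong together.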
import { ensureAgentDirectory, buildSystemPrompt } from './agent';

- export async function runCommand(command: string, args: string[], cwd?: string, envOverrides?: Record<string, string>): Promise<string> {
+ export async function runCommand(command: string, args: string[], cwd?: string, envOverrides?: Record<string, string>, timeoutMs = 5 * 60 * 1000): Promise<string> {
5-minute default silently applies to all existing callers
Every call site (runCommand('claude', ...), runCommand('codex', ...), runCommand('opencode', ...)) now has a hard 5-minute ceiling with no way to override it via agent config. Complex agentic tasks (large refactors, multi-file code generation, long-running research loops) can legitimately exceed 5 minutes, and the caller will receive a timeout error indistinguishable from a real hang.
Consider threading the timeout through AgentConfig so operators can tune it per-agent:
// In AgentConfig type
timeout_ms?: number;

// In invokeAgent
const timeoutMs = agent.timeout_ms ?? 5 * 60 * 1000;
return await runCommand('claude', claudeArgs, workingDir, envOverrides, timeoutMs);

This keeps the backward-compatible default while giving operators an escape hatch for agents that are expected to run longer tasks.
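As a possible extension of the suggestion, the timeout lookup can be isolated in a small function, and a dedicated error type lets the caller tell a timeout apart from other failures. Both `resolveTimeoutMs` and `AgentTimeoutError` are hypothetical names; only the `timeout_ms` field and the 5-minute default come from the comment above.

```typescript
const DEFAULT_TIMEOUT_MS = 5 * 60 * 1000;

// Only the field relevant here; the real AgentConfig is assumed to have more.
interface AgentConfig {
  timeout_ms?: number;
}

// Hypothetical error type so callers can distinguish a timeout by name.
class AgentTimeoutError extends Error {
  readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    super(`Agent process timed out after ${timeoutMs / 1000}s`);
    this.name = 'AgentTimeoutError';
    this.timeoutMs = timeoutMs;
  }
}

// Fall back to the default for missing, non-finite, or non-positive values.
function resolveTimeoutMs(agent: AgentConfig): number {
  const configured = agent.timeout_ms;
  if (configured === undefined || !Number.isFinite(configured) || configured <= 0) {
    return DEFAULT_TIMEOUT_MS;
  }
  return configured;
}
```

Validating the configured value guards against a typo such as a negative number turning into an immediate timeout.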
- Re-add settings.json to git tracking
- Contains jobs, fitness, content agents
- Prevents accidental config loss
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add settings.json to .gitignore to prevent accidental commits of configuration
- Create settings.example.json as a template for future developers
- Prevents exposure of agent configurations, working directories, and future sensitive data
- settings.json currently clean (no API keys/tokens) but this prevents future risks
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
child.kill('SIGKILL') only terminated the top-level claude/codex/opencode
process. Any sub-processes it spawned (git, compilers, shell commands,
MCP servers) survived the kill and accumulated as orphans across retries.
Fix: spawn with detached:true to put the child in its own process group,
then use process.kill(-pid, 'SIGKILL') on timeout to take out the entire
tree. Falls back to child.kill() if the process has already exited.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem
A hung Claude (or Codex/OpenCode) process holds a message in processing status indefinitely. The queue's stale recovery resets the SQLite status back to pending every minute, but doesn't kill the actual OS process, so the message keeps getting re-queued while the original process keeps running. In messaging channels (e.g. Telegram) this shows as the bot typing forever without ever responding.
Observed: a Claude agent process ran for ~70 minutes on a single message, blocking all subsequent messages for that agent.
Fix
Add a timeoutMs parameter (default: 5 minutes) to runCommand() in packages/core/src/invoke.ts. On timeout, the spawned process is sent SIGKILL and the promise rejects with a timeout error.
The existing processMessage() catch block already handles this correctly: it calls failMessage(), which resets the message to pending (or dead after 5 retries). No additional error handling needed.
Behavior after this fix
dead (visible via queue status)
Notes
timeoutMs is an optional parameter with a backward-compatible default, so all existing runCommand() callers are unaffected
🤖 Generated with Claude Code
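The retry behavior described under Fix can be modeled as a pure state transition. This is a sketch under assumptions: `QueuedMessage`, `retryCount`, and `nextStateAfterFailure` are hypothetical names, and the exact point at which the real failMessage() marks a message dead (here, on the fifth failure) is a guess based on the "5 retries" wording.

```typescript
type MessageStatus = 'pending' | 'processing' | 'dead';

// Hypothetical shape; the real queue row likely has more columns.
interface QueuedMessage {
  id: number;
  status: MessageStatus;
  retryCount: number;
}

const MAX_RETRIES = 5; // assumed threshold, taken from "dead after 5 retries"

// Pure model of what a failMessage()-style handler does after a timeout:
// count the failure, then either re-queue the message or give up on it.
function nextStateAfterFailure(msg: QueuedMessage): QueuedMessage {
  const retryCount = msg.retryCount + 1;
  const status: MessageStatus = retryCount >= MAX_RETRIES ? 'dead' : 'pending';
  return { ...msg, retryCount, status };
}
```

Under these assumptions a message that always hangs occupies an agent for at most five timeout windows before it is marked dead.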