Decision #4 — agentic-orchestrator

Process Supervision

Why OTP-style “let it crash” over defensive coding, supervisord, or PM2

The Question

How should the orchestrator handle agent crashes, hangs, and hallucinations? AI agents are inherently unreliable. A Claude subprocess can silently stall mid-task, return a malformed response that propagates corrupt state, or consume unbounded CPU in a reasoning loop. The supervision model chosen determines whether a single agent failure cascades to the entire orchestrator or is cleanly isolated, restarted, and reported without disrupting other running agents.

Options Considered

OTP-Style Hierarchical Supervision
Erlang’s “let it crash” philosophy, reimplemented in Go
Chosen
Pros
  • 35 years of production proof in telecom-grade systems
  • Failure isolation — one agent crash cannot corrupt supervisor state
  • Exponential backoff + circuit breaker prevents thundering-herd restarts
  • Hierarchical tree lets each supervisor manage only its direct children
  • Health probes catch silent hangs that error handlers miss entirely
Cons
  • Must build from scratch in Go — no off-the-shelf OTP port exists
  • Threshold tuning (crash window, backoff cap) requires production data
Defensive Coding
Wrap every call in try/catch; log and recover inline
Rejected
Pros
  • Familiar pattern — every developer already knows how to do this
  • No extra infrastructure; errors handled at the call site
Cons
  • Does not scale for AI agents — hallucinations produce no exception to catch
  • Silent hangs are undetectable without external heartbeat monitoring
  • Inline recovery logic bleeds into business logic, making code hard to reason about
  • No systematic backoff — a crashing agent can be relaunched in a tight loop
supervisord
Config-driven process manager for Unix daemons
Rejected
Pros
  • Battle-tested and well-documented
  • Zero code to write — purely config-driven restart policies
Cons
  • Coarse-grained — all agents share a single global restart policy
  • No per-agent circuit breaker or task-aware backoff
  • Cannot inspect agent health beyond OS exit code
  • External daemon creates a deployment dependency for a self-contained binary
systemd / PM2
OS-level process management with built-in restart units
Rejected
Pros
  • Native to Linux and Node ecosystems respectively
  • Built-in restart and resource limits with no extra binaries
Cons
  • No application awareness — restart decisions based solely on exit code
  • Cannot distinguish a legitimate agent completion from a crash
  • PM2 is a Node.js tool; adding it to a Go binary is an unnecessary dependency
  • systemd unit management for ephemeral agent processes is operationally cumbersome

Restart Strategy

Two paths through the supervision loop: the happy path where heartbeats land on schedule, and the crash path where a stale heartbeat triggers the full recovery sequence. The circuit breaker sits above both paths and gates all restart attempts.

Supervision Loop — Happy Path & Crash Path

Happy path: the agent goroutine emits a heartbeat every 5 s; the supervisor records it and the loop continues.

Crash path: a heartbeat stale for more than 15 s marks the agent as hung. EventAgentFailed is emitted, RecordCrash captures the timestamp and reason, and a backoff delay with exponential jitter gates the restart attempt.

Backoff schedule: 1 s (attempt 1), 2 s (attempt 2), 4 s (attempt 3), 8 s (attempt 4), 16 s (attempt 5), capped at 60 s, with ±20% jitter on every step.

CIRCUIT OPEN: 5 crashes within 60 s trips the circuit breaker. The agent is suspended and marked NEEDS_INTERVENTION. No further restart attempts are made until a human resolves the sentinel state.

The Decision

OTP

Erlang’s supervision tree, adapted for Go goroutines

Each agent is supervised by a Supervisor struct that holds restart policy, crash history, and circuit-breaker state independently from all other agents. A root supervisor owns all child supervisors, so a complete agent subtree can be stopped or restarted without touching the orchestrator’s own event loop. Clean failure boundaries are the structural guarantee, not a runtime hope.

HB

Active health probes via heartbeat channels

Each supervised goroutine writes to a heartbeat channel on a regular interval. The supervisor monitors the channel with a configurable timeout (default 15 s). If no heartbeat arrives within the window, the agent is considered hung and EventAgentFailed is emitted — catching silent stalls that produce no error, panic, or non-zero exit code.

CB

Circuit breaker prevents restart storms

The crash history ring buffer stores the timestamps of the last N failures. When the density of crashes within a sliding 60 s window exceeds the threshold, the circuit opens and all restart attempts cease. The agent enters NEEDS_INTERVENTION state, surfaced in the API and UI. This prevents a systematically broken agent from consuming resources indefinitely.

ISO

Failure isolation: no shared mutable supervisor state

Supervisor state (crash count, circuit status, backoff timer) lives inside the supervisor struct, never in a global map. An agent crashing cannot corrupt the crash counter of a sibling agent. Go’s goroutine model means each supervisor runs its own select loop; panics are caught with recover() inside the agent goroutine wrapper, converted to a structured error, and handed to the supervisor through a typed channel.
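The panic-to-channel handoff looks roughly like this; `agentError` and `runAgent` are hypothetical names standing in for the orchestrator's wrapper:

```go
package main

import "fmt"

// agentError is the structured failure report handed to the supervisor
// over a typed channel. Illustrative type, not the real orchestrator's.
type agentError struct {
	AgentID string
	Reason  string
}

// runAgent wraps the agent body so a panic becomes a value on the
// supervisor's failure channel instead of crashing the whole process.
func runAgent(id string, body func(), failures chan<- agentError) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				failures <- agentError{AgentID: id, Reason: fmt.Sprint(r)}
			}
		}()
		body()
	}()
}

func main() {
	failures := make(chan agentError, 1)
	runAgent("agent-1", func() { panic("malformed model response") }, failures)
	err := <-failures
	// prints: supervisor saw: agent-1 failed: malformed model response
	fmt.Printf("supervisor saw: %s failed: %s\n", err.AgentID, err.Reason)
}
```

The supervisor never shares memory with the agent; it only ever sees the structured error value, which is what keeps a crashing agent from corrupting sibling state.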

Why not just recover() everywhere?
recover() catches panics but not hangs, and requires the developer to reason about partial state after each recovery. OTP inverts this: assume the agent is broken, discard its state entirely, and restart from a clean initial state. This is the “let it crash” insight: clean restarts are cheaper than safe recovery from unknown corruption.

Trade-offs Accepted

Custom Go implementation.  There is no mature OTP port for Go. The supervision primitives — supervisor struct, heartbeat channel, crash ring buffer, circuit breaker — must be written and maintained as first-party code. This is intentional: a thin, purpose-built layer is preferable to a heavyweight framework that does not match Go’s concurrency idioms.
Threshold tuning requires production data.  The crash window (60 s), crash count (5), heartbeat timeout (15 s), and backoff cap (60 s) are initial estimates. Real agent workloads may need these adjusted. The values are configurable at construction time and will be refined based on observed failure rates once the orchestrator runs real tasks.
Jitter introduces non-determinism.  The ±20% jitter on each backoff step means restart timing is not reproducible in tests without seeding the RNG. Tests that assert on restart timing must either inject a deterministic clock or test the behavior bounds rather than exact timestamps.
Sentinel state requires operator action.  When a circuit opens, no automated resolution path exists by design. A human must inspect the agent, resolve the root cause, and reset the supervisor via the management API. This is a feature, not a gap — automatically reopening a circuit around a broken AI agent would mask the underlying model or prompt issue indefinitely.