Decision #4 — agentic-orchestrator

Process Supervision

Why OTP-style “let it crash” over defensive coding, supervisord, or PM2

The Question

How should the orchestrator handle agent crashes, hangs, and hallucinations? AI agents are inherently unreliable. A Claude subprocess can silently stall mid-task, return a malformed response that propagates corrupt state, or consume unbounded CPU in a reasoning loop. The supervision model chosen determines whether a single agent failure cascades to the entire orchestrator or is cleanly isolated, restarted, and reported without disrupting other running agents.

Options Considered

OTP-Style Hierarchical Supervision
Erlang’s “let it crash” philosophy, reimplemented in Go
Chosen
Pros
  • 35 years of production proof in telecom-grade systems
  • Failure isolation — one agent crash cannot corrupt supervisor state
  • Exponential backoff + circuit breaker prevents thundering-herd restarts
  • Hierarchical tree lets each supervisor manage only its direct children
  • Health probes catch silent hangs that error handlers miss entirely
Cons
  • Must build from scratch in Go — no off-the-shelf OTP port exists
  • Threshold tuning (crash window, backoff cap) requires production data
Defensive Coding
Wrap every call in try/catch; log and recover inline
Rejected
Pros
  • Familiar pattern — every developer already knows how to do this
  • No extra infrastructure; errors handled at the call site
Cons
  • Does not scale for AI agents — hallucinations produce no exception to catch
  • Silent hangs are undetectable without external heartbeat monitoring
  • Inline recovery logic bleeds into business logic, making code hard to reason about
  • No systematic backoff — a crashing agent can be relaunched in a tight loop
supervisord
Config-driven process manager for Unix daemons
Rejected
Pros
  • Battle-tested and well-documented
  • Zero code to write — purely config-driven restart policies
Cons
  • Coarse-grained — all agents share a single global restart policy
  • No per-agent circuit breaker or task-aware backoff
  • Cannot inspect agent health beyond OS exit code
  • External daemon creates a deployment dependency for a self-contained binary
systemd / PM2
OS-level process management with built-in restart units
Rejected
Pros
  • Native to Linux and Node ecosystems respectively
  • Built-in restart and resource limits with no extra binaries
Cons
  • No application awareness — restart decisions based solely on exit code
  • Cannot distinguish a legitimate agent completion from a crash
  • PM2 is a Node.js tool; adding it to a Go binary is an unnecessary dependency
  • systemd unit management for ephemeral agent processes is operationally cumbersome

Restart Strategy

Two paths through the supervision loop: the happy path where heartbeats land on schedule, and the crash path where a stale heartbeat triggers the full recovery sequence. The circuit breaker sits above both paths and gates all restart attempts.

Supervision Loop — Happy Path & Crash Path

Happy path: the agent goroutine emits a heartbeat every 5 s; the supervisor records it and the loop continues.

Crash path: a heartbeat stale for more than 15 s marks the agent as hung. EventAgentFailed is emitted, RecordCrash captures the timestamp and reason, and a backoff delay with exponential jitter gates the restart attempt.

Backoff schedule: 1 s (attempt 1), 2 s (attempt 2), 4 s (attempt 3), 8 s (attempt 4), 16 s (attempt 5), capped at 60 s, with ±20% jitter on every step.

CIRCUIT OPEN: 5 crashes within 60 s trips the circuit breaker. The agent is suspended and marked NEEDS_INTERVENTION. No further restart attempts are made until a human resolves the sentinel state.

The Decision

OTP

Erlang’s supervision tree, adapted for Go goroutines

Each agent is supervised by a Supervisor struct that holds restart policy, crash history, and circuit-breaker state independently from all other agents. A root supervisor owns all child supervisors, so a complete agent subtree can be stopped or restarted without touching the orchestrator’s own event loop. Clean failure boundaries are the structural guarantee, not a runtime hope.

HB

Active health probes via heartbeat channels

Each supervised goroutine writes to a heartbeat channel on a regular interval. The supervisor monitors the channel with a configurable timeout (default 15 s). If no heartbeat arrives within the window, the agent is considered hung and EventAgentFailed is emitted — catching silent stalls that produce no error, panic, or non-zero exit code.

CB

Circuit breaker prevents restart storms

The crash history ring buffer stores the timestamps of the last N failures. When the density of crashes within a sliding 60 s window exceeds the threshold, the circuit opens and all restart attempts cease. The agent enters NEEDS_INTERVENTION state, surfaced in the API and UI. This prevents a systematically broken agent from consuming resources indefinitely.

ISO

Failure isolation: no shared mutable supervisor state

Supervisor state (crash count, circuit status, backoff timer) lives inside the supervisor struct, never in a global map. An agent crashing cannot corrupt the crash counter of a sibling agent. Go’s goroutine model means each supervisor runs its own select loop; panics are caught with recover() inside the agent goroutine wrapper, converted to a structured error, and handed to the supervisor through a typed channel.
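The panic-to-channel handoff looks roughly like this; `agentError` and `runAgent` are hypothetical names standing in for the orchestrator's wrapper:

```go
package main

import "fmt"

// agentError is the structured failure report handed to the supervisor
// over a typed channel. Illustrative type, not the real orchestrator's.
type agentError struct {
	AgentID string
	Reason  string
}

// runAgent wraps the agent body so a panic becomes a value on the
// supervisor's failure channel instead of crashing the whole process.
func runAgent(id string, body func(), failures chan<- agentError) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				failures <- agentError{AgentID: id, Reason: fmt.Sprint(r)}
			}
		}()
		body()
	}()
}

func main() {
	failures := make(chan agentError, 1)
	runAgent("agent-1", func() { panic("malformed model response") }, failures)
	err := <-failures
	// prints: supervisor saw: agent-1 failed: malformed model response
	fmt.Printf("supervisor saw: %s failed: %s\n", err.AgentID, err.Reason)
}
```

The supervisor never shares memory with the agent; it only ever sees the structured error value, which is what keeps a crashing agent from corrupting sibling state.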

Why not just recover() everywhere?
recover() catches panics but not hangs, and requires the developer to reason about partial state after each recovery. OTP inverts this: assume the agent is broken, discard its state entirely, and restart from a clean initial state. This is the “let it crash” insight: clean restarts are cheaper than safe recovery from unknown corruption.

Trade-offs Accepted

Custom Go implementation.  There is no mature OTP port for Go. The supervision primitives — supervisor struct, heartbeat channel, crash ring buffer, circuit breaker — must be written and maintained as first-party code. This is intentional: a thin, purpose-built layer is preferable to a heavyweight framework that does not match Go’s concurrency idioms.
Threshold tuning requires production data.  The crash window (60 s), crash count (5), heartbeat timeout (15 s), and backoff cap (60 s) are initial estimates. Real agent workloads may need these adjusted. The values are configurable at construction time and will be refined based on observed failure rates once the orchestrator runs real tasks.
Jitter introduces non-determinism.  The ±20% jitter on each backoff step means restart timing is not reproducible in tests without seeding the RNG. Tests that assert on restart timing must either inject a deterministic clock or test the behavior bounds rather than exact timestamps.
Sentinel state requires operator action.  When a circuit opens, no automated resolution path exists by design. A human must inspect the agent, resolve the root cause, and reset the supervisor via the management API. This is a feature, not a gap — automatically reopening a circuit around a broken AI agent would mask the underlying model or prompt issue indefinitely.