Why OTP-style “let it crash” over defensive coding, supervisord, or PM2
How should the orchestrator handle agent crashes, hangs, and hallucinations? AI agents are inherently unreliable. A Claude subprocess can silently stall mid-task, return a malformed response that propagates corrupt state, or consume unbounded CPU in a reasoning loop. The supervision model chosen determines whether a single agent failure cascades to the entire orchestrator or is cleanly isolated, restarted, and reported without disrupting other running agents.
Two paths through the supervision loop: the happy path where heartbeats land on schedule, and the crash path where a stale heartbeat triggers the full recovery sequence. The circuit breaker sits above both paths and gates all restart attempts.
Each agent is supervised by its own Supervisor struct, which holds the restart
policy, crash history, and circuit-breaker state independently of all other agents.
A root supervisor owns all child supervisors, so a complete agent subtree can be
stopped or restarted without touching the orchestrator’s own event loop.
Clean failure boundaries are the structural guarantee, not a runtime hope.
Each supervised goroutine writes to a heartbeat channel on a regular interval.
The supervisor monitors the channel with a configurable timeout (default 15 s).
If no heartbeat arrives within the window, the supervisor considers the agent hung
and emits EventAgentFailed. This catches silent stalls that produce
no error, panic, or non-zero exit code.
The crash history ring buffer stores the timestamps of the last N failures.
When the number of crashes within a sliding 60 s window exceeds the configured threshold,
the circuit opens and all restart attempts cease. The agent enters the
NEEDS_INTERVENTION state, which is surfaced in the API and UI.
This prevents a systematically broken agent from consuming resources indefinitely.
Supervisor state (crash count, circuit status, backoff timer) lives inside the
supervisor struct, never in a global map. An agent crashing cannot corrupt the
crash counter of a sibling agent. Each supervisor runs its own select loop in a
dedicated goroutine; panics are caught by a deferred recover() inside
the agent goroutine wrapper, converted to a structured error, and handed to the
supervisor through a typed channel.
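One way to write such a wrapper is sketched below. AgentError and RunAgent are hypothetical names, and the channel carries plain error values here; the real project may use a more specific type.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// AgentError is a hypothetical structured error built from a recovered panic.
type AgentError struct {
	AgentID string
	Panic   any    // value passed to panic()
	Stack   []byte // stack trace captured at recovery time
}

func (e *AgentError) Error() string {
	return fmt.Sprintf("agent %s panicked: %v", e.AgentID, e.Panic)
}

// RunAgent runs fn in a new goroutine and sends exactly one value on the
// returned channel: fn's own result on a normal return, or an *AgentError
// if fn panicked.
func RunAgent(id string, fn func() error) <-chan error {
	// Buffered so the agent goroutine can exit even if the supervisor is slow.
	errs := make(chan error, 1)
	go func() {
		defer func() {
			// recover only works inside a deferred function in the
			// goroutine that panicked, which is why the wrapper lives here.
			if r := recover(); r != nil {
				errs <- &AgentError{AgentID: id, Panic: r, Stack: debug.Stack()}
			}
		}()
		errs <- fn()
	}()
	return errs
}
```

The supervisor can then receive from this channel in its select loop next to the heartbeat timeout, so panics and hangs arrive through the same decision point.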
Defensive use of recover() catches panics but not hangs, and it forces the developer to
reason about partially updated state after every recovery. OTP inverts this: assume the agent is
broken, discard its state entirely, and restart from a clean initial state. This is
the “let it crash” insight: clean restarts are cheaper than safe recovery
from unknown corruption.