Supervisor process that spawns, monitors, and coordinates AI agent workers — with OTP-style fault tolerance, a DAG task engine, gRPC control plane, and Redis state persistence.
Ctrl/Cmd + wheel to zoom. Scroll to pan. Double-click to fit.
ApplyEvent call returns an AgentSnapshot
with PreviousState, State, and Event. The Machine has no internal
mutable state — all state lives in the StateStore. Callers serialize per-agent via the supervisor's
per-agent mutex.
Manages the agent pool: heartbeat monitoring, task assignment loop, crash handling with OTP-style restart policy (exponential backoff + circuit breaker). Per-agent cooldown and circuit-open maps gate assignment eligibility.
Pure state machine. Reads current state from store, validates transition via the transition table, writes new state back. Returns immutable AgentSnapshot. No internal mutable state.
DAG task engine. Topological sort, parallel fork execution, node status tracking. StoreSubmitter enqueues ready nodes into the task queue. Executor manages concurrent execution contexts.
gRPC service handlers implementing OrchestratorService: RegisterAgent, SubmitTask, GetAgentState, SubmitDAG, GetDAGStatus. Wires to supervisor and DAG executor.
StateStore interface: agent hashes, task queue (sorted set), event stream. MockStore for testing, RedisStore for production. Immutable pattern — SetAgentState returns new copy.
Health probe endpoints: liveness (is the process alive?), readiness (can it accept work?). Checks Redis connectivity and supervisor state.
Client calls SubmitTask via gRPC. Handler enqueues task into the sorted-set queue with priority.
Supervisor's taskAssignLoop dequeues, finds an idle agent (skipping cooldown/circuit-open), applies EventTaskAssigned.
Agent transitions: Assigned → Working. Agent processes the task, streams output. State tracked in Redis hashes.
Agent transitions: Working → Reporting → Idle. Output delivered, policy records success, cooldown cleared.
On crash: EventAgentFailed resets to Idle. RestartPolicy computes backoff. Circuit breaker opens after threshold exceeded.
| Feature | Issue | PR | Status | Scope |
|---|---|---|---|---|
| F1 — Foundation | #5 | #10 | merged | go mod, proto, Redis state, agent state machine |
| F2 — gRPC + Supervisor | #6 | #11 | merged | gRPC server, supervisor heartbeat + assign loops |
| F3 — DAG Task Engine | #7 | #12 | merged | Topological sort, parallel fork, executor |
| F2-F3 Integration | #19 | #20 | merged | Wire DAG executor into gRPC handlers |
| F4 — Health Probes | #8 | #13 | merged | Liveness, readiness endpoints |
| F5 — E2E Lifecycle Tests | #9 | #21 | merged | 11 E2E scenarios, supervisor↔policy wiring |
| Optimistic Locking | #15 | #22 | merged | CompareAndSetAgentState for StateStore |
| Event Stream Consumption | #17 | #23 | merged | Stream consumption methods for StateStore |
| Gateway Interface | #32 | — | pending | Gateway types, judge-router, LiteLLM client |
| Terminal Substrate | #42 | — | pending | tmux session management, command injection |
Crash handling flow — from heartbeat stale detection to circuit breaker
applyEvent, crashAgent pre-populates
a 24-hour cooldown sentinel under sv.mu.Lock(). This closes the TOCTOU race window: findIdleAgent
reads agentCooldown under sv.mu.RLock() and skips the agent immediately. After applyEvent,
the sentinel is either cleaned up (spurious crash) or overwritten with the real backoff duration.