Epic #1 — agenKic-orKistrator

Go Orchestrator Core

Supervisor process that spawns, monitors, and coordinates AI agent workers — with OTP-style fault tolerance, a DAG task engine, gRPC control plane, and Redis state persistence.

9features merged
0PRs open
10Go packages
11E2E scenarios

System Architecture

Ctrl/Cmd + wheel to zoom. Scroll to pan. Double-click to fit.

Loading...
Core packages
State store
Future (not yet implemented)
Proto definitions

Agent State Machine

Idle
TaskAssigned
Assigned
WorkStarted
Working
OutputReady
Reporting
OutputDelivered
Idle
EventAgentFailed — resets to Idle from any state
Immutable snapshots. Every ApplyEvent call returns an AgentSnapshot with PreviousState, State, and Event. The Machine has no internal mutable state — all state lives in the StateStore. Callers serialize per-agent via the supervisor's per-agent mutex.

Module Breakdown

SV

supervisor

internal/supervisor/

Manages the agent pool: heartbeat monitoring, task assignment loop, crash handling with OTP-style restart policy (exponential backoff + circuit breaker). Per-agent cooldown and circuit-open maps gate assignment eligibility.

  • supervisor.go
  • restart.go
  • errors.go
  • export_e2e.go
AG

agent

internal/agent/

Pure state machine. Reads current state from store, validates transition via the transition table, writes new state back. Returns immutable AgentSnapshot. No internal mutable state.

  • types.go
  • machine.go
  • transition.go
  • errors.go
DG

dag

internal/dag/

DAG task engine. Topological sort, parallel fork execution, node status tracking. StoreSubmitter enqueues ready nodes into the task queue. Executor manages concurrent execution contexts.

  • dag.go
  • executor.go
  • submitter.go
GP

ipc

internal/ipc/

gRPC service handlers implementing OrchestratorService: RegisterAgent, SubmitTask, GetAgentState, SubmitDAG, GetDAGStatus. Wires to supervisor and DAG executor.

  • server.go
  • handlers.go
ST

state

internal/state/

StateStore interface: agent hashes, task queue (sorted set), event stream. MockStore for testing, RedisStore for production. Immutable pattern — SetAgentState returns new copy.

  • store.go
  • redis.go
  • mock.go
  • errors.go
HP

health

internal/health/

Health probe endpoints: liveness (is the process alive?), readiness (can it accept work?). Checks Redis connectivity and supervisor state.

  • health.go
  • probes.go

Task Data Flow

1

Submit

Client calls SubmitTask via gRPC. Handler enqueues task into the sorted-set queue with priority.

2

Assign

Supervisor's taskAssignLoop dequeues, finds an idle agent (skipping cooldown/circuit-open), applies EventTaskAssigned.

3

Execute

Agent transitions: Assigned → Working. Agent processes the task, streams output. State tracked in Redis hashes.

4

Report

Agent transitions: Working → Reporting → Idle. Output delivered, policy records success, cooldown cleared.

5

Fail & Recover

On crash: EventAgentFailed resets to Idle. RestartPolicy computes backoff. Circuit breaker opens after threshold exceeded.

Implementation Status

Feature Issue PR Status Scope
F1 — Foundation #5 #10 merged go mod, proto, Redis state, agent state machine
F2 — gRPC + Supervisor #6 #11 merged gRPC server, supervisor heartbeat + assign loops
F3 — DAG Task Engine #7 #12 merged Topological sort, parallel fork, executor
F2-F3 Integration #19 #20 merged Wire DAG executor into gRPC handlers
F4 — Health Probes #8 #13 merged Liveness, readiness endpoints
F5 — E2E Lifecycle Tests #9 #21 merged 11 E2E scenarios, supervisor↔policy wiring
Optimistic Locking #15 #22 merged CompareAndSetAgentState for StateStore
Event Stream Consumption #17 #23 merged Stream consumption methods for StateStore
Gateway Interface #32 pending Gateway types, judge-router, LiteLLM client
Terminal Substrate #42 pending tmux session management, command injection

Supervision & Restart Policy

Crash handling flow — from heartbeat stale detection to circuit breaker

Loading...
Sentinel pattern. Before calling applyEvent, crashAgent pre-populates a 24-hour cooldown sentinel under sv.mu.Lock(). This closes the TOCTOU race window: findIdleAgent reads agentCooldown under sv.mu.RLock() and skips the agent immediately. After applyEvent, the sentinel is either cleaned up (spurious crash) or overwritten with the real backoff duration.
agenKic-orKistrator — Epic #1: Go Orchestrator Core Generated 2026-03-22