Feature #15 — Implementation

CompareAndSetAgentState

PR #22 · merged · 8 council reviews · epic/1-implement-go-orchestrator-core

Atomic optimistic locking for agent state transitions. Replaces blind SetAgentState writes with CompareAndSetAgentState — a CAS primitive backed by atomic Lua scripts in Redis. Closes the TOCTOU race window between state reads and writes in Machine.ApplyEvent, making concurrent supervisor operations safe without global locks.

How CAS flows through the system

Supervisor
tryAssignTask() acquires the per-agent mutex, calls Machine.ApplyEvent, and distinguishes *StateConflictError from hard failures via errors.As
Machine
ApplyEvent(): read state → validate transition → CompareAndSetAgentState(expected, next) (atomic)
StateStore
Interface method: CompareAndSetAgentState(ctx, agentID, expected, next) error
Redis
casScript: Lua script that performs the HGET compare and the conditional HSET in one atomic round-trip
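
The contract at the StateStore layer can be illustrated with a minimal sketch. It assumes states are plain strings and uses the hypothetical state names "idle" and "busy"; the memStore type, the error message text, and the single-method interface are illustrative, not the actual contents of store.go or mock.go. A mutex stands in for the atomicity that the Lua script provides in Redis.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// ErrAgentNotFound is returned when no state is stored for the agent.
var ErrAgentNotFound = errors.New("agent not found")

// StateConflictError reports that the stored state differed from the expected one.
type StateConflictError struct {
	Expected string
	Actual   string
}

func (e *StateConflictError) Error() string {
	return fmt.Sprintf("state conflict: expected %q, actual %q", e.Expected, e.Actual)
}

// StateStore is a trimmed-down, hypothetical version of the store interface.
type StateStore interface {
	CompareAndSetAgentState(ctx context.Context, agentID, expected, next string) error
}

// memStore is an illustrative in-memory store guarded by a mutex.
type memStore struct {
	mu     sync.Mutex
	states map[string]string
}

// CompareAndSetAgentState writes next only if the stored state equals expected.
func (s *memStore) CompareAndSetAgentState(_ context.Context, agentID, expected, next string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	current, ok := s.states[agentID]
	if !ok {
		return ErrAgentNotFound
	}
	if current != expected {
		return &StateConflictError{Expected: expected, Actual: current}
	}
	s.states[agentID] = next
	return nil
}

func main() {
	var store StateStore = &memStore{states: map[string]string{"agent-1": "idle"}}
	ctx := context.Background()
	fmt.Println(store.CompareAndSetAgentState(ctx, "agent-1", "idle", "busy")) // <nil>
	fmt.Println(store.CompareAndSetAgentState(ctx, "agent-1", "idle", "busy")) // state conflict: expected "idle", actual "busy"
}
```

The second call fails because the first one already moved the agent out of the expected state, which is the signal the upper layers react to.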

Agent lifecycle transitions

Every arrow in the lifecycle diagram is a CAS operation. Machine.ApplyEvent reads the current state, validates the transition via the pure ValidTransition(from, event) function, then atomically persists the new state with CompareAndSetAgentState(expected=current, next=target). If another goroutine changed the state between the read and the write, the CAS returns *StateConflictError instead of overwriting it.
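
The read, validate, CAS sequence described here can be sketched as follows. Everything beyond the names ApplyEvent, ValidTransition, CompareAndSetAgentState, StateConflictError and ErrAgentNotFound is an assumption: the State and Event types, the two states and two events, GetAgentState, ErrInvalidTransition, the return shape of ValidTransition, and the in-memory store (which omits ctx for brevity) are hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type State string
type Event string

// Hypothetical states and events, for illustration only.
const (
	StateIdle State = "idle"
	StateBusy State = "busy"

	EventAssign   Event = "assign"
	EventComplete Event = "complete"
)

var (
	ErrAgentNotFound     = errors.New("agent not found")
	ErrInvalidTransition = errors.New("invalid transition")
)

type StateConflictError struct{ Expected, Actual State }

func (e *StateConflictError) Error() string {
	return fmt.Sprintf("state conflict: expected %q, actual %q", e.Expected, e.Actual)
}

// ValidTransition is pure: it consults a fixed table and performs no I/O.
func ValidTransition(from State, event Event) (State, bool) {
	switch {
	case from == StateIdle && event == EventAssign:
		return StateBusy, true
	case from == StateBusy && event == EventComplete:
		return StateIdle, true
	}
	return "", false
}

// memStore is an in-memory stand-in for the real store.
type memStore struct {
	mu     sync.Mutex
	states map[string]State
}

func (s *memStore) GetAgentState(id string) (State, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.states[id]
	if !ok {
		return "", ErrAgentNotFound
	}
	return st, nil
}

func (s *memStore) CompareAndSetAgentState(id string, expected, next State) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	current, ok := s.states[id]
	if !ok {
		return ErrAgentNotFound
	}
	if current != expected {
		return &StateConflictError{Expected: expected, Actual: current}
	}
	s.states[id] = next
	return nil
}

type Machine struct{ store *memStore }

// ApplyEvent reads the state, validates the transition, then persists it
// with a CAS keyed on the state that was read.
func (m *Machine) ApplyEvent(agentID string, event Event) (State, error) {
	current, err := m.store.GetAgentState(agentID)
	if err != nil {
		return "", err
	}
	next, ok := ValidTransition(current, event)
	if !ok {
		return current, fmt.Errorf("%w: %s in state %s", ErrInvalidTransition, event, current)
	}
	if err := m.store.CompareAndSetAgentState(agentID, current, next); err != nil {
		return current, err // may be *StateConflictError
	}
	return next, nil
}

func main() {
	m := &Machine{store: &memStore{states: map[string]State{"agent-1": StateIdle}}}
	next, err := m.ApplyEvent("agent-1", EventAssign)
	fmt.Println(next, err) // busy <nil>
	next, err = m.ApplyEvent("agent-1", EventAssign)
	fmt.Println(next, err) // busy invalid transition: assign in state busy
}
```

Because the CAS is keyed on the value that was read, validation and persistence cannot be split by a concurrent writer without the write failing.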

Atomic compare-and-swap sequence
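
The race that the CAS closes can be replayed deterministically. The sketch below is illustrative only: the store type, its get and cas helpers, and the "idle"/"busy" state names are hypothetical. Two callers both read the state before either one writes; only the first write goes through.

```go
package main

import (
	"fmt"
	"sync"
)

// StateConflictError mirrors the error type named in this document; its fields are assumed.
type StateConflictError struct{ Expected, Actual string }

func (e *StateConflictError) Error() string {
	return fmt.Sprintf("state conflict: expected %q, actual %q", e.Expected, e.Actual)
}

// store is a hypothetical single-process stand-in for Redis.
type store struct {
	mu     sync.Mutex
	states map[string]string
}

func (s *store) get(id string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.states[id]
}

// cas writes next only if the stored state still equals expected.
func (s *store) cas(id, expected, next string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if current := s.states[id]; current != expected {
		return &StateConflictError{Expected: expected, Actual: current}
	}
	s.states[id] = next
	return nil
}

// interleave replays the race: both callers read before either one writes.
func interleave(s *store, id, next string) (errA, errB error) {
	readA := s.get(id)
	readB := s.get(id)
	errA = s.cas(id, readA, next) // first writer wins
	errB = s.cas(id, readB, next) // second writer sees the changed state
	return errA, errB
}

func main() {
	s := &store{states: map[string]string{"agent-1": "idle"}}
	errA, errB := interleave(s, "agent-1", "busy")
	fmt.Println("caller A:", errA) // caller A: <nil>
	fmt.Println("caller B:", errB) // caller B: state conflict: expected "idle", actual "busy"
}
```

With a blind write in place of cas, both callers would report success and the second would silently overwrite the first.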


Atomic Redis operations

casScript redis.go:123-133

Atomic compare-and-swap for agent state. Single Redis round-trip via EVAL.

local current = redis.call('HGET', KEYS[1], 'state')
if current == false then
    return -1  -- agent not found
end
if current ~= ARGV[1] then
    return {0, current}  -- conflict
end
redis.call('HSET', KEYS[1], 'state', ARGV[2])
return 1  -- success
internal/state/redis.go:123–133
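
The script has three reply shapes: 1, -1, and a two-element array {0, current}. A caller has to map each of them onto a Go result. The sketch below shows one way to do that. It assumes the Redis client decodes Lua integers as int64, Lua tables as []any, and bulk strings as string; interpretCASReply is a hypothetical helper, not a function from redis.go.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrAgentNotFound = errors.New("agent not found")

type StateConflictError struct{ Expected, Actual string }

func (e *StateConflictError) Error() string {
	return fmt.Sprintf("state conflict: expected %q, actual %q", e.Expected, e.Actual)
}

// interpretCASReply maps the script's reply onto an error value.
// Assumption: integers arrive as int64 and arrays as []any.
func interpretCASReply(reply any, expected string) error {
	switch v := reply.(type) {
	case int64:
		switch v {
		case 1:
			return nil // swap applied
		case -1:
			return ErrAgentNotFound
		}
	case []any:
		if len(v) == 2 {
			if code, ok := v[0].(int64); ok && code == 0 {
				actual, _ := v[1].(string)
				return &StateConflictError{Expected: expected, Actual: actual}
			}
		}
	}
	// Anything else is treated as a store-level failure rather than a conflict.
	return fmt.Errorf("unexpected CAS script reply: %#v", reply)
}

func main() {
	fmt.Println(interpretCASReply(int64(1), "idle"))                // <nil>
	fmt.Println(interpretCASReply(int64(-1), "idle"))               // agent not found
	fmt.Println(interpretCASReply([]any{int64(0), "busy"}, "idle")) // state conflict: expected "idle", actual "busy"
}
```

Treating unrecognized replies as generic errors keeps them on the store-error path instead of being mistaken for a benign conflict.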

clearTaskScript redis.go:144-151

Atomic task field clearing. The EXISTS guard prevents ghost hashes: without it, HSET would recreate a partial hash for an agent that was deleted concurrently.

local exists = redis.call('EXISTS', KEYS[1])
if exists == 0 then
    return -1  -- agent not found
end
redis.call('HSET', KEYS[1],
  'current_task_id', '',
  'current_task_priority', '0')
return 1  -- cleared
internal/state/redis.go:144–151
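
An in-memory equivalent of the same guard might look like the sketch below. The agentRecord type and its field names are assumptions that mirror the hash fields used by the script, and mockStore here is illustrative rather than the actual mock.go.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrAgentNotFound = errors.New("agent not found")

// agentRecord holds the fields the script touches; the Go field names are assumed.
type agentRecord struct {
	State               string
	CurrentTaskID       string
	CurrentTaskPriority int
}

type mockStore struct {
	mu     sync.Mutex
	agents map[string]*agentRecord
}

// ClearCurrentTask resets the task fields, but only for an agent that exists.
func (s *mockStore) ClearCurrentTask(agentID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	rec, ok := s.agents[agentID]
	if !ok {
		return ErrAgentNotFound // never create a record for a deleted agent
	}
	rec.CurrentTaskID = ""
	rec.CurrentTaskPriority = 0
	return nil
}

func main() {
	s := &mockStore{agents: map[string]*agentRecord{
		"agent-1": {State: "busy", CurrentTaskID: "task-9", CurrentTaskPriority: 3},
	}}
	fmt.Println(s.ClearCurrentTask("agent-1")) // <nil>
	fmt.Println(s.ClearCurrentTask("ghost"))   // agent not found
	fmt.Println(len(s.agents))                 // 1
}
```

The existence check and the write happen under one lock, just as the script runs EXISTS and HSET in one atomic step.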

Supervisor conflict resolution

CAS Success

State transitioned atomically. Supervisor proceeds with field writes (CurrentTaskID, priority) under per-agent mutex.

supervisor.go → tryAssignTask

*StateConflictError

Healthy concurrency — another goroutine won the race. Task re-enqueued at original priority. No backoff triggered. errors.As distinguishes from store errors.

supervisor.go:193–198

Store Error

Redis connectivity or unexpected failure. Triggers recordAssignError() with exponential backoff to protect degraded stores.

supervisor.go:322
Key design decision. CAS conflicts are not failures; they signal healthy concurrency. The supervisor skips recordAssignError() for *StateConflictError, preventing false-positive backoff that would throttle the assign loop during normal concurrent operation.
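
The three outcomes above can be sketched as a single classification step. This is a simplified illustration, not the code in supervisor.go: the action type, handleAssignResult, the assignErrors counter, and errStoreDown are hypothetical, and recordAssignError here only counts failures where the real one applies exponential backoff.

```go
package main

import (
	"errors"
	"fmt"
)

type StateConflictError struct{ Expected, Actual string }

func (e *StateConflictError) Error() string {
	return fmt.Sprintf("state conflict: expected %q, actual %q", e.Expected, e.Actual)
}

// errStoreDown is a hypothetical stand-in for a Redis connectivity failure.
var errStoreDown = errors.New("store unavailable")

type action int

const (
	actionProceed action = iota // CAS succeeded: continue with field writes
	actionRequeue               // CAS conflict: re-enqueue the task, no backoff
	actionBackoff               // store error: record it and back off
)

type supervisor struct{ assignErrors int }

// recordAssignError only counts failures in this sketch.
func (s *supervisor) recordAssignError() { s.assignErrors++ }

// handleAssignResult decides what to do with the error from ApplyEvent.
func (s *supervisor) handleAssignResult(err error) action {
	if err == nil {
		return actionProceed
	}
	var conflict *StateConflictError
	if errors.As(err, &conflict) { // also matches wrapped conflicts
		return actionRequeue
	}
	s.recordAssignError()
	return actionBackoff
}

func main() {
	s := &supervisor{}
	wrapped := fmt.Errorf("apply event: %w", &StateConflictError{Expected: "idle", Actual: "busy"})

	a := s.handleAssignResult(wrapped)
	fmt.Println(a == actionRequeue, s.assignErrors) // true 0

	a = s.handleAssignResult(errStoreDown)
	fmt.Println(a == actionBackoff, s.assignErrors) // true 1
}
```

errors.As matches the conflict even when a lower layer has wrapped it with extra context, which a plain type assertion would miss.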

Conformance + hermetic + concurrent

| Test | Layer | Strategy | What it proves |
|---|---|---|---|
| CAS_Success | conformance | Both MockStore + Redis | CAS swaps when expected matches |
| CAS_Conflict | conformance | Both MockStore + Redis | *StateConflictError with correct Expected/Actual |
| CAS_AgentNotFound | conformance | Both MockStore + Redis | ErrAgentNotFound for unknown agent |
| CAS_ConcurrentRace | conformance | 10 goroutines, exactly 1 wins | Atomicity under real concurrency |
| ApplyEvent_CASConflict | hermetic | racyStore injection | CAS conflict error-handling path fires correctly |
| ApplyEvent_ConcurrentCAS | concurrent | 10 goroutines, per-error counters | Exactly-one-winner under real scheduling |
| CASConflict_NoBackoff | supervisor | casConflictStore wrapper | CAS conflicts skip exponential backoff |
| CASGenericError_Backoff | supervisor | MockStore error hook | Non-CAS errors trigger backoff |
| ClearCurrentTask | conformance | Both MockStore + Redis | Atomic clear, ErrAgentNotFound for missing |
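
The exactly-one-winner property checked by the concurrent tests can be sketched as below. This is an illustration, not the actual test: raceCAS, memStore, and the "idle"/"busy" state names are hypothetical, and a real test would assert with the testing package instead of printing.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type StateConflictError struct{ Expected, Actual string }

func (e *StateConflictError) Error() string {
	return fmt.Sprintf("state conflict: expected %q, actual %q", e.Expected, e.Actual)
}

// memStore is a hypothetical mutex-guarded store.
type memStore struct {
	mu     sync.Mutex
	states map[string]string
}

func (s *memStore) cas(id, expected, next string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if current := s.states[id]; current != expected {
		return &StateConflictError{Expected: expected, Actual: current}
	}
	s.states[id] = next
	return nil
}

// raceCAS starts n goroutines that all attempt the same idle -> busy swap
// and counts how many succeed and how many observe a conflict.
func raceCAS(n int) (wins, conflicts int64) {
	s := &memStore{states: map[string]string{"agent-1": "idle"}}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			err := s.cas("agent-1", "idle", "busy")
			if err == nil {
				atomic.AddInt64(&wins, 1)
				return
			}
			if _, ok := err.(*StateConflictError); ok {
				atomic.AddInt64(&conflicts, 1)
			}
		}()
	}
	wg.Wait()
	return wins, conflicts
}

func main() {
	wins, conflicts := raceCAS(10)
	fmt.Println(wins, conflicts) // 1 9
}
```

Whichever goroutine the scheduler runs first wins; every later attempt finds "busy" where it expected "idle", so the counts are stable even though the winner is not.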

8 rounds of adversarial review

| Round | Verdict | Key Resolution |
|---|---|---|
| R1 | CONDITIONAL | Lua TOCTOU fix, test assertions, supervisor error handling |
| R2 | CONDITIONAL | Supervisor CAS integration test |
| R3 | CONDITIONAL | Backoff misclassification, silent error logging |
| R4 | FOR | All R1-R3 conditions verified resolved |
| R5 | CONDITIONAL | Doc comment fix, dead code removal, input validation |
| R6 | FOR | All R5 remediations verified |
| R7 | FOR | All R6 follow-ons verified, 3 new follow-ons |
| R8 | FOR | All R7 follow-ons verified. Chain closed. |

Implementation surface

State Layer

store.go — interface + CAS contract
redis.go — Lua scripts + CAS impl
mock.go — mutex-guarded CAS
errors.go — StateConflictError type
store_test.go — conformance suite

Agent Layer

machine.go — ApplyEvent with CAS
machine_test.go — hermetic + concurrent tests
state.go — state constants
transition.go — pure transition table

Supervisor Layer

supervisor.go — CAS conflict handling
supervisor_test.go — backoff + integration
errors.go — ErrInvalidAgentID
export_e2e.go — test exports