Decision #3 — agentic-orchestrator

State Management

Why Redis over PostgreSQL, SQLite, or in-memory blackboard

The Question

Where does agent state live? AI agents need fast reads for decision-making, durable writes for crash recovery, and pub/sub for real-time UI updates. The state store is the shared memory of the entire system: every agent reads from it on each scheduling tick, every task transition writes to it, and the UI subscribes to it for live progress. A poor choice here either bottlenecks throughput on write contention or loses the audit trail on recovery.

Options Considered

Redis (Streams + Hashes + Sorted Sets)
Sub-ms reads, event sourcing via Streams, multiple data structures for task queuing
Chosen
Pros
  • Sub-millisecond read latency for agent heartbeat and state polling
  • Streams provide replayable, append-only audit log of all task events
  • Consumer groups enable competing-consumer agent pools without duplicate work
  • Sorted Sets give O(log N) priority queue for task scheduling
  • Hashes map directly to agent state structs — no serialization overhead
Cons
  • In-memory cost scales with total state size — requires capacity planning
  • Eventual consistency in cluster mode; single-node is the primary target
PostgreSQL
Full ACID relational database with mature Go driver ecosystem
Rejected
Pros
  • Full ACID transactions eliminate any consistency concern
  • Mature Go drivers (pgx) and battle-tested in production systems
  • LISTEN/NOTIFY for basic pub/sub without a second process
Cons
  • Row-level locking makes high-frequency event log appends a contention hotspot
  • Disk I/O on every write — unacceptable latency for sub-100ms scheduling ticks
  • LISTEN/NOTIFY is not a stream; no consumer group, no replay, no cursor
SQLite
Embedded file-backed database — zero infrastructure, file-based persistence
Rejected
Pros
  • No separate server process — single binary deployment story preserved
  • Fast reads for small datasets under WAL mode
Cons
  • Single-writer constraint blocks concurrent agent state updates
  • No native pub/sub; polling-based change detection only
  • No priority queue primitive — sorted task queuing requires application logic
  • Write amplification under WAL with many small, high-frequency writes
Blackboard (In-Memory)
Shared Go struct with RWMutex — classical AI blackboard architecture
Rejected
Pros
  • Lowest possible read latency — direct memory access within the process
  • Zero infrastructure, zero serialization cost
Cons
  • All state is lost on process crash — no durability without custom checkpointing
  • No audit trail; event history is unrecoverable after the fact
  • Scales to only one process; no multi-node agent distribution path
  • RWMutex contention increases with agent count
CRDT Store
Conflict-free replicated data types for distributed agent state merging
Rejected
Pros
  • Merge semantics eliminate write conflicts in distributed agent topologies
  • Strong eventual consistency without coordination overhead
Cons
  • Not all task state operations are CRDT-compatible (e.g. priority re-ordering)
  • Mature Go CRDT libraries are sparse; significant implementation risk
  • Complexity far exceeds the single-node deployment target

Data Model

Redis provides three complementary structures. Each maps to a distinct concern in the system and is accessed by a different code path.

Hash agent:{id} → current state

One hash per agent. Supervisor reads all agent hashes on each scheduling tick to detect stale heartbeats and re-assign orphaned tasks.

state idle | working | crashed | draining
heartbeat_at Unix timestamp, updated every 5s
current_task task ID or empty string
metadata JSON blob (capabilities, version)
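The "no serialization overhead" claim can be made concrete: the hash fields above map one-to-one onto struct fields. A minimal, stdlib-only Go sketch of that mapping, using the documented field names (the struct layout itself is illustrative, not the orchestrator's actual type):

```go
package main

import (
	"fmt"
	"strconv"
)

// AgentState mirrors the fields of the agent:{id} hash.
// The struct name and layout are illustrative.
type AgentState struct {
	State       string // idle | working | crashed | draining
	HeartbeatAt int64  // Unix timestamp, refreshed every 5s
	CurrentTask string // task ID or empty string
	Metadata    string // JSON blob (capabilities, version)
}

// toHash flattens the struct into the field/value map that HSET accepts.
func (a AgentState) toHash() map[string]string {
	return map[string]string{
		"state":        a.State,
		"heartbeat_at": strconv.FormatInt(a.HeartbeatAt, 10),
		"current_task": a.CurrentTask,
		"metadata":     a.Metadata,
	}
}

// fromHash rebuilds the struct from an HGETALL reply.
func fromHash(h map[string]string) (AgentState, error) {
	hb, err := strconv.ParseInt(h["heartbeat_at"], 10, 64)
	if err != nil {
		return AgentState{}, fmt.Errorf("bad heartbeat_at: %w", err)
	}
	return AgentState{
		State:       h["state"],
		HeartbeatAt: hb,
		CurrentTask: h["current_task"],
		Metadata:    h["metadata"],
	}, nil
}

func main() {
	in := AgentState{State: "working", HeartbeatAt: 1700000000, CurrentTask: "task-42"}
	out, err := fromHash(in.toHash())
	fmt.Println(err == nil && out == in) // round-trip preserves every field
}
```

The only conversion needed is integer-to-string for the timestamp; there is no ORM layer and no schema migration to maintain.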
Stream tasks → event log

Append-only ordered log of every task lifecycle event. Consumer groups distribute processing across worker agents without duplicate delivery.

event_type created | assigned | completed | failed
task_id UUID of the affected task
agent_id Agent that produced the event
payload JSON output or error detail
Sorted Set task_queue → pending tasks

Priority-ordered pending task IDs. Supervisor pops the lowest score (highest priority) when an idle agent is available. Score encodes urgency and creation time.

member task ID (UUID string)
score priority × 10^10 + epoch_ns
pop method ZPOPMIN for atomic dequeue
visibility ZRANGE for UI progress view

The Decision

Redis was chosen because its native data structures map one-to-one onto the three distinct access patterns the orchestrator requires. No object-relational mapping layer, no application-side priority logic, no polling loop in place of pub/sub — each pattern resolves to a single Redis primitive.

Hashes replace struct polling

The supervisor checks agent liveness on every scheduling tick. With a Redis Hash per agent, HGETALL agent:{id} returns the full agent struct in a single round-trip at sub-millisecond latency. A PostgreSQL row requires a full query parse, plan, and disk I/O path even under connection pooling. The difference is not marginal — it is the difference between a 0.3 ms and a 4 ms tick budget.

Streams replace a hand-rolled event log table

Every task state transition — created, assigned, completed, failed — is appended to the tasks stream. Consumer groups distribute these entries to worker agents without duplicate processing. On restart, the supervisor replays from the last acknowledged ID to reconstruct in-flight task state. This is exactly the recovery semantic that PostgreSQL’s LISTEN/NOTIFY lacks and a blackboard cannot provide at all.
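The restart path is a fold over stream entries. A stdlib-only Go model of that replay, using the documented event_type values; the real supervisor would pull entries via XREADGROUP rather than iterate a slice:

```go
package main

import "fmt"

// taskEvent mirrors the documented stream entry fields.
type taskEvent struct {
	EventType string // created | assigned | completed | failed
	TaskID    string
	AgentID   string
}

// replay folds the event log back into in-flight task state: tasks that
// were assigned but never reached a terminal event. This models what the
// supervisor reconstructs after a crash by re-reading the stream from the
// last acknowledged ID.
func replay(events []taskEvent) map[string]string { // task ID -> owning agent
	inflight := map[string]string{}
	for _, e := range events {
		switch e.EventType {
		case "assigned":
			inflight[e.TaskID] = e.AgentID
		case "completed", "failed":
			delete(inflight, e.TaskID)
		}
	}
	return inflight
}

func main() {
	log := []taskEvent{
		{"created", "t1", ""},
		{"assigned", "t1", "agent:1"},
		{"created", "t2", ""},
		{"assigned", "t2", "agent:2"},
		{"completed", "t1", "agent:1"},
	}
	fmt.Println(replay(log)) // map[t2:agent:2] -- t2 is still in flight
}
```

Because the stream is append-only, the fold is deterministic: the same log always reconstructs the same in-flight set, which is the recovery guarantee the section above describes.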

Sorted Sets replace a priority queue implementation

ZPOPMIN task_queue atomically dequeues the highest-priority pending task in O(log N) time. The score encodes both urgency and insertion order, so ties break deterministically without application-side sorting. SQLite requires a full SELECT ... ORDER BY ... LIMIT 1 inside a transaction to achieve the same invariant, plus a separate DELETE — two round-trips with a write lock held.

Compare-and-Set mitigates the lack of full ACID

The most common consistency concern is double-assignment: two supervisors racing to assign the same task. Redis WATCH + MULTI/EXEC provides optimistic locking on the relevant keys. If the transaction aborts (another writer modified the watched key between WATCH and EXEC), the supervisor retries with fresh state. This covers the critical path without requiring PostgreSQL’s full serializable isolation on every read.

Trade-offs Acknowledged

In-memory cost.  All agent state and the full Stream history reside in RAM. For the expected scale of tens to hundreds of agents and thousands of tasks per session, total memory stays well under 100 MB. Long-running deployments accumulate Stream history; a trim policy (MAXLEN on the Stream) must be set and tested before production use.
No full ACID transactions.  Redis does not provide serializable isolation across multiple keys. The WATCH/MULTI/EXEC optimistic locking pattern covers the critical assignment path, but any operation touching more than one key family requires careful design. Scenarios with complex multi-key invariants may require rethinking the key layout before implementation.
Additional infrastructure dependency.  The project no longer ships as a single binary with zero external dependencies. Redis must be installed, started, and monitored alongside the orchestrator. For the desktop deployment target, this means either bundling a Redis binary or requiring a local Redis installation — a setup step that PostgreSQL or SQLite would not add.
Consumer group management.  Redis Streams consumer groups must be created before any consumer reads from them. If a consumer crashes mid-read, its pending entries (PEL) accumulate until explicitly acknowledged or claimed by another consumer via XAUTOCLAIM. This requires an explicit dead-letter and redelivery policy in the supervisor loop that a simple database-backed queue would not need.
Mapping summary: Each of the three Redis structures resolves one access pattern precisely. agent:{id} Hash → sub-ms liveness checks. tasks Stream → replayable audit log with consumer group dispatch. task_queue Sorted Set → atomic priority dequeue. No application-layer data structure duplicates what Redis already provides natively.