Decision #3 — agenKic-orKistrator

State Management

Why Redis over PostgreSQL, SQLite, an in-memory blackboard, or a CRDT store

The Question

Where does agent state live? AI agents need fast reads for decision-making, durable writes for crash recovery, and pub/sub for real-time UI updates. The state store is the shared memory of the entire system: every agent reads from it on each scheduling tick, every task transition writes to it, and the UI subscribes to it for live progress. A poor choice here either bottlenecks throughput on write contention or loses the audit trail on recovery.

Options Considered

Redis (Streams + Hashes + Sorted Sets)

Sub-ms reads, event sourcing via Streams, multiple data structures for task queuing

Chosen

Pros

Sub-millisecond read latency for agent heartbeat and state polling
Streams provide replayable, append-only audit log of all task events
Consumer groups enable competing-consumer agent pools: each entry is delivered to one consumer in the group
Sorted Sets give O(log N) priority queue for task scheduling
Hash fields map one-to-one onto agent state struct fields, with no ORM or schema layer in between

Cons

In-memory cost scales with total state size — requires capacity planning
Replication is asynchronous, so cluster mode offers only eventual consistency; single-node is the primary target

PostgreSQL

Full ACID relational database with mature Go driver ecosystem

Rejected

Pros

Full ACID transactions eliminate any consistency concern
Mature Go drivers (pgx) and battle-tested in production systems
LISTEN/NOTIFY for basic pub/sub without a second process

Cons

High-frequency event log appends contend on WAL flushes and index maintenance, making the log table a hotspot
Disk I/O on every write — unacceptable latency for sub-100ms scheduling ticks
LISTEN/NOTIFY is not a stream; no consumer group, no replay, no cursor

SQLite

Embedded file-backed database — zero infrastructure, file-based persistence

Rejected

Pros

No separate server process — single binary deployment story preserved
Fast reads for small datasets under WAL mode

Cons

Single-writer constraint blocks concurrent agent state updates
No native pub/sub; polling-based change detection only
No priority queue primitive — sorted task queuing requires application logic
Write amplification under WAL when many small rows are written at high frequency

Blackboard (In-Memory)

Shared Go struct with RWMutex — classical AI blackboard architecture

Rejected

Pros

Lowest possible read latency — direct memory access within the process
Zero infrastructure, zero serialization cost

Cons

All state is lost on process crash — no durability without custom checkpointing
No audit trail; event history is unrecoverable after the fact
Confined to a single process; no path to multi-node agent distribution
RWMutex contention increases with agent count

CRDT Store

Conflict-free replicated data types for distributed agent state merging

Rejected

Pros

Merge semantics eliminate write conflicts in distributed agent topologies
Strong eventual consistency without coordination overhead

Cons

Not all task state operations are CRDT-compatible (e.g. priority re-ordering)
Mature Go CRDT libraries are sparse; significant implementation risk
Complexity far exceeds the single-node deployment target

Data Model

Redis provides three complementary structures. Each maps to a distinct concern in the system and is accessed by a different code path.

Hash agent:{id} → current state

One hash per agent. Supervisor reads all agent hashes on each scheduling tick to detect stale heartbeats and re-assign orphaned tasks.

state: idle | working | crashed | draining

heartbeat_at: Unix timestamp, updated every 5s

current_task: task ID or empty string

metadata: JSON blob (capabilities, version)
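
The field list above could map onto a Go struct roughly as follows. This is a minimal sketch, not the project's actual type: the names AgentState, ToHash, and FromHash are hypothetical, and heartbeat_at is assumed to be stored as decimal Unix seconds. The maps stand in for what an HSET call would take and what an HGETALL reply would decode to.

```go
package main

import (
	"fmt"
	"strconv"
)

// AgentState mirrors the four fields of the agent:{id} hash.
// The type and function names are hypothetical.
type AgentState struct {
	State       string // idle | working | crashed | draining
	HeartbeatAt int64  // Unix timestamp in seconds (assumed encoding)
	CurrentTask string // task ID, or "" when the agent has none
	Metadata    string // JSON blob, kept opaque in this sketch
}

// ToHash flattens the struct into the field/value pairs an HSET would write.
func (a AgentState) ToHash() map[string]string {
	return map[string]string{
		"state":        a.State,
		"heartbeat_at": strconv.FormatInt(a.HeartbeatAt, 10),
		"current_task": a.CurrentTask,
		"metadata":     a.Metadata,
	}
}

// FromHash rebuilds the struct from the field map an HGETALL reply decodes to.
func FromHash(h map[string]string) (AgentState, error) {
	ts, err := strconv.ParseInt(h["heartbeat_at"], 10, 64)
	if err != nil {
		return AgentState{}, fmt.Errorf("heartbeat_at: %w", err)
	}
	return AgentState{
		State:       h["state"],
		HeartbeatAt: ts,
		CurrentTask: h["current_task"],
		Metadata:    h["metadata"],
	}, nil
}

func main() {
	a := AgentState{State: "working", HeartbeatAt: 1700000000, CurrentTask: "task-1", Metadata: `{"version":"1"}`}
	b, err := FromHash(a.ToHash())
	fmt.Println(b == a, err)
}
```

Every hash value is a string, so the timestamp still needs an explicit encode and decode step even though no schema layer is involved.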

Stream tasks → event log

Append-only ordered log of every task lifecycle event. Consumer groups distribute processing across worker agents, delivering each entry to one consumer in the group.

event_type: created | assigned | completed | failed

task_id: UUID of the affected task

agent_id: Agent that produced the event

payload: JSON output or error detail

Sorted Set task_queue → pending tasks

Priority-ordered pending task IDs. Supervisor pops the lowest score (highest priority) when an idle agent is available. Score encodes urgency and creation time.

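member: task ID (UUID string)

score: priority × 10¹³ + epoch_ms (priority 0 to 899, so the value stays below 2⁵³ and is stored exactly in Redis's double-precision score)

pop method: ZPOPMIN for atomic dequeue

visibility: ZRANGE for UI progress view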
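
A score that combines priority and creation time could be computed as in the sketch below. It rests on assumptions that are not taken from the project: millisecond timestamps, a multiplier of 10¹³ so that the priority term always dominates the timestamp term, and a priority range of 0 to 899 so that every score is an integer below 2⁵³, which a float64 (the type Redis uses for scores) represents exactly. The names Score, priorityFactor, and maxPriority are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	priorityFactor int64 = 10_000_000_000_000 // 10^13, above any epoch-ms value until the year 2286
	maxPriority    int64 = 899                // keeps every score below 2^53
)

// Score packs priority and creation time into a single sorted-set score.
// Lower priority numbers sort first; within one priority, older tasks sort first.
func Score(priority, createdMs int64) (float64, error) {
	if priority < 0 || priority > maxPriority {
		return 0, errors.New("priority out of range")
	}
	if createdMs < 0 || createdMs >= priorityFactor {
		return 0, errors.New("timestamp out of range")
	}
	return float64(priority*priorityFactor + createdMs), nil
}

func main() {
	urgent, _ := Score(0, 1_700_000_000_500) // created later, but more urgent
	normal, _ := Score(1, 1_700_000_000_000)
	fmt.Println(urgent < normal) // true: ZPOPMIN would return the urgent task first
}
```

Two tasks with the same priority and the same millisecond receive equal scores; Redis then orders them lexicographically by member, so the dequeue order remains deterministic.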

The Decision

Redis was chosen because its native data structures map one-to-one onto the three distinct access patterns the orchestrator requires. No object-relational mapping layer, no application-side priority logic, no polling loop in place of pub/sub — each pattern resolves to a single Redis primitive.

HSH

Hashes replace struct polling

The supervisor checks agent liveness on every scheduling tick. With a Redis Hash per agent, HGETALL agent:{id} returns the full agent struct in a single round-trip at sub-millisecond latency, and the reads for all agents can be pipelined into one batch. A PostgreSQL row read goes through query parsing, planning, and execution even under connection pooling, plus disk I/O whenever the page is not cached. The difference is not marginal: it is the difference between a tick budget of roughly 0.3 ms and one of roughly 4 ms.
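
Once the agent hashes have been read, the liveness check itself is a small pure function. The sketch below assumes the hashes were already decoded into a map keyed by agent ID, and it assumes a 15-second timeout (three missed 5-second heartbeats); both the timeout and the names Agent, StaleAgents, and OrphanedTasks are illustrative rather than taken from the project.

```go
package main

import (
	"fmt"
	"sort"
)

// Agent holds the subset of agent:{id} hash fields the liveness check needs.
type Agent struct {
	State       string
	HeartbeatAt int64 // Unix seconds
	CurrentTask string
}

// StaleAgents returns the sorted IDs of agents whose last heartbeat is more
// than timeoutSec seconds before now.
func StaleAgents(agents map[string]Agent, now, timeoutSec int64) []string {
	var stale []string
	for id, a := range agents {
		if now-a.HeartbeatAt > timeoutSec {
			stale = append(stale, id)
		}
	}
	sort.Strings(stale)
	return stale
}

// OrphanedTasks lists the tasks still attached to stale agents, in agent-ID order.
func OrphanedTasks(agents map[string]Agent, stale []string) []string {
	var tasks []string
	for _, id := range stale {
		if t := agents[id].CurrentTask; t != "" {
			tasks = append(tasks, t)
		}
	}
	return tasks
}

func main() {
	agents := map[string]Agent{
		"a1": {State: "working", HeartbeatAt: 1000, CurrentTask: "task-1"},
		"a2": {State: "working", HeartbeatAt: 980, CurrentTask: "task-2"},
		"a3": {State: "idle", HeartbeatAt: 970},
	}
	stale := StaleAgents(agents, 1000, 15)
	fmt.Println(stale, OrphanedTasks(agents, stale)) // [a2 a3] [task-2]
}
```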

STR

Streams replace a hand-rolled event log table

Every task state transition (created, assigned, completed, failed) is appended to the tasks stream. Consumer groups distribute these entries across worker agents so that each entry goes to one consumer in the group. On restart, the supervisor replays from the last acknowledged ID to reconstruct in-flight task state. This is exactly the recovery semantic that PostgreSQL’s LISTEN/NOTIFY lacks and a blackboard cannot provide at all.
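
The replay step amounts to folding the event log into a map of in-flight tasks. The sketch below assumes the stream entries have already been read and decoded into a slice in stream order; the TaskEvent type and the InFlight function are hypothetical names, and the payload field is omitted because the fold does not need it.

```go
package main

import "fmt"

// TaskEvent carries the stream fields the replay needs.
type TaskEvent struct {
	EventType string // created | assigned | completed | failed
	TaskID    string
	AgentID   string
}

// InFlight replays events in stream order and returns, for every task that was
// assigned but has neither completed nor failed, the agent it was assigned to.
func InFlight(events []TaskEvent) map[string]string {
	inflight := make(map[string]string)
	for _, e := range events {
		switch e.EventType {
		case "assigned":
			inflight[e.TaskID] = e.AgentID
		case "completed", "failed":
			delete(inflight, e.TaskID)
		}
	}
	return inflight
}

func main() {
	events := []TaskEvent{
		{EventType: "created", TaskID: "t1"},
		{EventType: "assigned", TaskID: "t1", AgentID: "a1"},
		{EventType: "created", TaskID: "t2"},
		{EventType: "assigned", TaskID: "t2", AgentID: "a2"},
		{EventType: "completed", TaskID: "t1", AgentID: "a1"},
	}
	fmt.Println(InFlight(events)) // map[t2:a2]
}
```

Because the fold only depends on event order, replaying the same range twice yields the same result, which is what makes restart recovery safe to repeat.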

ZST

Sorted Sets replace a priority queue implementation

ZPOPMIN task_queue atomically dequeues the highest-priority pending task in O(log N) time. The score encodes both urgency and insertion order, so ties break deterministically without application-side sorting. SQLite requires a SELECT ... ORDER BY ... LIMIT 1 followed by a separate DELETE inside one transaction to achieve the same invariant: two statements executed while the database-wide write lock is held.

CAS

Compare-and-Set mitigates the lack of full ACID

The most common consistency concern is double-assignment: two supervisors racing to assign the same task. Redis WATCH + MULTI/EXEC provides optimistic locking on the relevant keys. If the transaction aborts (another writer modified the watched key between WATCH and EXEC), the supervisor retries with fresh state. This covers the critical path without requiring PostgreSQL’s full serializable isolation on every read.
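
The retry loop around WATCH and EXEC can be illustrated without a Redis connection. The sketch below uses a small in-memory store as a stand-in: watch plays the role of WATCH plus a read, and exec plays the role of MULTI/EXEC, failing when the key changed in between. It is a model of the pattern, not a Redis client, and the key name task:42:owner, the store type, and the Assign function are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrConflict models an EXEC that aborted because a watched key changed.
var ErrConflict = errors.New("watched key changed before EXEC")

// store is an in-memory stand-in for Redis with a version counter per key.
type store struct {
	mu       sync.Mutex
	values   map[string]string
	versions map[string]int
}

func newStore() *store {
	return &store{values: map[string]string{}, versions: map[string]int{}}
}

// watch returns the current value and version of key (WATCH followed by a read).
func (s *store) watch(key string) (string, int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.values[key], s.versions[key]
}

// exec writes value only if key is still at the watched version (MULTI/EXEC).
func (s *store) exec(key string, watched int, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.versions[key] != watched {
		return ErrConflict
	}
	s.values[key] = value
	s.versions[key]++
	return nil
}

// Assign claims taskKey for agentID unless the task already has an owner.
// On a conflict it re-reads fresh state and tries again, up to maxAttempts.
func Assign(s *store, taskKey, agentID string, maxAttempts int) (bool, error) {
	for i := 0; i < maxAttempts; i++ {
		owner, ver := s.watch(taskKey)
		if owner != "" {
			return false, nil // another supervisor won the race
		}
		if err := s.exec(taskKey, ver, agentID); err == nil {
			return true, nil
		}
	}
	return false, ErrConflict
}

func main() {
	s := newStore()
	won, _ := Assign(s, "task:42:owner", "agent-a", 3)
	lost, _ := Assign(s, "task:42:owner", "agent-b", 3)
	fmt.Println(won, lost) // true false
}
```

Bounding the number of attempts keeps a supervisor from spinning indefinitely under heavy contention; what it should do after giving up is a policy choice this sketch leaves open.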

Trade-offs Acknowledged

In-memory cost. All agent state and the full Stream history reside in RAM. For the expected scale of tens to hundreds of agents and thousands of tasks per session, total memory stays well under 100 MB. Long-running deployments accumulate Stream history; a trim policy (the MAXLEN option of XADD or XTRIM) must be set and tested before production use, with the limit chosen so that entries still needed for replay are not trimmed.
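
One way to apply a trim policy is to cap the stream on every append. The sketch below only builds the argument list for such an XADD call; how the arguments are sent depends on the Redis client in use. The function name XAddArgs and the cap of 10000 entries are illustrative assumptions, not values from the project.

```go
package main

import (
	"fmt"
	"strconv"
)

// XAddArgs builds the arguments of an XADD that appends one entry and trims
// the stream to roughly maxLen entries. "~" asks for approximate trimming,
// and "*" asks Redis to generate the entry ID.
func XAddArgs(stream string, maxLen int64, fields [][2]string) []string {
	args := []string{"XADD", stream, "MAXLEN", "~", strconv.FormatInt(maxLen, 10), "*"}
	for _, f := range fields {
		args = append(args, f[0], f[1])
	}
	return args
}

func main() {
	args := XAddArgs("tasks", 10000, [][2]string{
		{"event_type", "created"},
		{"task_id", "task-1"},
	})
	fmt.Println(args)
}
```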

No full ACID transactions. A MULTI/EXEC block runs without interleaving from other clients, but Redis has no rollback when a queued command fails at runtime and no way to branch on a value read inside the transaction. The WATCH/MULTI/EXEC optimistic locking pattern covers the critical assignment path, but any operation touching more than one key family requires careful design. Scenarios with complex multi-key invariants may require rethinking the key layout before implementation.

Additional infrastructure dependency. The project no longer ships as a single binary with zero external dependencies. Redis must be installed, started, and monitored alongside the orchestrator. For the desktop deployment target, this means either bundling a Redis binary or requiring a local Redis installation, a setup step that SQLite or the in-memory blackboard would not add.

Consumer group management. Redis Streams consumer groups must be created before any consumer reads from them. If a consumer crashes mid-read, its delivered but unacknowledged entries stay in the group's Pending Entries List (PEL) until they are acknowledged with XACK, typically after another consumer has claimed them via XAUTOCLAIM. This requires an explicit dead-letter and redelivery policy in the supervisor loop that a simple database-backed queue would not need.
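
A redelivery policy can be expressed as a decision over each pending entry. The sketch below assumes the supervisor has already fetched the pending list (the extended XPENDING reply reports, per entry, the owning consumer, the idle time in milliseconds, and the delivery count). The thresholds and the names Pending, Action, and Decide are hypothetical.

```go
package main

import "fmt"

// Pending mirrors one row of an extended XPENDING reply.
type Pending struct {
	ID         string
	Consumer   string
	IdleMs     int64 // milliseconds since the entry was last delivered
	Deliveries int64 // how many times the entry has been delivered
}

// Action is what the supervisor should do with a pending entry.
type Action int

const (
	Wait       Action = iota // the consumer may still be working on it
	Reclaim                  // hand it to another consumer, e.g. via XAUTOCLAIM
	DeadLetter               // give up: record the failure, then XACK the entry
)

// Decide applies a simple policy: leave recent entries alone, dead-letter
// entries that were already delivered maxDeliveries times, reclaim the rest.
func Decide(p Pending, minIdleMs, maxDeliveries int64) Action {
	switch {
	case p.IdleMs < minIdleMs:
		return Wait
	case p.Deliveries >= maxDeliveries:
		return DeadLetter
	default:
		return Reclaim
	}
}

func main() {
	p := Pending{ID: "1700000000000-0", Consumer: "agent-a", IdleMs: 45000, Deliveries: 1}
	fmt.Println(Decide(p, 30000, 3) == Reclaim) // true
}
```

Checking idle time first means an entry is never dead-lettered while its current consumer might still be processing it.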

Mapping summary: Each of the three Redis structures resolves one access pattern precisely. agent:{id} Hash → sub-ms liveness checks. tasks Stream → replayable audit log with consumer group dispatch. task_queue Sorted Set → atomic priority dequeue. No application-layer data structure duplicates what Redis already provides natively.