Why Redis over PostgreSQL, SQLite, or in-memory blackboard
Where does agent state live? AI agents need fast reads for decision-making, durable writes for crash recovery, and pub/sub for real-time UI updates. The state store is the shared memory of the entire system: every agent reads from it on each scheduling tick, every task transition writes to it, and the UI subscribes to it for live progress. A poor choice here either bottlenecks throughput on write contention or loses the audit trail on recovery.
Redis provides three complementary structures. Each maps to a distinct concern in the system and is accessed by a different code path.
Hash (agent:{id}): one hash per agent. The supervisor reads all agent hashes on each scheduling tick to detect stale heartbeats and reassign orphaned tasks.
Stream (tasks): append-only, ordered log of every task lifecycle event. Consumer groups spread processing across worker agents, delivering each entry to one consumer in the group.
Sorted Set (task_queue): pending task IDs ordered by priority. The supervisor pops the lowest score (highest priority) when an idle agent is available. The score encodes urgency and creation time.
Redis was chosen because its native data structures map one-to-one onto the three distinct access patterns the orchestrator requires. No object-relational mapping layer, no application-side priority logic, no polling loop in place of pub/sub — each pattern resolves to a single Redis primitive.
The supervisor checks agent liveness on every scheduling tick. With a Redis Hash per
agent, HGETALL agent:{id} returns the full agent struct in a single
round-trip, typically at sub-millisecond latency. A PostgreSQL row lookup still pays for
query parsing, planning, and the executor and buffer path even under connection pooling,
and touches disk whenever the page is not cached. The difference is not marginal: it is
the difference between a 0.3 ms and a 4 ms tick budget.
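The liveness check described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the `last_heartbeat` field name and the 15-second threshold are assumptions, and `client` stands for any object exposing `hgetall`, such as a redis-py client created with `decode_responses=True`.

```python
import time
from typing import Iterable, List, Optional

STALE_AFTER_SECONDS = 15.0  # assumed threshold, not taken from the design


def find_stale_agents(client, agent_ids: Iterable[str],
                      now: Optional[float] = None) -> List[str]:
    """Return the IDs of agents whose heartbeat is missing or too old."""
    now = time.time() if now is None else now
    stale = []
    for agent_id in agent_ids:
        # One HGETALL per agent; a missing key comes back as an empty dict.
        fields = client.hgetall(f"agent:{agent_id}")
        heartbeat = fields.get("last_heartbeat")  # assumed field name
        if heartbeat is None or now - float(heartbeat) > STALE_AFTER_SECONDS:
            stale.append(agent_id)
    return stale
```

Passing `now` explicitly keeps the function deterministic in tests; in the supervisor loop it would normally be left at its default.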
Every task state transition (created, assigned, completed, failed) is appended
to the tasks stream. Consumer groups hand each entry to one worker agent in the
group, so workers do not process the same entry concurrently. On restart, the
supervisor replays from the last acknowledged ID to reconstruct in-flight task state.
This is exactly the recovery semantic that PostgreSQL’s LISTEN/NOTIFY lacks, because
notifications are not persisted, and that a blackboard cannot provide at all.
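To make the replay step concrete, here is a small sketch that folds lifecycle events back into the set of unfinished tasks. The field names `task_id`, `event`, and `agent` are assumptions for illustration; the input is a sequence of `(entry_id, fields)` pairs, oldest first, which is the shape redis-py returns from a stream range read.

```python
from typing import Dict, Iterable, Tuple

# Events after which a task is no longer in flight (names from the text above).
TERMINAL_EVENTS = {"completed", "failed"}


def rebuild_in_flight(
    entries: Iterable[Tuple[str, Dict[str, str]]]
) -> Dict[str, Dict[str, str]]:
    """Replay lifecycle events in order and return the unfinished tasks."""
    in_flight: Dict[str, Dict[str, str]] = {}
    for entry_id, fields in entries:
        task_id = fields["task_id"]  # assumed field name
        event = fields["event"]      # assumed field name
        if event in TERMINAL_EVENTS:
            in_flight.pop(task_id, None)
        else:
            in_flight[task_id] = {
                "state": event,
                "agent": fields.get("agent", ""),
                "last_id": entry_id,
            }
    return in_flight
```

Because the function only consumes an iterable, the same logic can be exercised against recorded events without a running Redis instance.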
ZPOPMIN task_queue atomically dequeues the highest-priority pending task in
O(log N) time. The score encodes both urgency and insertion order, so ties break
deterministically without application-side sorting. SQLite requires a
SELECT ... ORDER BY ... LIMIT 1 inside a transaction to achieve the same
invariant, plus a separate DELETE: two statements with a write lock held.
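One possible way to pack urgency and creation time into a single score is shown below. The encoding is an assumption, not the scheme the system necessarily uses: priority occupies the high digits and the creation timestamp in milliseconds the low digits, with bounds chosen so the result stays below 2**53 and is therefore exact as a double. `pop_next_task` assumes a redis-py style client whose `zpopmin` returns a list of `(member, score)` pairs.

```python
PRIORITY_STRIDE = 10 ** 13  # exceeds any millisecond timestamp until year 2286
MAX_PRIORITY = 899          # keeps every score below 2**53


def task_score(priority: int, created_ms: int) -> float:
    """Lower score means dequeued earlier: priority first, then age."""
    if not 0 <= priority <= MAX_PRIORITY:
        raise ValueError("priority out of range")
    if not 0 <= created_ms < PRIORITY_STRIDE:
        raise ValueError("created_ms out of range")
    return float(priority * PRIORITY_STRIDE + created_ms)


def pop_next_task(client):
    """Atomically remove and return the next task ID, or None if empty."""
    popped = client.zpopmin("task_queue")  # [] when the set is empty
    return popped[0][0] if popped else None
```

With this layout, priority 0 is the most urgent, and two tasks of equal priority dequeue in creation order.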
The most common consistency concern is double-assignment: two supervisors racing to
assign the same task. Redis WATCH + MULTI/EXEC provides
optimistic locking on the relevant keys. If the transaction aborts (another writer
modified the watched key between WATCH and EXEC), the supervisor retries with fresh
state. This covers the critical path without requiring PostgreSQL’s full serializable
isolation on every read.
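The retry loop described above might look like the following sketch, written against redis-py's pipeline interface. The `task:{id}` key and `assignee` field are hypothetical names chosen for illustration, and the retry limit is an assumption. The fallback class only exists so the sketch can be read and run without redis-py installed.

```python
try:
    from redis import WatchError  # raised when a watched key changed before EXEC
except ImportError:               # fallback so the sketch runs without redis-py
    class WatchError(Exception):
        pass


def assign_task(client, task_id: str, agent_id: str, max_retries: int = 5) -> bool:
    """Assign a task to an agent unless another supervisor got there first."""
    key = f"task:{task_id}"  # hypothetical key layout
    for _ in range(max_retries):
        with client.pipeline() as pipe:
            try:
                pipe.watch(key)                # WATCH: reads now run immediately
                if pipe.hget(key, "assignee"):
                    return False               # already assigned elsewhere
                pipe.multi()                   # MULTI: start buffering writes
                pipe.hset(key, "assignee", agent_id)
                pipe.execute()                 # EXEC: fails if key was modified
                return True
            except WatchError:
                continue                       # lost the race, retry with fresh state
    return False
```

Returning `False` both for "already assigned" and for "retries exhausted" keeps the sketch short; a real supervisor would probably want to distinguish the two.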
A retention limit (such as MAXLEN on the Stream) must be set and tested before production use.
The WATCH/MULTI/EXEC optimistic locking pattern covers the critical assignment
path, but any operation touching more than one key family requires careful design.
Scenarios with complex multi-key invariants may require rethinking the key layout before
implementation.
Entries left pending by a crashed worker must be reclaimed with XAUTOCLAIM. This requires
an explicit dead-letter and redelivery policy in the supervisor loop that a simple
database-backed queue would not need.
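A dead-letter and redelivery policy of this kind can be expressed as a small pure function. The thresholds below are assumptions, and the input is a simplified view of the stream's pending entries as `(entry_id, idle_ms, times_delivered)` tuples rather than any specific client's return type.

```python
from typing import Iterable, List, Tuple

MIN_IDLE_MS = 30_000  # assumed: younger entries may still be in progress
MAX_DELIVERIES = 3    # assumed: after this many attempts, give up


def plan_redelivery(
    pending: Iterable[Tuple[str, int, int]]
) -> Tuple[List[str], List[str]]:
    """Split idle pending entries into (reclaim, dead_letter) ID lists."""
    reclaim: List[str] = []
    dead_letter: List[str] = []
    for entry_id, idle_ms, times_delivered in pending:
        if idle_ms < MIN_IDLE_MS:
            continue  # the current owner may still be working on it
        if times_delivered >= MAX_DELIVERIES:
            dead_letter.append(entry_id)
        else:
            reclaim.append(entry_id)
    return reclaim, dead_letter
```

In the supervisor loop, the reclaim list would then be claimed for a healthy worker, while dead-lettered entries would be recorded elsewhere and acknowledged so they stop reappearing.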
agent:{id} Hash → sub-ms liveness checks.
tasks Stream → replayable audit log with consumer group dispatch.
task_queue Sorted Set → atomic priority dequeue.
No application-layer data structure duplicates what Redis already provides natively.