Why gRPC + Redis Streams over Unix sockets, shared memory, or ZeroMQ
How should agents communicate with the orchestrator and each other? The IPC layer must support type-safe request/response, durable event streams, backpressure, and low latency — all while remaining debuggable. The transport choice determines what the control plane looks like, whether events survive crashes, and how hard it is to trace a message from agent spawn to task completion.
(Figure: per-transport latency comparison. Bars are log-normalised; raw latency spans seven orders of magnitude.) The orchestrator's control-plane calls are infrequent enough (task dispatch, status polls) that the 5–20 ms gRPC overhead is acceptable in exchange for schema safety and distributed traceability. Unix sockets are retained for the hot path where sub-millisecond latency matters.
The IPC architecture is split into three complementary layers, each matched to the latency and durability requirements of the messages it carries.
All orchestrator↔agent control messages travel over gRPC: task assignment, heartbeat, cancellation, and result acknowledgement. Protobuf contracts are compiled into Go stubs at build time, so an incompatible change fails the build rather than surfacing as a runtime panic at 3 am. HTTP/2 multiplexing means a single connection carries concurrent streams without head-of-line blocking.
Bidirectional streaming is used for long-running agent sessions; the orchestrator
pushes directive updates while the agent streams back progress deltas. gRPC
interceptors attach trace_id propagation to every call, giving
end-to-end distributed traces across agent boundaries with no agent-side effort.
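A contract for such a session might look like the sketch below. The service and message names (AgentControl, Directive, ProgressDelta) are illustrative, not the project's actual schema; the point is the shape — one bidirectional stream per session, with trace_id carried on every message:

```protobuf
syntax = "proto3";

package orchestrator.v1;

// Control-plane contract sketch (names are illustrative).
service AgentControl {
  // Long-running session: the orchestrator streams directives down,
  // the agent streams progress deltas back on the same HTTP/2 stream.
  rpc Session(stream ProgressDelta) returns (stream Directive);
}

message Directive {
  string task_id = 1;
  oneof kind {
    string assign_payload = 2; // serialized task spec
    bool   cancel         = 3;
  }
}

message ProgressDelta {
  string task_id  = 1;
  string trace_id = 2; // attached by a client-side interceptor
  uint32 percent  = 3;
  string status   = 4;
}
```

Because the oneof is closed, an orchestrator that sends a directive kind the agent's compiled stub does not know about surfaces as an unknown field rather than a crash, which is the failure mode schema compilation is buying.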
Asynchronous domain events (agent lifecycle, tool invocations, LLM token usage, audit entries) are written to Redis Streams. Consumer groups give each downstream subscriber its own cursor, so the metrics service, audit logger, and reactive UI feed all consume independently without blocking each other or the agent.
Pending-entry lists and XACK ensure at-least-once delivery even when
a consumer crashes mid-processing. Retention is capped via MAXLEN ~
to bound memory while preserving a rolling audit window. Redis is already in the
stack for distributed locking (Decision #1), so this is not a new dependency.
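The consumer-group mechanics above can be sketched as a redis-cli session. The stream and group names (agent.events, metrics) are illustrative; the commands are standard Redis Streams operations, run inside redis-cli rather than a shell:

```
# Create the consumer group; MKSTREAM creates the stream if absent.
XGROUP CREATE agent.events metrics $ MKSTREAM

# Append an event with approximate trimming to bound memory.
XADD agent.events MAXLEN ~ 100000 * type tool.invoked agent a42

# Each group reads with its own cursor (">" = entries never delivered to this group).
XREADGROUP GROUP metrics worker-1 COUNT 10 BLOCK 5000 STREAMS agent.events >

# Acknowledge after processing; unacked entries remain in the pending-entry list.
XACK agent.events metrics 1700000000000-0

# Inspect the pending-entry list to find entries owned by crashed consumers.
XPENDING agent.events metrics
```

The at-least-once guarantee falls out of the last two commands: an entry leaves the pending-entry list only on XACK, so a consumer that dies mid-processing leaves a visible claim that another worker can recover with XCLAIM or XAUTOCLAIM.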
A small set of latency-critical paths — notably the local tool-execution bridge and in-process agent sidecar — bypass gRPC and communicate over Unix domain sockets with a minimal length-prefixed framing protocol. This keeps tool-call overhead below 200 µs while the rest of the system benefits from gRPC's observability.
The UDS layer is internal-only; no external agent ever connects to it. The interface is typed via a shared Go struct rather than a proto file, keeping the hot-path evolution separate from the versioned control-plane contract.
Schema discipline is not free. Every contract change must flow through the
protobuf codegen toolchain, pinned by buf.gen.yaml and CI enforcement, but it adds
friction for contributors who only want to change business logic.
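A minimal buf.gen.yaml for this setup might look like the following sketch (v2 config format; plugin choice and output paths are assumptions, not the project's actual file):

```yaml
# buf.gen.yaml — pins codegen plugins so local builds and CI agree.
version: v2
plugins:
  - remote: buf.build/protocolbuffers/go
    out: gen/go
    opt: paths=source_relative
  - remote: buf.build/grpc/go
    out: gen/go
    opt: paths=source_relative
```

Pinning remote plugins means a contributor's `buf generate` and the CI job produce byte-identical stubs, which is what makes "incompatible change fails the build" enforceable rather than aspirational.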
The streams layer carries its own operational chores: creating consumer
groups idempotently, handling NOGROUP errors on first
start, and configuring retention policy. A helper package wraps this once; all agents
import it rather than calling Redis directly.
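A sketch of such a helper, assuming the go-redis v9 client; the package, function, and stream names (events, EnsureGroup, the MaxLen value) are illustrative, and the code needs a running Redis plus the github.com/redis/go-redis/v9 module, so it is shown for shape only:

```go
package events

import (
	"context"
	"strings"

	"github.com/redis/go-redis/v9"
)

// EnsureGroup idempotently creates the consumer group. MKSTREAM creates
// the stream itself if absent, which is what prevents NOGROUP on first
// start; the BUSYGROUP error Redis returns when the group already exists
// is absorbed so repeated startups are safe.
func EnsureGroup(ctx context.Context, rdb *redis.Client, stream, group string) error {
	err := rdb.XGroupCreateMkStream(ctx, stream, group, "$").Err()
	if err != nil && !strings.Contains(err.Error(), "BUSYGROUP") {
		return err
	}
	return nil
}

// Publish appends an event with approximate trimming (MAXLEN ~) so the
// stream stays a bounded rolling window.
func Publish(ctx context.Context, rdb *redis.Client, stream string, fields map[string]any) error {
	return rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: stream,
		MaxLen: 100_000,
		Approx: true,
		Values: fields,
	}).Err()
}
```

Centralizing this in one package also gives a single place to change the retention number or swap the trimming strategy without touching any agent.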