Decision #2 — agentic-orchestrator

Inter-Process Communication

Why gRPC + Redis Streams over Unix sockets, shared memory, or ZeroMQ

The Question

How should agents communicate with the orchestrator and each other? The IPC layer must support type-safe request/response, durable event streams, backpressure, and low latency — all while remaining debuggable. The transport choice determines what the control plane looks like, whether events survive crashes, and how hard it is to trace a message from agent spawn to task completion.

Options Considered

gRPC + Redis Streams
Type-safe proto contracts with Redis for durable event delivery
Chosen
Pros
  • Protobuf codegen enforces contract at compile time across all agent boundaries
  • gRPC streaming provides native backpressure via HTTP/2 flow control
  • Redis consumer groups give each agent an independent, resumable cursor
  • Full audit log of every event without extra instrumentation
Cons
  • 5–20 ms round-trip adds latency versus local sockets
  • Proto compilation step required in every agent build
Unix Domain Sockets
Kernel-mediated local IPC with ~130 µs latency
Partial
Pros
  • Lowest achievable latency for local same-host communication
  • Zero serialization overhead when paired with length-prefixed frames
Cons
  • No schema or contract enforcement — breakage is silent
  • No built-in durability; a crashed consumer loses in-flight messages
Shared Memory
POSIX mmap with semaphore coordination, sub-100 ns
Rejected
Pros
  • Fastest possible transport — sub-100 ns memory copy
Cons
  • POSIX semaphore coordination introduces deadlock surface area
  • Debugging a corrupted shared segment requires kernel-level tooling
  • Unavailable for cross-host or containerised agents
ZeroMQ
Brokerless REQ/REP and PUB/SUB socket patterns
Rejected
Pros
  • Flexible topology — dealer/router enables complex fan-out patterns
  • Low overhead; no central broker required
Cons
  • No schema enforcement; protocol versioning is manual
  • No persistence — slow consumers silently drop messages
NATS
Cloud-native pub/sub with JetStream persistence option
Rejected
Pros
  • Simple subscribe/publish API; JetStream adds durability
  • Mature Go client with good observability hooks
Cons
  • Additional infrastructure dependency alongside Redis
  • Less operational familiarity than Redis within the team
POSIX Message Queues
Kernel-managed mq_open / mq_send with priority support
Rejected
Pros
  • OS-native with no external dependencies
  • Built-in message priority queue semantics
Cons
  • Linux-only — breaks macOS dev and container portability
  • Fixed maximum message size; fragmentation is manual

Latency Comparison

Approximate round-trip latency by transport (log-normalised bars):
  Shared Memory          ~100 ns
  Unix Domain Sockets    ~130 µs
  gRPC (local, chosen)    5–20 ms
  gRPC (network)         20–50 ms

Bars are log-normalised; raw latency spans nearly six orders of magnitude. The orchestrator's control-plane calls are infrequent enough (task dispatch, status polls) that the 5–20 ms gRPC overhead is acceptable in exchange for schema safety and distributed traceability. Unix sockets are retained for the hot path where sub-millisecond latency matters.

The Decision

The IPC architecture is split into three complementary layers, each matched to the latency and durability requirements of the messages it carries.

gRPC

Control Plane — gRPC (proto3)

All orchestrator↔agent control messages travel over gRPC: task assignment, heartbeat, cancellation, and result acknowledgement. Protobuf contracts are compiled into Go stubs at build time, so an incompatible change fails the build rather than surfacing as a runtime panic at 3 am. HTTP/2 multiplexing means a single connection carries concurrent streams without head-of-line blocking.
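As an illustrative sketch only — the service and message names below are assumptions, not the actual contract — the control-plane surface might look like:

```protobuf
syntax = "proto3";

package orchestrator.v1;

// Hypothetical control-plane contract; names are illustrative.
service AgentControl {
  // Long-running bidirectional session: the orchestrator pushes
  // directives while the agent streams back progress deltas.
  rpc Session(stream AgentUpdate) returns (stream Directive);
  rpc Heartbeat(HeartbeatRequest) returns (HeartbeatResponse);
}

message Directive {
  string task_id  = 1;
  string trace_id = 2;  // propagated by interceptors for tracing
  bytes  payload  = 3;
}

message AgentUpdate {
  string task_id      = 1;
  int32  progress_pct = 2;
  string status       = 3;
}

message HeartbeatRequest  { string agent_id = 1; }
message HeartbeatResponse { bool acknowledged = 1; }
```

An incompatible field change here fails stub generation at build time, which is exactly the failure mode the control plane wants.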

Bidirectional streaming is used for long-running agent sessions; the orchestrator pushes directive updates while the agent streams back progress deltas. gRPC interceptors attach trace_id propagation to every call, giving end-to-end distributed traces across agent boundaries at zero agent-side effort.

Redis

Event Plane — Redis Streams

Asynchronous domain events (agent lifecycle, tool invocations, LLM token usage, audit entries) are written to Redis Streams. Consumer groups give each downstream subscriber its own cursor, so the metrics service, audit logger, and reactive UI feed all consume independently without blocking each other or the agent.

Pending-entry lists and XACK ensure at-least-once delivery even when a consumer crashes mid-processing. Retention is capped with approximate trimming (MAXLEN ~) to bound memory while preserving a rolling audit window. Redis is already in the stack for distributed locking (Decision #1), so this is not a new dependency.
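At the command level, the event-plane lifecycle looks roughly like this (stream, group, and field names are illustrative):

```
# Create the consumer group once; '$' starts at new entries,
# MKSTREAM creates the stream if it does not exist yet
XGROUP CREATE events metrics $ MKSTREAM

# Producer: append an event with approximate trimming (MAXLEN ~) to bound memory
XADD events MAXLEN ~ 1000000 * type agent.spawned agent_id a-42

# Consumer: '>' asks for entries never delivered to this group
XREADGROUP GROUP metrics worker-1 COUNT 32 BLOCK 5000 STREAMS events >

# Acknowledge after successful processing; unacked IDs stay in the pending-entry list
XACK events metrics 1526919030474-0
```

Each subscriber (metrics, audit, UI) runs the same read/ack loop under its own group name, which is what gives every consumer an independent cursor.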

UDS

Hot Path — Unix Domain Sockets

A small set of latency-critical paths — notably the local tool-execution bridge and in-process agent sidecar — bypass gRPC and communicate over Unix domain sockets with a minimal length-prefixed framing protocol. This keeps tool-call overhead below 200 µs while the rest of the system benefits from gRPC's observability.

The UDS layer is internal-only; no external agent ever connects to it. The interface is typed via a shared Go struct rather than a proto file, keeping the hot-path evolution separate from the versioned control-plane contract.

Why not collapse all three into one? A single transport would require trading off durability, latency, and schema safety simultaneously. gRPC without Redis means losing the audit log and at-least-once delivery. Redis without gRPC means untyped command dispatch. Unix sockets everywhere means no distributed tracing. The three-layer split is the lowest-complexity design that satisfies all constraints without compromise.

Trade-offs Accepted

Proto compilation build step. Every agent must include protobuf tooling in its build pipeline and regenerate stubs on schema changes. This is manageable with a shared buf.gen.yaml and CI enforcement, but it adds friction for contributors who only want to change business logic.
Redis as an additional runtime dependency. Redis must be available before any agent can receive work. Failure modes (OOM eviction, replication lag) need explicit handling in the orchestrator's health-check loop. Mitigated by the fact that Redis is already required for distributed locking.
Three-layer IPC complexity. Developers must understand when to use gRPC versus Redis Streams versus Unix sockets. Without clear documentation, new contributors default to the wrong layer. Addressed by codifying the decision boundary in the architecture runbook.
Consumer group management overhead. Adding a new event consumer requires creating its consumer group, handling NOGROUP errors on first start, and configuring retention policy. A helper package wraps this once; all agents import it rather than calling Redis directly.