Decision #7 — agentic-orchestrator

Model Gateway

Why LiteLLM + Judge-Router over OpenRouter, direct APIs, or manual thresholds

The Question

Not every task needs Opus. A smart router matching complexity to capability saves 60–90% on API costs.

The orchestrator dispatches tasks ranging from trivial extraction to novel multi-hop reasoning. Sending everything to the frontier model is a waste; sending complex work to a cheap model produces failures. The gateway layer must classify task complexity, route to the appropriate tier, and fall back gracefully when a provider is unavailable — without coupling business logic to any single API contract.

Options Considered

LiteLLM + Judge-Router
Sidecar proxy with Haiku complexity judge
Chosen
Pros
  • 100+ models behind a single OpenAI-compatible endpoint
  • RouteLLM methodology validated in production at scale
  • Haiku judge adds semantic classification, not just token counting
  • Built-in retry, fallback, and per-model cost tracking
Cons
  • Sidecar process adds deployment surface and health-check responsibility
  • Judge misclassification can under-route complex tasks to cheaper models
OpenRouter
Single vendor API aggregator
Rejected
Pros
  • One API key for all providers
  • Zero infrastructure to manage
Cons
  • No local routing logic — all decisions happen in the cloud
  • Vendor markup on top of provider pricing
  • No control over retry policies or fallback chains
  • Cannot classify tasks before choosing a model
Direct Provider APIs
Anthropic, OpenAI, Google SDKs individually
Rejected
Pros
  • Maximum flexibility and direct access to provider-specific features
Cons
  • Highest code burden — N provider clients to maintain and version
  • Fallback logic must be hand-rolled for every call site
  • No unified cost metering without extra instrumentation
  • API shape differences leak into business logic
Manual Threshold Routing
Rule-based dispatch on token count or task type tag
Rejected
Pros
  • Simple to reason about — deterministic rules
Cons
  • Brittle — misses complexity not captured by token count
  • No semantic understanding of task difficulty
  • Requires constant manual tuning as task patterns evolve
  • Cannot generalise across novel agent workflows

3-Tier Routing

The judge routes each request to exactly one tier. Per-token cost rises roughly an order of magnitude per tier; the vast majority of agent work lands in Tier 1.

Tier 1: Haiku, ~$0.25 / M tokens (cheapest)
  • Complexity judge itself
  • Simple Q&A and lookup
  • Field extraction and parsing
  • Short summarisation
  • Format conversion

Tier 2: Sonnet, ~$3 / M tokens (mid)
  • Code generation and review
  • Multi-step analysis
  • Tool-augmented reasoning
  • Structured data synthesis
  • Agent sub-task orchestration

Tier 3: Opus, ~$15 / M tokens (frontier)
  • Architectural design decisions
  • Novel research synthesis
  • Adversarial debate judging
  • Complex multi-constraint planning
  • Security and compliance review
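
The tier table can be expressed as a simple lookup. A minimal sketch, where the model identifiers and prices are illustrative assumptions mirroring the table above rather than pinned deployment values:

```python
# Illustrative routing table for the three tiers. Model ids and prices are
# assumptions, not pinned deployment values.
TIERS = {
    "cheap":    {"model": "claude-3-5-haiku", "usd_per_mtok": 0.25},
    "mid":      {"model": "claude-sonnet",    "usd_per_mtok": 3.00},
    "frontier": {"model": "claude-opus",      "usd_per_mtok": 15.00},
}

def model_for(tier: str) -> str:
    """Resolve a judge-assigned tier label to a concrete model id."""
    return TIERS[tier]["model"]

def estimated_cost_usd(tier: str, tokens: int) -> float:
    """Rough per-request cost at the tier's listed price."""
    return TIERS[tier]["usd_per_mtok"] * tokens / 1_000_000
```

Because the tier label, not the model id, is what the judge emits, swapping a model inside a tier touches only this table.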

Fallback Chain

LiteLLM handles provider failures transparently. The chain below executes in order; the orchestrator never retries manually.

Provider Fallback Order (per-request):
  1. Claude (primary) — on HTTP 500, fail over to
  2. Gemini (failover) — on timeout, fall back to
  3. Ollama (local) — on failure, send to
  4. Queue (retry later)

Ollama provides a fully local fallback with no API cost — quality degrades but availability is preserved. Tasks that exhaust all providers are enqueued for retry when the primary provider recovers.
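
The chain's semantics can be sketched as follows. In production this loop lives inside LiteLLM, not in orchestrator code; the provider names and the enqueue call here are illustrative stand-ins:

```python
# Illustrative walk of the fallback chain: try each provider in order,
# enqueue the task for later retry if every provider fails.
from typing import Callable, Optional

FALLBACK_ORDER = ["claude", "gemini", "ollama"]

def call_with_fallback(task: str,
                       providers: dict,
                       enqueue: Callable[[str], None]) -> Optional[str]:
    for name in FALLBACK_ORDER:
        try:
            return providers[name](task)   # first success wins
        except Exception:
            continue                       # HTTP 500 / timeout / local failure
    enqueue(task)                          # exhausted: retry when primary recovers
    return None
```

Note that the orchestrator sees only the final outcome: a response or an enqueued task, never an intermediate provider error.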

The Decision

LiteLLM abstracts 100+ models behind one OpenAI-compatible interface. Business logic calls one endpoint; the gateway handles provider selection, key rotation, retry budgets, and cost metering. Swapping Claude for Gemini at any tier requires a config change, not a code change.

Judge-Router adds intelligence. A Haiku call (cheap, fast) classifies the incoming task before routing. This is the RouteLLM methodology: use a small model to decide which large model to invoke. The judge's misclassification rate is bounded; when in doubt it escalates to the next tier, preserving correctness at the cost of one tier of extra spend.


Model Abstraction

LiteLLM translates every provider's API surface into the OpenAI chat completions schema. The orchestrator sends one message format and receives one response format, regardless of whether the call went to claude-3-5-haiku, gemini-2.0-flash, or a local llama3.3 instance. Cost and latency metrics are emitted as structured logs per request.
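
The translation can be illustrated with a stub that collapses two provider payload shapes onto the OpenAI chat-completions shape. The payload field names below approximate real Anthropic and Gemini response schemas but should be treated as assumptions for illustration:

```python
# Sketch: normalise provider-specific response shapes into one OpenAI-style
# schema, so business logic only ever sees {"choices": [{"message": ...}]}.
def to_openai_schema(provider: str, raw: dict) -> dict:
    if provider == "anthropic":       # Anthropic-style: list of content blocks
        text = raw["content"][0]["text"]
    elif provider == "gemini":        # Gemini-style: candidates / parts
        text = raw["candidates"][0]["content"]["parts"][0]["text"]
    else:                             # assume already OpenAI-shaped
        return raw
    return {"choices": [{"message": {"role": "assistant", "content": text}}]}
```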


Judge-Router Intelligence

Before each non-trivial task, a Haiku call evaluates complexity across four axes: reasoning depth, domain specificity, output length, and tool use count. The result is a tier label (cheap / mid / frontier) injected into the LiteLLM routing header. The judge adds ~200 ms and costs ~0.001¢ per classification — negligible against the savings from tier separation.

Simple tasks (extraction, lookup) bypass the judge and are hard-routed to Tier 1 without classification overhead.
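
The decision logic, including the Tier 1 bypass, might look like this sketch. The scoring input is stubbed: in production the four axis scores would come from the Haiku judge call, and the thresholds here are illustrative, not tuned values:

```python
# Sketch of the judge-router: known-simple task types bypass the judge
# entirely; everything else is scored on four axes and mapped to a tier.
SIMPLE_TASK_TYPES = {"extraction", "lookup"}   # hard-routed to Tier 1

def route(task_type: str, scores: dict = None) -> str:
    """Return a tier label: 'cheap', 'mid', or 'frontier'.

    `scores` holds 0-10 ratings for reasoning depth, domain specificity,
    output length, and tool-use count. Non-simple tasks must supply them;
    in production they come from a Haiku judge completion.
    """
    if task_type in SIMPLE_TASK_TYPES:
        return "cheap"                         # no classification overhead
    total = sum(scores.values())               # max 40 across the four axes
    if total >= 28:                            # illustrative threshold
        return "frontier"
    if total >= 14:                            # illustrative threshold
        return "mid"
    return "cheap"
```

When the judge is unsure (a score near a threshold), biasing the thresholds downward implements the "escalate when in doubt" policy described above.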


Resilience Without Code

LiteLLM's fallbacks config declaratively expresses the provider chain. Retry budgets, exponential backoff, and per-provider circuit breakers are configured in YAML, not in agent code. This means fallback behaviour can be updated at runtime without redeploying agents — a critical property when provider outages require fast reconfiguration.
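
A config in this shape might express the chain declaratively. The structure below follows LiteLLM's proxy config conventions (`model_list`, `router_settings.fallbacks`), but model names and retry numbers are illustrative and the keys should be verified against the deployed LiteLLM version:

```yaml
# Sketch of a LiteLLM proxy config expressing the provider fallback chain.
model_list:
  - model_name: claude
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
  - model_name: gemini
    litellm_params:
      model: gemini/gemini-2.0-flash
  - model_name: ollama-local
    litellm_params:
      model: ollama/llama3.3

router_settings:
  num_retries: 2          # per-provider retry budget before failing over
  fallbacks:
    - claude: ["gemini", "ollama-local"]
```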

Trade-offs Accepted

Misclassification risk. The Haiku judge is not perfect. A complex architectural task scored as “mid” will run on Sonnet and may produce a lower-quality output. Mitigation: the orchestrator can inspect output quality signals and escalate on retry; an explicit force_tier header allows callers to override the judge for known-complex tasks.

Sidecar process complexity. LiteLLM runs as a separate process. It requires health checking, graceful restart on config change, and inclusion in the supervisor tree. This is an additional operational surface compared to a pure library call — accepted because the abstraction benefit outweighs the ops cost.

Routing latency overhead. The judge call adds ~200 ms to non-trivial tasks. For low-latency interactive use cases this is measurable. Tasks with a pre-determined tier (e.g., all code-review tasks go to Sonnet) skip the judge entirely and incur zero overhead.

Per-request cost metering dependency. Accurate cost accounting requires that LiteLLM's token counting is correct for each model. Provider token definitions differ subtly; billing surprises are possible until per-model cost curves are validated empirically against actual invoices.

The 60–90% cost saving estimate assumes roughly 70% of tasks route to Tier 1, 25% to Tier 2, and 5% to Tier 3. In practice the split depends on workload composition. Monitor the tier distribution in week one and retune judge thresholds if the Tier 1 rate falls below 60% — that is a signal the judge is over-escalating.
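
The arithmetic behind the estimate, using the assumed 70/25/5 split and the per-tier prices listed earlier, works out as follows:

```python
# Blended per-million-token cost under the assumed tier split, compared
# with routing everything to the frontier model.
split = {"cheap": 0.70, "mid": 0.25, "frontier": 0.05}
price = {"cheap": 0.25, "mid": 3.00, "frontier": 15.00}   # USD / M tokens

blended = sum(split[t] * price[t] for t in split)   # 0.175 + 0.75 + 0.75
saving_vs_opus = 1 - blended / price["frontier"]

print(f"blended: ${blended:.3f}/M, saving vs all-Opus: {saving_vs_opus:.0%}")
```

Under these assumptions the blended cost is about $1.68/M, an ~89% saving versus all-Opus, which is why the headline range tops out near 90%: the saving erodes quickly as work shifts out of Tier 1.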