Why LiteLLM + Judge-Router over OpenRouter, direct APIs, or manual thresholds
Not every task needs Opus. A smart router matching complexity to capability saves 60–90% on API costs.
The orchestrator dispatches tasks ranging from trivial extraction to novel multi-hop reasoning. Sending everything to the frontier model is a waste; sending complex work to a cheap model produces failures. The gateway layer must classify task complexity, route to the appropriate tier, and fall back gracefully when a provider is unavailable — without coupling business logic to any single API contract.
The judge routes each request to exactly one tier. Per-token cost climbs several-fold with each step up the ladder; the vast majority of agent work lands in Tier 1.
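The savings claim can be sanity-checked with back-of-envelope arithmetic. The per-tier prices and the traffic mix below are illustrative assumptions, not measured values:

```python
# Blended cost of judge routing vs. sending everything to the frontier tier.
PRICE_PER_MTOK = {"cheap": 0.80, "mid": 3.00, "frontier": 15.00}  # USD, illustrative
TRAFFIC_MIX = {"cheap": 0.80, "mid": 0.15, "frontier": 0.05}      # assumed task shares

def blended_cost(prices: dict, mix: dict) -> float:
    """Expected cost per million tokens under the given tier distribution."""
    return sum(prices[tier] * share for tier, share in mix.items())

routed = blended_cost(PRICE_PER_MTOK, TRAFFIC_MIX)   # $1.84 per Mtok
all_frontier = PRICE_PER_MTOK["frontier"]            # $15.00 per Mtok
saving = 1 - routed / all_frontier                   # ~88%, inside the 60-90% range
```

Under this assumed mix the blended rate is $1.84/Mtok against $15/Mtok frontier-only, an ~88% saving; the real figure depends entirely on the workload's tier distribution.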
LiteLLM handles provider failures transparently. The chain below executes in order; the orchestrator never retries manually.
Ollama provides a fully local fallback with no API cost — quality degrades but availability is preserved. Tasks that exhaust all providers are enqueued for retry when the primary provider recovers.
LiteLLM abstracts 100+ models behind one OpenAI-compatible interface. Business logic calls one endpoint; the gateway handles provider selection, key rotation, retry budgets, and cost metering. Swapping Claude for Gemini at any tier requires a config change, not a code change.
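The "config change, not code change" property reduces to one rule: only a tier table knows concrete model names, and callers name tiers. A sketch with illustrative model identifiers:

```python
# Only this table knows provider-specific model names; callers never do.
TIER_MODELS = {
    "cheap":    "anthropic/claude-3-5-haiku",
    "mid":      "anthropic/claude-3-7-sonnet",
    "frontier": "anthropic/claude-3-opus",
}

def resolve(tier: str) -> str:
    """Business logic calls resolve(); it never imports a provider SDK."""
    return TIER_MODELS[tier]

# Swapping Claude for Gemini at the mid tier touches only the table:
TIER_MODELS["mid"] = "gemini/gemini-2.0-flash"
```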
Judge-Router adds intelligence. A Haiku call (cheap, fast) classifies the incoming task before routing. This is the RouteLLM methodology: use a small model to decide which large model to invoke. The judge's misclassification rate is bounded; when in doubt it escalates to the next tier, preserving correctness at the cost of one tier of extra spend.
LiteLLM translates every provider's API surface into the OpenAI chat completions
schema. The orchestrator sends one message format and receives one response format,
regardless of whether the call went to claude-3-5-haiku,
gemini-2.0-flash, or a local llama3.3 instance.
Cost and latency metrics are emitted as structured logs per request.
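The shape of such a per-request record might look like the sketch below. Field names are assumptions; LiteLLM exposes equivalent data through its logging callbacks:

```python
import json
import time

def log_request(model: str, prompt_tokens: int, completion_tokens: int,
                cost_usd: float, started: float) -> str:
    """Serialize one request's cost/latency metrics as a structured log line."""
    record = {
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": round(cost_usd, 6),
        "latency_ms": round((time.monotonic() - started) * 1000, 1),
    }
    return json.dumps(record)  # one JSON object per line: grep- and ingest-friendly
```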
Before each non-trivial task, a Haiku call evaluates complexity across four axes:
reasoning depth, domain specificity, output length, and tool use count. The result
is a tier label (cheap / mid / frontier)
injected into the LiteLLM routing header. The judge adds ~200 ms and costs ~0.001¢
per classification — negligible against the savings from tier separation.
Simple tasks (extraction, lookup) bypass the judge and are hard-routed to Tier 1 without classification overhead.
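The decision rule can be sketched as follows. In production the four axis scores come from the Haiku call; here they arrive pre-scored. The thresholds, 0-3 scoring scale, and bypass set are illustrative assumptions:

```python
BYPASS = {"extraction", "lookup"}  # hard-routed to Tier 1, no judge call
TIERS = ["cheap", "mid", "frontier"]

def route(task_kind: str, axes: dict[str, int], confident: bool = True) -> str:
    """axes: reasoning_depth, domain_specificity, output_length, tool_use,
    each scored 0-3 by the judge."""
    if task_kind in BYPASS:
        return "cheap"
    score = sum(axes.values())  # 0-12 aggregate complexity
    tier = "cheap" if score <= 4 else "mid" if score <= 8 else "frontier"
    if not confident and tier != "frontier":
        tier = TIERS[TIERS.index(tier) + 1]  # when in doubt, escalate one tier
    return tier
```

The escalate-on-doubt branch is what bounds the misclassification cost: an uncertain judge spends one extra tier rather than risking a failed output on a too-cheap model.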
LiteLLM's fallbacks config declaratively expresses the provider chain.
Retry budgets, exponential backoff, and per-provider circuit breakers are configured
in YAML, not in agent code. This means fallback behaviour can be updated at runtime
without redeploying agents — a critical property when provider outages require
fast reconfiguration.
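The declarative chain, shown here as the Python equivalent of the YAML that LiteLLM's proxy loads. The aliases and retry count are illustrative, and exact field names should be checked against the LiteLLM config reference:

```python
# Python mirror of the gateway's YAML config; in production this is the file
# LiteLLM reloads at runtime, so agents never redeploy for a chain change.
GATEWAY_CONFIG = {
    "model_list": [
        {"model_name": "tier1",
         "litellm_params": {"model": "anthropic/claude-3-5-haiku"}},
        {"model_name": "tier1-fallback",
         "litellm_params": {"model": "gemini/gemini-2.0-flash"}},
        {"model_name": "tier1-local",
         "litellm_params": {"model": "ollama/llama3.3"}},
    ],
    "router_settings": {
        # On failure of "tier1", try the listed aliases in order.
        "fallbacks": [{"tier1": ["tier1-fallback", "tier1-local"]}],
        "num_retries": 2,
    },
}
```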
Misclassification risk. The Haiku judge is not perfect. A
complex architectural task scored as “mid” will run on Sonnet and may
produce a lower-quality output. Mitigation: the orchestrator can inspect output
quality signals and escalate on retry; an explicit force_tier header
allows callers to override the judge for known-complex tasks.
Sidecar process complexity. LiteLLM runs as a separate process. It requires health checking, graceful restart on config change, and inclusion in the supervisor tree. This is an additional operational surface compared to a pure library call — accepted because the abstraction benefit outweighs the ops cost.
Routing latency overhead. The judge call adds ~200 ms to non-trivial tasks. For low-latency interactive use cases this is measurable. Tasks with a pre-determined tier (e.g., all code-review tasks go to Sonnet) skip the judge entirely and incur zero overhead.
Per-request cost metering dependency. Accurate cost accounting requires that LiteLLM's token counting is correct for each model. Provider token definitions differ subtly; billing surprises are possible until per-model cost curves are validated empirically against actual invoices.