Decision #7 — agentic-orchestrator

Model Gateway

Why LiteLLM + Judge-Router over OpenRouter, direct APIs, or manual thresholds

The Question

Not every task needs Opus. A smart router matching complexity to capability saves 60–90% on API costs.

The orchestrator dispatches tasks ranging from trivial extraction to novel multi-hop reasoning. Sending everything to the frontier model is a waste; sending complex work to a cheap model produces failures. The gateway layer must classify task complexity, route to the appropriate tier, and fall back gracefully when a provider is unavailable — without coupling business logic to any single API contract.

Options Considered

LiteLLM + Judge-Router
Sidecar proxy with Haiku complexity judge
Chosen
Pros
  • 100+ models behind a single OpenAI-compatible endpoint
  • RouteLLM methodology validated in production at scale
  • Haiku judge adds semantic classification, not just token counting
  • Built-in retry, fallback, and per-model cost tracking
Cons
  • Sidecar process adds deployment surface and health-check responsibility
  • Judge misclassification can under-route complex tasks to cheaper models
OpenRouter
Single vendor API aggregator
Rejected
Pros
  • One API key for all providers
  • Zero infrastructure to manage
Cons
  • No local routing logic — all decisions happen in the cloud
  • Vendor markup on top of provider pricing
  • No control over retry policies or fallback chains
  • Cannot classify tasks before choosing a model
Direct Provider APIs
Anthropic, OpenAI, Google SDKs individually
Rejected
Pros
  • Maximum flexibility and direct access to provider-specific features
Cons
  • Highest code burden — N provider clients to maintain and version
  • Fallback logic must be hand-rolled for every call site
  • No unified cost metering without extra instrumentation
  • API shape differences leak into business logic
Manual Threshold Routing
Rule-based dispatch on token count or task type tag
Rejected
Pros
  • Simple to reason about — deterministic rules
Cons
  • Brittle — misses complexity not captured by token count
  • No semantic understanding of task difficulty
  • Requires constant manual tuning as task patterns evolve
  • Cannot generalise across novel agent workflows

3-Tier Routing

The judge routes each request to exactly one tier. Per-token cost rises roughly an order of magnitude per tier; the vast majority of agent work lands in Tier 1.

Tier 1: Haiku, ~$0.25 / M tokens (cheapest)
  • Complexity judge itself
  • Simple Q&A and lookup
  • Field extraction and parsing
  • Short summarisation
  • Format conversion

Tier 2: Sonnet, ~$3 / M tokens (mid)
  • Code generation and review
  • Multi-step analysis
  • Tool-augmented reasoning
  • Structured data synthesis
  • Agent sub-task orchestration

Tier 3: Opus, ~$15 / M tokens (frontier)
  • Architectural design decisions
  • Novel research synthesis
  • Adversarial debate judging
  • Complex multi-constraint planning
  • Security and compliance review
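
The tier table can be expressed as a simple lookup. A minimal sketch, where the model identifiers and prices are illustrative assumptions mirroring the table above rather than pinned deployment values:

```python
# Illustrative routing table for the three tiers. Model ids and prices are
# assumptions, not pinned deployment values.
TIERS = {
    "cheap":    {"model": "claude-3-5-haiku", "usd_per_mtok": 0.25},
    "mid":      {"model": "claude-sonnet",    "usd_per_mtok": 3.00},
    "frontier": {"model": "claude-opus",      "usd_per_mtok": 15.00},
}

def model_for(tier: str) -> str:
    """Resolve a judge-assigned tier label to a concrete model id."""
    return TIERS[tier]["model"]

def estimated_cost_usd(tier: str, tokens: int) -> float:
    """Rough per-request cost at the tier's listed price."""
    return TIERS[tier]["usd_per_mtok"] * tokens / 1_000_000
```

Because the tier label, not the model id, is what the judge emits, swapping a model inside a tier touches only this table.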

Fallback Chain

LiteLLM handles provider failures transparently. The chain below executes in order; the orchestrator never retries manually.

Provider Fallback Order (per-request):
  1. Claude (primary) — on HTTP 500, fail over to
  2. Gemini (failover) — on timeout, fall back to
  3. Ollama (local) — on failure, send to
  4. Queue (retry later)

Ollama provides a fully local fallback with no API cost — quality degrades but availability is preserved. Tasks that exhaust all providers are enqueued for retry when the primary provider recovers.
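
The chain's semantics can be sketched as follows. In production this loop lives inside LiteLLM, not in orchestrator code; the provider names and the enqueue call here are illustrative stand-ins:

```python
# Illustrative walk of the fallback chain: try each provider in order,
# enqueue the task for later retry if every provider fails.
from typing import Callable, Optional

FALLBACK_ORDER = ["claude", "gemini", "ollama"]

def call_with_fallback(task: str,
                       providers: dict,
                       enqueue: Callable[[str], None]) -> Optional[str]:
    for name in FALLBACK_ORDER:
        try:
            return providers[name](task)   # first success wins
        except Exception:
            continue                       # HTTP 500 / timeout / local failure
    enqueue(task)                          # exhausted: retry when primary recovers
    return None
```

Note that the orchestrator sees only the final outcome: a response or an enqueued task, never an intermediate provider error.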

The Decision

LiteLLM abstracts 100+ models behind one OpenAI-compatible interface. Business logic calls one endpoint; the gateway handles provider selection, key rotation, retry budgets, and cost metering. Swapping Claude for Gemini at any tier requires a config change, not a code change.

Judge-Router adds intelligence. A Haiku call (cheap, fast) classifies the incoming task before routing. This is the RouteLLM methodology: use a small model to decide which large model to invoke. The judge's misclassification rate is bounded; when in doubt it escalates to the next tier, preserving correctness at the cost of one tier of extra spend.


Model Abstraction

LiteLLM translates every provider's API surface into the OpenAI chat completions schema. The orchestrator sends one message format and receives one response format, regardless of whether the call went to claude-3-5-haiku, gemini-2.0-flash, or a local llama3.3 instance. Cost and latency metrics are emitted as structured logs per request.
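
The translation can be illustrated with a stub that collapses two provider payload shapes onto the OpenAI chat-completions shape. The payload field names below approximate real Anthropic and Gemini response schemas but should be treated as assumptions for illustration:

```python
# Sketch: normalise provider-specific response shapes into one OpenAI-style
# schema, so business logic only ever sees {"choices": [{"message": ...}]}.
def to_openai_schema(provider: str, raw: dict) -> dict:
    if provider == "anthropic":       # Anthropic-style: list of content blocks
        text = raw["content"][0]["text"]
    elif provider == "gemini":        # Gemini-style: candidates / parts
        text = raw["candidates"][0]["content"]["parts"][0]["text"]
    else:                             # assume already OpenAI-shaped
        return raw
    return {"choices": [{"message": {"role": "assistant", "content": text}}]}
```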


Judge-Router Intelligence

Before each non-trivial task, a Haiku call evaluates complexity across four axes: reasoning depth, domain specificity, output length, and tool use count. The result is a tier label (cheap / mid / frontier) injected into the LiteLLM routing header. The judge adds ~200 ms and costs ~0.001¢ per classification — negligible against the savings from tier separation.

Simple tasks (extraction, lookup) bypass the judge and are hard-routed to Tier 1 without classification overhead.
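
The decision logic, including the Tier 1 bypass, might look like this sketch. The scoring input is stubbed: in production the four axis scores would come from the Haiku judge call, and the thresholds here are illustrative, not tuned values:

```python
# Sketch of the judge-router: known-simple task types bypass the judge
# entirely; everything else is scored on four axes and mapped to a tier.
SIMPLE_TASK_TYPES = {"extraction", "lookup"}   # hard-routed to Tier 1

def route(task_type: str, scores: dict = None) -> str:
    """Return a tier label: 'cheap', 'mid', or 'frontier'.

    `scores` holds 0-10 ratings for reasoning depth, domain specificity,
    output length, and tool-use count. Non-simple tasks must supply them;
    in production they come from a Haiku judge completion.
    """
    if task_type in SIMPLE_TASK_TYPES:
        return "cheap"                         # no classification overhead
    total = sum(scores.values())               # max 40 across the four axes
    if total >= 28:                            # illustrative threshold
        return "frontier"
    if total >= 14:                            # illustrative threshold
        return "mid"
    return "cheap"
```

When the judge is unsure (a score near a threshold), biasing the thresholds downward implements the "escalate when in doubt" policy described above.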


Resilience Without Code

LiteLLM's fallbacks config declaratively expresses the provider chain. Retry budgets, exponential backoff, and per-provider circuit breakers are configured in YAML, not in agent code. This means fallback behaviour can be updated at runtime without redeploying agents — a critical property when provider outages require fast reconfiguration.
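
A config in this shape might express the chain declaratively. The structure below follows LiteLLM's proxy config conventions (`model_list`, `router_settings.fallbacks`), but model names and retry numbers are illustrative and the keys should be verified against the deployed LiteLLM version:

```yaml
# Sketch of a LiteLLM proxy config expressing the provider fallback chain.
model_list:
  - model_name: claude
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
  - model_name: gemini
    litellm_params:
      model: gemini/gemini-2.0-flash
  - model_name: ollama-local
    litellm_params:
      model: ollama/llama3.3

router_settings:
  num_retries: 2          # per-provider retry budget before failing over
  fallbacks:
    - claude: ["gemini", "ollama-local"]
```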

Trade-offs Accepted

Misclassification risk. The Haiku judge is not perfect. A complex architectural task scored as “mid” will run on Sonnet and may produce a lower-quality output. Mitigation: the orchestrator can inspect output quality signals and escalate on retry; an explicit force_tier header allows callers to override the judge for known-complex tasks.

Sidecar process complexity. LiteLLM runs as a separate process. It requires health checking, graceful restart on config change, and inclusion in the supervisor tree. This is an additional operational surface compared to a pure library call — accepted because the abstraction benefit outweighs the ops cost.

Routing latency overhead. The judge call adds ~200 ms to non-trivial tasks. For low-latency interactive use cases this is measurable. Tasks with a pre-determined tier (e.g., all code-review tasks go to Sonnet) skip the judge entirely and incur zero overhead.

Per-request cost metering dependency. Accurate cost accounting requires that LiteLLM's token counting is correct for each model. Provider token definitions differ subtly; billing surprises are possible until per-model cost curves are validated empirically against actual invoices.

The 60–90% cost saving estimate assumes roughly 70% of tasks route to Tier 1, 25% to Tier 2, and 5% to Tier 3. In practice the split depends on workload composition. Monitor the tier distribution in week one and retune judge thresholds if the Tier 1 rate falls below 60% — that is a signal the judge is over-escalating.
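
The arithmetic behind the estimate, using the assumed 70/25/5 split and the per-tier prices listed earlier, works out as follows:

```python
# Blended per-million-token cost under the assumed tier split, compared
# with routing everything to the frontier model.
split = {"cheap": 0.70, "mid": 0.25, "frontier": 0.05}
price = {"cheap": 0.25, "mid": 3.00, "frontier": 15.00}   # USD / M tokens

blended = sum(split[t] * price[t] for t in split)   # 0.175 + 0.75 + 0.75
saving_vs_opus = 1 - blended / price["frontier"]

print(f"blended: ${blended:.3f}/M, saving vs all-Opus: {saving_vs_opus:.0%}")
```

Under these assumptions the blended cost is about $1.68/M, an ~89% saving versus all-Opus, which is why the headline range tops out near 90%: the saving erodes quickly as work shifts out of Tier 1.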