Epic 3 · Merged

Model Gateway / Judge-Router

A unified model gateway that routes each task to an AI provider based on its complexity. A lightweight judge model classifies the task, and the gateway sends it to the cheapest capable tier, so complex work gets frontier-quality output while everything else runs on cheaper models. Target: 60–90% cost savings vs. naive all-frontier routing.

Status: Merged (PR #26)
Features Merged: 5
PRs Open: 0
Routing Tiers: 3
Providers: 4
Cost Savings: 60–90%
01 — Architecture

Request Flow

Every task enters through the Gateway interface. The Judge-Router classifies complexity, the config maps tiers to concrete models, the LiteLLM client dispatches via provider adapters, and cost is recorded on every response.

Diagram legend: Core gateway flow · External providers · Config / cost tracking
02 — Routing Tiers

Three-Tier Model Strategy

The judge model classifies every task into one of three tiers. Each tier has a primary model and a fallback chain configured in config/models.yaml.

Cheap: claude-haiku-4-5 · fallback: gpt-4o-mini → ollama/llama3 · $0.80 / $4.00 per 1M tokens (input / output)
Mid: claude-sonnet-4-6 · fallback: gpt-4o → gemini-pro · $3.00 / $15.00 per 1M tokens (input / output)
Frontier: claude-opus-4-6 · fallback: gpt-4o → claude-sonnet-4-6 · $15.00 / $75.00 per 1M tokens (input / output)
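
To see how the tier prices relate to the 60–90% savings target, the following sketch compares a routed workload with an all-frontier baseline. The prices are the ones listed above. The 70/20/10 traffic split and the names `TierShare`, `BlendedCost`, and `Savings` are assumptions made for illustration, and the cost of the judge call itself is ignored.

```go
package main

import "fmt"

// Price holds USD cost per one million tokens, as listed for each tier above.
type Price struct {
	Input, Output float64
}

// TierShare pairs a tier's price with the fraction of traffic routed to it.
// The shares used in main are hypothetical, not measured values.
type TierShare struct {
	Name  string
	Price Price
	Share float64
}

// BlendedCost returns the traffic-weighted cost in USD of a workload with
// inTok input tokens and outTok output tokens spread across the tiers.
func BlendedCost(mix []TierShare, inTok, outTok float64) float64 {
	total := 0.0
	for _, t := range mix {
		total += t.Share * (inTok*t.Price.Input + outTok*t.Price.Output) / 1e6
	}
	return total
}

// Savings returns the fractional saving of the routed cost versus the baseline.
func Savings(routed, baseline float64) float64 {
	return 1 - routed/baseline
}

func main() {
	mix := []TierShare{
		{"cheap", Price{0.80, 4.00}, 0.70},
		{"mid", Price{3.00, 15.00}, 0.20},
		{"frontier", Price{15.00, 75.00}, 0.10},
	}
	allFrontier := []TierShare{{"frontier", Price{15.00, 75.00}, 1.0}}

	// One million input tokens and one million output tokens in total.
	routed := BlendedCost(mix, 1e6, 1e6)
	baseline := BlendedCost(allFrontier, 1e6, 1e6)
	fmt.Printf("routed $%.2f vs baseline $%.2f: %.1f%% saved\n",
		routed, baseline, 100*Savings(routed, baseline))
}
```

With this assumed split the routed workload costs about $15.96 against $90.00 for all-frontier, a saving of roughly 82%. A different traffic mix moves the figure in either direction, which is why the real number has to come from recorded traffic.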
03 — Lifecycle

Request Lifecycle

1. Task Arrives: TaskSpec with description + metadata
2. Judge Classifies: Haiku model returns tier + rationale
3. Config Resolves: Tier maps to primary model + fallback chain
4. Adapter Formats: Provider adapter translates request to API shape
5. LiteLLM Dispatches: POST to /chat/completions unified endpoint
6. Cost Recorded: Model + tokens + estimated USD logged per request
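
The six steps above can be read as a small pipeline. The sketch below wires them together with plain function fields so each stage can be stubbed. `Pipeline`, `Handle`, `Result`, and the field names are hypothetical and are not taken from the merged code, and only the primary model is tried here.

```go
package main

import (
	"errors"
	"fmt"
)

// Tier, TaskSpec, and Result are simplified stand-ins for the real types.
type Tier string

type TaskSpec struct {
	Description string
	Metadata    map[string]string
}

type Result struct {
	Model                     string
	Text                      string
	InputTokens, OutputTokens int
}

// Pipeline wires the lifecycle stages together as plain functions so each
// one can be swapped for a stub. The field names are hypothetical.
type Pipeline struct {
	Classify func(TaskSpec) (Tier, error)               // step 2: judge
	Resolve  func(Tier) ([]string, error)               // step 3: primary model first, then fallbacks
	Dispatch func(model, prompt string) (Result, error) // steps 4-5: adapter + LiteLLM
	Record   func(Result)                               // step 6: cost log
}

// Handle runs one task through the stages in order and stops at the first error.
func (p Pipeline) Handle(task TaskSpec) (Result, error) {
	tier, err := p.Classify(task)
	if err != nil {
		return Result{}, fmt.Errorf("classify: %w", err)
	}
	models, err := p.Resolve(tier)
	if err != nil {
		return Result{}, fmt.Errorf("resolve tier %q: %w", tier, err)
	}
	if len(models) == 0 {
		return Result{}, errors.New("tier resolved to no models")
	}
	// Only the primary model is tried here; fallback handling is left out.
	res, err := p.Dispatch(models[0], task.Description)
	if err != nil {
		return Result{}, fmt.Errorf("dispatch to %s: %w", models[0], err)
	}
	p.Record(res)
	return res, nil
}

func main() {
	p := Pipeline{
		Classify: func(TaskSpec) (Tier, error) { return "cheap", nil },
		Resolve: func(Tier) ([]string, error) {
			return []string{"claude-haiku-4-5", "gpt-4o-mini"}, nil
		},
		Dispatch: func(model, prompt string) (Result, error) {
			return Result{Model: model, Text: "stub reply", InputTokens: 12, OutputTokens: 3}, nil
		},
		Record: func(r Result) {
			fmt.Printf("recorded %s: %d in / %d out\n", r.Model, r.InputTokens, r.OutputTokens)
		},
	}
	res, err := p.Handle(TaskSpec{Description: "rename a variable"})
	fmt.Println(res.Text, err)
}
```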
04 — Features

Feature Breakdown

Five features cover the full gateway stack. Each has its own spec, branch, and implementation.

T1
Gateway Interface & Types
#32 · feature/32-feature-gateway-interface-types-t1
Shared contracts that all sub-systems build against. Defines Gateway, Router, Completer, and CostTracker interfaces plus all request/response types. No business logic, no I/O — pure type definitions.
  • internal/gateway/gateway.go
  • internal/gateway/errors.go
ModelTier enum · CompletionRequest · CompletionResponse · CostRecord · TimePeriod
T2 · T3
LiteLLM Client & Provider Adapters
#33 · feature/33-feature-litellm-client-provider-adapters-t2-t3
Unified HTTP client for the /chat/completions endpoint on the LiteLLM proxy. Provider adapters translate the canonical request into each API’s format: the Anthropic adapter clamps temperature to at most 1.0, the OpenAI adapter sets temperature to zero for reasoning models, and the Ollama adapter strips the ollama/ prefix from the model name.
  • internal/gateway/litellm.go
  • internal/gateway/providers/provider.go
  • internal/gateway/providers/anthropic.go
  • internal/gateway/providers/openai.go
  • internal/gateway/providers/ollama.go
LiteLLMClient · FormatAdapter · Functional options · Error normalisation
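
As a rough illustration of the three adapter behaviours described for this feature, the sketch below implements them against a simplified request type. The method names follow the FormatAdapter entry in the interface table, but the signatures, the `Request` type, and the `ReasoningModels` set (including the placeholder model name) are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Request is a simplified stand-in for the canonical completion request.
type Request struct {
	Model       string
	Temperature float64
}

// FormatAdapter uses the method names from the interface table;
// the exact signatures here are assumptions.
type FormatAdapter interface {
	Name() string
	FormatRequest(Request) Request
	ParseModelName(model string) string
}

// AnthropicAdapter clamps temperature to at most 1.0.
type AnthropicAdapter struct{}

func (AnthropicAdapter) Name() string                   { return "anthropic" }
func (AnthropicAdapter) ParseModelName(m string) string { return m }
func (AnthropicAdapter) FormatRequest(r Request) Request {
	if r.Temperature > 1.0 {
		r.Temperature = 1.0
	}
	return r
}

// OpenAIAdapter zeros temperature for models flagged as reasoning models.
// Which models count as such is left to the caller (hypothetical field).
type OpenAIAdapter struct {
	ReasoningModels map[string]bool
}

func (OpenAIAdapter) Name() string                   { return "openai" }
func (OpenAIAdapter) ParseModelName(m string) string { return m }
func (a OpenAIAdapter) FormatRequest(r Request) Request {
	if a.ReasoningModels[r.Model] {
		r.Temperature = 0
	}
	return r
}

// OllamaAdapter strips the "ollama/" routing prefix from the model name.
type OllamaAdapter struct{}

func (OllamaAdapter) Name() string { return "ollama" }
func (OllamaAdapter) ParseModelName(m string) string {
	return strings.TrimPrefix(m, "ollama/")
}
func (a OllamaAdapter) FormatRequest(r Request) Request {
	r.Model = a.ParseModelName(r.Model)
	return r
}

func main() {
	adapters := []FormatAdapter{
		AnthropicAdapter{},
		OpenAIAdapter{ReasoningModels: map[string]bool{"example-reasoning-model": true}},
		OllamaAdapter{},
	}
	in := Request{Model: "ollama/llama3", Temperature: 1.7}
	for _, a := range adapters {
		fmt.Printf("%-9s -> %+v\n", a.Name(), a.FormatRequest(in))
	}
}
```

Each adapter returns a modified copy rather than mutating its input, which keeps the canonical request reusable when a fallback provider has to format it again.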
T4
Judge-Router
#34 · feature/34-feature-judge-router-t4
The core routing intelligence. A Haiku-class judge model classifies task complexity in a single cheap call, then routes to the appropriate tier. Supports override tiers for callers that already know the target. Falls back to a configurable default tier on classification failure.
  • internal/gateway/router.go
  • internal/gateway/router_test.go
JudgeRouter · Classification prompt · Default tier fallback · Override flag
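
The routing rules described for this feature (override first, then the judge, then the default tier on failure) might look roughly like the sketch below. The judge call is replaced by a function value, and `JudgeFunc`, `OverrideTier`, and `Valid` are hypothetical names. Treating an unrecognised tier from the judge as a classification failure is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

type ModelTier string

const (
	TierCheap    ModelTier = "cheap"
	TierMid      ModelTier = "mid"
	TierFrontier ModelTier = "frontier"
)

// Valid reports whether t is one of the three known tiers.
func (t ModelTier) Valid() bool {
	return t == TierCheap || t == TierMid || t == TierFrontier
}

// TaskSpec carries the task description. OverrideTier is a hypothetical
// field: when set, the judge is skipped entirely.
type TaskSpec struct {
	Description  string
	OverrideTier ModelTier
}

// JudgeFunc stands in for the single cheap call to the judge model.
type JudgeFunc func(description string) (tier ModelTier, rationale string, err error)

type JudgeRouter struct {
	Judge       JudgeFunc
	DefaultTier ModelTier
}

// Route returns the override if present, otherwise the judge's tier,
// falling back to DefaultTier when classification fails.
func (r JudgeRouter) Route(task TaskSpec) (ModelTier, error) {
	if task.OverrideTier != "" {
		if !task.OverrideTier.Valid() {
			return "", fmt.Errorf("unknown override tier %q", task.OverrideTier)
		}
		return task.OverrideTier, nil
	}
	tier, _, err := r.Judge(task.Description)
	if err != nil || !tier.Valid() {
		return r.DefaultTier, nil
	}
	return tier, nil
}

func main() {
	failing := JudgeRouter{
		Judge: func(string) (ModelTier, string, error) {
			return "", "", errors.New("judge timeout")
		},
		DefaultTier: TierMid,
	}
	tier, _ := failing.Route(TaskSpec{Description: "refactor the billing module"})
	fmt.Println("judge failed, routed to:", tier)

	tier, _ = failing.Route(TaskSpec{Description: "prove a theorem", OverrideTier: TierFrontier})
	fmt.Println("override, routed to:", tier)
}
```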
T5 · T6
Config Loader & Fallback Chains
#37 · feature/37-feature-config-loader-fallback-chains-t5-t6
Operator-editable config/models.yaml for tier definitions, provider settings, fallback ordering, and cost tables. FallbackCompleter retries with the next provider in the chain on failure. Validated at startup — no hot-reload.
  • config/models.yaml
  • internal/gateway/config.go
  • internal/gateway/fallback.go
GatewayConfig · LoadConfig() · ValidateConfig() · FallbackCompleter
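
A minimal sketch of the fallback behaviour, assuming FallbackCompleter wraps another Completer and walks the chain in order. The `Inner` and `Chain` fields and the `flakyCompleter` stub are hypothetical. This version moves to the next model on any error, whereas a real implementation may only do so for retryable failures.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified request/response types; the real ones carry more fields.
type CompletionRequest struct{ Model, Prompt string }
type CompletionResponse struct{ Model, Text string }

type Completer interface {
	Complete(CompletionRequest) (CompletionResponse, error)
}

// FallbackCompleter tries each model in Chain (primary first) until one succeeds.
type FallbackCompleter struct {
	Inner Completer
	Chain []string
}

func (f FallbackCompleter) Complete(req CompletionRequest) (CompletionResponse, error) {
	if len(f.Chain) == 0 {
		return CompletionResponse{}, errors.New("fallback chain is empty")
	}
	var errs []error
	for _, model := range f.Chain {
		req.Model = model // req is a copy, so the caller's value is untouched
		resp, err := f.Inner.Complete(req)
		if err == nil {
			return resp, nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", model, err))
	}
	// Every model failed: report all attempts together.
	return CompletionResponse{}, errors.Join(errs...)
}

// flakyCompleter is a test stub that fails for the models marked as down.
type flakyCompleter struct{ down map[string]bool }

func (f flakyCompleter) Complete(req CompletionRequest) (CompletionResponse, error) {
	if f.down[req.Model] {
		return CompletionResponse{}, errors.New("provider unavailable")
	}
	return CompletionResponse{Model: req.Model, Text: "ok: " + req.Prompt}, nil
}

func main() {
	fc := FallbackCompleter{
		Inner: flakyCompleter{down: map[string]bool{"claude-sonnet-4-6": true}},
		Chain: []string{"claude-sonnet-4-6", "gpt-4o", "gemini-pro"},
	}
	resp, err := fc.Complete(CompletionRequest{Prompt: "hello"})
	fmt.Println(resp.Model, err)
}
```

In this run the primary model is marked as down, so the response comes from gpt-4o, the first fallback in the mid-tier chain.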
T7
Cost Tracking
#38 · feature/38-feature-cost-tracking-t7
In-memory cost tracker that logs model, token counts, and estimated USD for every request. Queryable by time period with per-tier aggregation. Thread-safe under concurrent Record calls via sync.RWMutex. Supplies the per-request data needed to check the 60–90% savings target against real traffic.
  • internal/gateway/cost.go
  • internal/gateway/cost_test.go
InMemoryCostTracker · EstimateCost() · Record() / Report() · RWMutex safety
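
The sketch below shows one way an in-memory tracker with per-tier aggregation and sync.RWMutex protection could be put together. The field names, the half-open time window, and the map-shaped report are assumptions. Only the type and method names come from the feature description.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type ModelTier string

// Price is USD per one million tokens.
type Price struct{ Input, Output float64 }

type CostRecord struct {
	Model                     string
	Tier                      ModelTier
	InputTokens, OutputTokens int
	EstimatedUSD              float64
	At                        time.Time
}

// TimePeriod is treated as the half-open interval [Start, End).
type TimePeriod struct{ Start, End time.Time }

// EstimateCost converts token counts into USD using a per-million price.
func EstimateCost(p Price, inputTokens, outputTokens int) float64 {
	return (float64(inputTokens)*p.Input + float64(outputTokens)*p.Output) / 1e6
}

type InMemoryCostTracker struct {
	mu      sync.RWMutex
	records []CostRecord
}

// Record appends under the write lock so concurrent callers are safe.
func (t *InMemoryCostTracker) Record(r CostRecord) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.records = append(t.records, r)
}

// Report sums estimated USD per tier for records inside the period.
// It takes only the read lock, so reports can run concurrently.
func (t *InMemoryCostTracker) Report(p TimePeriod) map[ModelTier]float64 {
	t.mu.RLock()
	defer t.mu.RUnlock()
	totals := make(map[ModelTier]float64)
	for _, r := range t.records {
		if r.At.Before(p.Start) || !r.At.Before(p.End) {
			continue
		}
		totals[r.Tier] += r.EstimatedUSD
	}
	return totals
}

func main() {
	tracker := &InMemoryCostTracker{}
	haiku := Price{Input: 0.80, Output: 4.00}
	now := time.Now()

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			tracker.Record(CostRecord{
				Model: "claude-haiku-4-5", Tier: "cheap",
				InputTokens: 1000, OutputTokens: 200,
				EstimatedUSD: EstimateCost(haiku, 1000, 200), At: now,
			})
		}()
	}
	wg.Wait()

	report := tracker.Report(TimePeriod{Start: now.Add(-time.Minute), End: now.Add(time.Minute)})
	fmt.Printf("cheap tier total: $%.4f\n", report["cheap"])
}
```

Running the example under `go run -race` is a quick way to confirm that the concurrent Record calls do not race.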
05 — Interfaces

Core Interface Contracts

| Interface | Methods | Implementor | Package |
| --- | --- | --- | --- |
| Gateway | Route(), Complete(), GetCostReport() | Top-level facade | internal/gateway |
| Router | Route(TaskSpec) (ModelTier, error) | JudgeRouter | internal/gateway |
| Completer | Complete(CompletionRequest) (CompletionResponse, error) | LiteLLMClient, FallbackCompleter | internal/gateway |
| CostTracker | Record(CostRecord), Report(TimePeriod) | InMemoryCostTracker | internal/gateway |
| FormatAdapter | Name(), FormatRequest(), ParseModelName() | Anthropic, OpenAI, Ollama | internal/gateway/providers |
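
Expressed in Go, the first four contracts and a facade that composes them could look like the sketch below. The Router and Completer signatures are as listed. The struct fields, the `CostReport` return type, the Gateway method signatures, and the stub implementations are assumptions added so the example compiles on its own.

```go
package main

import "fmt"

type ModelTier string

// The request/response fields below are placeholders.
type TaskSpec struct{ Description string }
type CompletionRequest struct {
	Tier   ModelTier
	Prompt string
}
type CompletionResponse struct{ Model, Text string }
type CostRecord struct{ Model string }
type TimePeriod struct{}
type CostReport struct{ Requests int } // hypothetical report shape

type Router interface {
	Route(TaskSpec) (ModelTier, error)
}

type Completer interface {
	Complete(CompletionRequest) (CompletionResponse, error)
}

type CostTracker interface {
	Record(CostRecord)
	Report(TimePeriod) CostReport
}

// Gateway is the top-level facade: routing, completion, and cost reporting.
type Gateway interface {
	Router
	Completer
	GetCostReport(TimePeriod) CostReport
}

type facade struct {
	router    Router
	completer Completer
	costs     CostTracker
}

// NewGateway composes the three sub-systems behind the Gateway interface.
func NewGateway(r Router, c Completer, t CostTracker) Gateway {
	return facade{router: r, completer: c, costs: t}
}

func (g facade) Route(t TaskSpec) (ModelTier, error) { return g.router.Route(t) }

// Complete delegates to the completer and records cost on every response.
func (g facade) Complete(req CompletionRequest) (CompletionResponse, error) {
	resp, err := g.completer.Complete(req)
	if err != nil {
		return CompletionResponse{}, err
	}
	g.costs.Record(CostRecord{Model: resp.Model})
	return resp, nil
}

func (g facade) GetCostReport(p TimePeriod) CostReport { return g.costs.Report(p) }

// Stub implementations, used only to exercise the facade.
type fixedRouter struct{ tier ModelTier }

func (r fixedRouter) Route(TaskSpec) (ModelTier, error) { return r.tier, nil }

type echoCompleter struct{}

func (echoCompleter) Complete(req CompletionRequest) (CompletionResponse, error) {
	return CompletionResponse{Model: "stub-" + string(req.Tier), Text: req.Prompt}, nil
}

type countingTracker struct{ n int }

func (c *countingTracker) Record(CostRecord)            { c.n++ }
func (c *countingTracker) Report(TimePeriod) CostReport { return CostReport{Requests: c.n} }

func main() {
	gw := NewGateway(fixedRouter{tier: "cheap"}, echoCompleter{}, &countingTracker{})
	tier, _ := gw.Route(TaskSpec{Description: "summarise a log line"})
	resp, _ := gw.Complete(CompletionRequest{Tier: tier, Prompt: "summarise a log line"})
	fmt.Println(resp.Model, gw.GetCostReport(TimePeriod{}).Requests)
}
```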
06 — Configuration

models.yaml Structure

config/models.yaml
    gateway:
      litellm_base_url: "http://localhost:4000"
      timeout_seconds: 30

    tiers:
      cheap:
        primary_model: "claude-haiku-4-5-20251001"
        fallback_chain: ["gpt-4o-mini", "ollama/llama3"]
      mid:
        primary_model: "claude-sonnet-4-6"
        fallback_chain: ["gpt-4o", "gemini-pro"]
      frontier:
        primary_model: "claude-opus-4-6"
        fallback_chain: ["gpt-4o", "claude-sonnet-4-6"]

    providers:  # API endpoints per vendor
      anthropic: { base_url: "https://api.anthropic.com" }
      openai: { base_url: "https://api.openai.com" }
      ollama: { base_url: "http://localhost:11434" }

    cost_per_million_tokens:  # pricing table for EstimateCost()
      claude-haiku-4-5-20251001: { input: 0.80, output: 4.00 }
      claude-sonnet-4-6: { input: 3.00, output: 15.00 }
      claude-opus-4-6: { input: 15.00, output: 75.00 }
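
Because the config is validated once at startup, a validation pass over the parsed structure might look like the sketch below. YAML parsing is left out, since it would need a third-party library. The struct layout and the specific checks (all three tiers present, a primary model per tier, a cost entry for each primary model) are assumptions, not the rules that ValidateConfig() actually enforces.

```go
package main

import (
	"errors"
	"fmt"
)

type TierConfig struct {
	PrimaryModel  string
	FallbackChain []string
}

// Price is USD per one million tokens.
type Price struct{ Input, Output float64 }

// GatewayConfig mirrors part of the models.yaml layout (assumed structure).
type GatewayConfig struct {
	LiteLLMBaseURL       string
	TimeoutSeconds       int
	Tiers                map[string]TierConfig
	CostPerMillionTokens map[string]Price
}

// ValidateConfig collects every problem it finds instead of stopping at
// the first one, so an operator can fix the file in a single pass.
func ValidateConfig(c GatewayConfig) error {
	var errs []error
	if c.LiteLLMBaseURL == "" {
		errs = append(errs, errors.New("gateway.litellm_base_url is required"))
	}
	if c.TimeoutSeconds <= 0 {
		errs = append(errs, errors.New("gateway.timeout_seconds must be positive"))
	}
	for _, name := range []string{"cheap", "mid", "frontier"} {
		tier, ok := c.Tiers[name]
		if !ok {
			errs = append(errs, fmt.Errorf("tiers.%s is missing", name))
			continue
		}
		if tier.PrimaryModel == "" {
			errs = append(errs, fmt.Errorf("tiers.%s.primary_model is required", name))
			continue
		}
		if _, priced := c.CostPerMillionTokens[tier.PrimaryModel]; !priced {
			errs = append(errs, fmt.Errorf("no cost entry for %s", tier.PrimaryModel))
		}
	}
	return errors.Join(errs...) // nil when errs is empty
}

// exampleConfig reproduces the values shown in models.yaml above.
func exampleConfig() GatewayConfig {
	return GatewayConfig{
		LiteLLMBaseURL: "http://localhost:4000",
		TimeoutSeconds: 30,
		Tiers: map[string]TierConfig{
			"cheap":    {PrimaryModel: "claude-haiku-4-5-20251001", FallbackChain: []string{"gpt-4o-mini", "ollama/llama3"}},
			"mid":      {PrimaryModel: "claude-sonnet-4-6", FallbackChain: []string{"gpt-4o", "gemini-pro"}},
			"frontier": {PrimaryModel: "claude-opus-4-6", FallbackChain: []string{"gpt-4o", "claude-sonnet-4-6"}},
		},
		CostPerMillionTokens: map[string]Price{
			"claude-haiku-4-5-20251001": {Input: 0.80, Output: 4.00},
			"claude-sonnet-4-6":         {Input: 3.00, Output: 15.00},
			"claude-opus-4-6":           {Input: 15.00, Output: 75.00},
		},
	}
}

func main() {
	cfg := exampleConfig()
	fmt.Println("valid config:", ValidateConfig(cfg))

	delete(cfg.CostPerMillionTokens, "claude-opus-4-6")
	cfg.TimeoutSeconds = 0
	fmt.Println("broken config:", ValidateConfig(cfg))
}
```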
07 — Dependencies

Task Dependency Graph

08 — Risks

Key Risks & Mitigations

| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| Judge model misclassifies task complexity | Medium | Log routing decisions; human override flag; tune with feedback loop |
| LiteLLM proxy adds ~10 ms latency per call | Low | Acceptable for agent workloads; batch mode amortises for bulk |
| DeepSeek pricing varies by cache hit/miss (10×) | Medium | Track cache-hit rate separately; alert on unexpected cost spikes |
| Provider adapter misformats requests | Low | Each adapter has dedicated unit tests; roundtrip integration tests in T9 |
09 — Exit Criteria

Definition of Done · All Met