Epic 3Merged

Model Gateway / Judge-Router

A unified model gateway that routes tasks to the right AI provider based on complexity. A lightweight judge model classifies each task, then routes it to the cheapest capable tier — delivering frontier-quality on complex work while running cheap models on everything else. Target: 60–90% cost savings vs. naive all-frontier routing.

Status

Merged (PR #26)

Features Merged

PRs Open

Routing Tiers

Providers

Cost Savings

60–90%

01 — Architecture

Request Flow

Every task enters through the Gateway interface. The Judge-Router classifies complexity, the config maps tiers to concrete models, the LiteLLM client dispatches via provider adapters, and cost is recorded on every response.

Ctrl/Cmd + wheel to zoom · Scroll to pan · Double-click to fit

Core gateway flow

External providers

Config / cost tracking

02 — Routing Tiers

Three-Tier Model Strategy

The judge model classifies every task into one of three tiers. Each tier has a primary model and a fallback chain configured in config/models.yaml.

Cheap

claude-haiku-4-5

fallback: gpt-4o-mini → ollama/llama3

$0.80 / $4.00 per 1M tokens

Mid

claude-sonnet-4-6

fallback: gpt-4o → gemini-pro

$3.00 / $15.00 per 1M tokens

Frontier

claude-opus-4-6

fallback: gpt-4o → claude-sonnet-4-6

$15.00 / $75.00 per 1M tokens

03 — Lifecycle

Request Lifecycle

Task Arrives

TaskSpec with
description +
metadata

Judge Classifies

Haiku model
returns tier
+ rationale

Config Resolves

Tier maps to
primary model +
fallback chain

Adapter Formats

Provider adapter
translates request
to API shape

LiteLLM Dispatches

POST to
/chat/completions
unified endpoint

Cost Recorded

Model + tokens +
estimated USD
logged per request

04 — Features

Feature Breakdown

Five features cover the full gateway stack. Each has its own spec, branch, and implementation.

Gateway Interface & Types

#32 · feature/32-feature-gateway-interface-types-t1

Shared contracts that all sub-systems build against. Defines Gateway, Router, Completer, and CostTracker interfaces plus all request/response types. No business logic, no I/O — pure type definitions.

internal/gateway/gateway.go
internal/gateway/errors.go

ModelTier enum CompletionRequest CompletionResponse CostRecord TimePeriod

T2 T3

LiteLLM Client & Provider Adapters

#33 · feature/33-feature-litellm-client-provider-adapters-t2-t3

Unified HTTP client for /chat/completions on LiteLLM proxy. Provider adapters translate the canonical request into each API’s format: Anthropic clamps temperature ≤1.0, OpenAI zeros temperature for reasoning models, Ollama strips the ollama/ prefix.

internal/gateway/litellm.go
internal/gateway/providers/provider.go
internal/gateway/providers/anthropic.go
internal/gateway/providers/openai.go
internal/gateway/providers/ollama.go

LiteLLMClient FormatAdapter Functional options Error normalisation

Judge-Router

#34 · feature/34-feature-judge-router-t4

The core routing intelligence. A Haiku-class judge model classifies task complexity in a single cheap call, then routes to the appropriate tier. Supports override tiers for callers that already know the target. Falls back to a configurable default tier on classification failure.

internal/gateway/router.go
internal/gateway/router_test.go

JudgeRouter Classification prompt Default tier fallback Override flag

T5 T6

Config Loader & Fallback Chains

#37 · feature/37-feature-config-loader-fallback-chains-t5-t6

Operator-editable config/models.yaml for tier definitions, provider settings, fallback ordering, and cost tables. FallbackCompleter retries with the next provider in the chain on failure. Validated at startup — no hot-reload.

config/models.yaml
internal/gateway/config.go
internal/gateway/fallback.go

GatewayConfig LoadConfig() ValidateConfig() FallbackCompleter

Cost Tracking

#38 · feature/38-feature-cost-tracking-t7

In-memory cost tracker that logs model, token counts, and estimated USD for every request. Queryable by time period with per-tier aggregation. Thread-safe under concurrent Record calls via sync.RWMutex. Proves the 60–90% savings claim.

internal/gateway/cost.go
internal/gateway/cost_test.go

InMemoryCostTracker EstimateCost() Record() / Report() RWMutex safety

05 — Interfaces

Core Interface Contracts

Interface	Methods	Implementor	Package
`Gateway`	`Route()` `Complete()` `GetCostReport()`	Top-level facade	`internal/gateway`
`Router`	`Route(TaskSpec) (ModelTier, error)`	`JudgeRouter`	`internal/gateway`
`Completer`	`Complete(CompletionRequest) (CompletionResponse, error)`	`LiteLLMClient`, `FallbackCompleter`	`internal/gateway`
`CostTracker`	`Record(CostRecord)` `Report(TimePeriod)`	`InMemoryCostTracker`	`internal/gateway`
`FormatAdapter`	`Name()` `FormatRequest()` `ParseModelName()`	`Anthropic`, `OpenAI`, `Ollama`	`internal/gateway/providers`

06 — Configuration

models.yaml Structure

gateway: litellm_base_url: "http://localhost:4000" timeout_seconds: 30 tiers: cheap: primary_model: "claude-haiku-4-5-20251001" fallback_chain: ["gpt-4o-mini", "ollama/llama3"] mid: primary_model: "claude-sonnet-4-6" fallback_chain: ["gpt-4o", "gemini-pro"] frontier: primary_model: "claude-opus-4-6" fallback_chain: ["gpt-4o", "claude-sonnet-4-6"] providers: # API endpoints per vendor anthropic: { base_url: "https://api.anthropic.com" } openai: { base_url: "https://api.openai.com" } ollama: { base_url: "http://localhost:11434" } cost_per_million_tokens: # pricing table for EstimateCost() claude-haiku-4-5-20251001: { input: 0.80, output: 4.00 } claude-sonnet-4-6: { input: 3.00, output: 15.00 } claude-opus-4-6: { input: 15.00, output: 75.00 }

07 — Dependencies

Task Dependency Graph

Ctrl/Cmd + wheel to zoom · Scroll to pan · Double-click to fit

08 — Risks

Key Risks & Mitigations

Risk	Likelihood	Mitigation
Judge model misclassifies task complexity	Medium	Log routing decisions; human override flag; tune with feedback loop
LiteLLM proxy adds ~10ms latency per call	Low	Acceptable for agent workloads; batch mode amortises for bulk
DeepSeek pricing varies by cache hit/miss (10×)	Medium	Track cache-hit rate separately; alert on unexpected cost spikes
Provider adapter misformats requests	Low	Each adapter has dedicated unit tests; roundtrip integration tests in T9

09 — Exit Criteria

Definition of Done All Met

✓ All Must-Have acceptance scenarios pass in CI
✓ No regressions on Epic 1 (orchestrator core) features
✓ Judge-Router correctly classifies at least 3 task complexity levels
✓ Fallback chains recover from single-provider outage
✓ Cost report shows measurable savings vs. all-frontier baseline
✓ Provider adapters tested for Anthropic, OpenAI, and Ollama formats
✓ Config validated at startup — invalid YAML fails fast with clear error