Judge-Router (T4)

Task Complexity Classification — Model Gateway Feature • PR #47 • Issue #34

22 tests pass • 100% branch coverage • 5 council reviews • 166 lines (router.go) • 60–90% cost savings target • 0 external deps

01 Architecture

The JudgeRouter wraps a Completer interface and classifies every incoming TaskSpec into one of three tiers. Functional options configure every field at construction time; the zero value is never used directly.
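The types named in this section can be sketched roughly as follows. This is a minimal illustration rather than the contents of router.go: the exact field sets, the `TierCheap` constant name (assumed by analogy with `TierMid` and `TierFrontier`), the `ID` and `Prompt` fields, and the `(RoutingDecision, error)` return shape of `Classify` are all assumptions inferred from the diagrams and test descriptions.

```go
package main

import (
	"context"
	"fmt"
)

// Tier is the routing tier; the string values mirror the tier names used here.
type Tier string

const (
	TierCheap    Tier = "cheap" // constant name is an assumption
	TierMid      Tier = "mid"
	TierFrontier Tier = "frontier"
)

// Valid reports whether t is one of the three known tiers.
func (t Tier) Valid() bool {
	switch t {
	case TierCheap, TierMid, TierFrontier:
		return true
	}
	return false
}

// TaskSpec is the classification input.
type TaskSpec struct {
	ID           string // hypothetical; the tests mention a task_id log field
	Description  string
	OverrideTier Tier
}

// RoutingDecision is the classification output.
type RoutingDecision struct {
	Tier        Tier
	Model       string
	Reason      string
	RawResponse string
}

// CompletionRequest and CompletionResponse carry the judge call.
type CompletionRequest struct {
	Model       string
	Prompt      string // hypothetical field name
	MaxTokens   int
	Temperature float64
}

type CompletionResponse struct {
	Model   string
	Content string
}

// Completer is the interface the router wraps.
type Completer interface {
	Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error)
}

// Router is the interface JudgeRouter is asserted against at compile time.
type Router interface {
	Classify(ctx context.Context, task TaskSpec) (RoutingDecision, error)
}

func main() {
	fmt.Println(TierFrontier.Valid(), Tier("bogus").Valid()) // prints "true false"
}
```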

flowchart TD
    A([TaskSpec input]) --> B{OverrideTier set\nand valid?}
    B -- yes --> C[slog.Info\noverride]
    C --> RD1([RoutingDecision\nTier=override\nRawResponse=""])
    B -- invalid override --> WARN1[slog.Warn\ninvalid tier ignored]
    WARN1 --> B2{completer\n== nil?}
    B -- no override --> B2
    B2 -- yes --> C2[slog.Warn\nno completer]
    C2 --> RD2([RoutingDecision\nTier=defaultTier\nRawResponse=""])
    B2 -- no --> B3{prompt has\nexactly one %s\nno other verbs?}
    B3 -- no --> C3[slog.Warn\nbad prompt]
    C3 --> RD3([RoutingDecision\nTier=defaultTier\nRawResponse=""])
    B3 -- yes --> D[fmt.Sprintf\nprompt + task.Description]
    D --> E[completer.Complete\nMaxTokens=10\nTemperature=0]
    E -- error --> F[slog.Warn\njudge call failed]
    F --> RD4([RoutingDecision\nTier=defaultTier\nRawResponse=""])
    E -- ok --> G[parseTier\nTrimSpace+ToLower]
    G -- valid tier --> H[slog.Info\nclassified]
    H --> RD5([RoutingDecision\nTier=tier\nModel=resp.Model\nRawResponse=resp.Content])
    G -- unrecognised --> I[slog.Warn\nunrecognised response]
    I --> RD6([RoutingDecision\nTier=defaultTier\nRawResponse=resp.Content])
    subgraph opts [Functional Options]
        direction TB
        O1[WithJudgeModel]
        O2[WithDefaultTier]
        O3[WithCompleter]
        O4[WithClassificationPrompt]
        O5[WithLogger]
    end
    opts -.-> D
    opts -.-> B2
    style RD1 fill:#065f46,stroke:#059669
    style RD5 fill:#065f46,stroke:#059669
    style RD2 fill:#92400e,stroke:#d97706
    style RD3 fill:#92400e,stroke:#d97706
    style RD4 fill:#92400e,stroke:#d97706
    style RD6 fill:#92400e,stroke:#d97706
    style opts fill:#1e293b,stroke:#334155

Success paths (green)

  • OverrideTier valid → immediate return, zero completer calls
  • Judge classified successfully → RawResponse populated with raw model output

Fallback paths (amber)

  • Nil completer, bad prompt, judge error, unrecognised response
  • All fallbacks return defaultTier with logged reason — never a non-nil error
  • RawResponse empty except on unrecognised response (raw garbage preserved for debugging)

02 Routing Decision Flow

Sequence view of the caller–router–judge interaction across three routing outcomes. The completer is never called on the override path (Path A); it is called once on both the normal classification path (Path B) and the failure fallback path (Path C).

sequenceDiagram
    autonumber
    participant C as Caller
    participant R as JudgeRouter
    participant J as Completer (judge)

    Note over C,R: Path A — OverrideTier set
    C->>R: Classify(ctx, TaskSpec{OverrideTier: frontier})
    R->>R: task.OverrideTier.Valid() == true
    R-->>C: RoutingDecision{Tier: frontier, Reason: "override: ...", RawResponse: ""}

    Note over C,R: Path B — Normal classification
    C->>R: Classify(ctx, TaskSpec{Description: "Design a distributed cache"})
    R->>R: no override, completer != nil, prompt valid
    R->>J: Complete(ctx, {Model: haiku, MaxTokens: 10, Temperature: 0})
    J-->>R: CompletionResponse{Content: "frontier"}
    R->>R: parseTier("frontier") == TierFrontier
    R-->>C: RoutingDecision{Tier: frontier, Model: haiku, Reason: "judge classified as frontier", RawResponse: "frontier"}

    Note over C,R: Path C — Classification failure fallback
    C->>R: Classify(ctx, TaskSpec{Description: "do something"})
    R->>J: Complete(ctx, judgeRequest)
    J-->>R: error("network timeout")
    R->>R: slog.Warn + fallback to defaultTier
    R-->>C: RoutingDecision{Tier: mid, Reason: "judge call failed (network timeout); falling back...", RawResponse: ""}

cheap

Haiku-class models / Claude Haiku 4.5

  • Simple lookups and formatting
  • Summarizing short text
  • Straightforward code fixes
  • ~$0.0003 per 1K output tokens

mid

Sonnet-class models / Claude Sonnet 4.6

  • Moderate analysis tasks
  • Code generation
  • Multi-step reasoning
  • Default fallback tier

frontier

Opus-class models / Claude Opus 4

  • Complex architecture decisions
  • Novel problem-solving
  • Long-form creative work
  • Maximum reasoning depth

03 Functional Options

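All JudgeRouter fields are configurable via the options pattern. NewJudgeRouter() with no arguments produces a router that is safe to call, but because the default completer is nil it returns the default tier for every task that does not carry a valid OverrideTier. Pass WithCompleter to enable judge classification.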

WithJudgeModel

router.go:38–40

Sets the model sent as Model in every classification CompletionRequest. The judge model should be fast and cheap: a Haiku-class model keeps the overhead to a fraction of a cent per call.

Default: "claude-haiku-4-5-20251001"

WithDefaultTier

router.go:44–46

Sets the fallback tier returned on all failure paths: nil completer, bad prompt, judge error, or unrecognised response. Operators can set this to TierFrontier to fail safe on the high-quality side.

Default: TierMid

WithCompleter

router.go:49–51

Injects the Completer interface used for judge calls. Accepts any implementation: LiteLLM client, mock, or direct provider adapter. If nil at Classify time, the router falls back gracefully rather than panicking.

Default: nil (no-op fallback)

WithClassificationPrompt

router.go:55–57

Overrides the built-in prompt template. The template must contain exactly one %s verb and no other format verbs; a literal percent sign must be written as %%. The task description is injected via fmt.Sprintf. This allows prompt tuning without editing the router itself.

Default: built-in three-tier prompt with %s

WithLogger

router.go:61–63

Injects a *slog.Logger for all structured routing log output. Avoids mutating the global slog.Default() logger, making routers safe to use in parallel and in tests with captured log output.

Default: slog.Default()
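The options above follow the usual Go functional-options shape. The sketch below shows three of the five to keep it short; `WithCompleter` and `WithClassificationPrompt` would look the same. The unexported field names and the `Option` type name are assumptions, and the defaults are the ones listed in this section.

```go
package main

import (
	"fmt"
	"log/slog"
)

type Tier string

const (
	TierMid      Tier = "mid"
	TierFrontier Tier = "frontier"
)

// JudgeRouter holds the configurable fields; field names are assumptions.
type JudgeRouter struct {
	judgeModel  string
	defaultTier Tier
	logger      *slog.Logger
}

// Option mutates a router while it is being constructed.
type Option func(*JudgeRouter)

// WithJudgeModel sets the model used for every judge call.
func WithJudgeModel(model string) Option {
	return func(r *JudgeRouter) { r.judgeModel = model }
}

// WithDefaultTier sets the tier returned on every fallback path.
func WithDefaultTier(t Tier) Option {
	return func(r *JudgeRouter) { r.defaultTier = t }
}

// WithLogger scopes structured log output to this router instance.
func WithLogger(l *slog.Logger) Option {
	return func(r *JudgeRouter) { r.logger = l }
}

// NewJudgeRouter applies the documented defaults first, then each option in order.
func NewJudgeRouter(opts ...Option) *JudgeRouter {
	r := &JudgeRouter{
		judgeModel:  "claude-haiku-4-5-20251001",
		defaultTier: TierMid,
		logger:      slog.Default(),
	}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

func main() {
	def := NewJudgeRouter()
	failSafe := NewJudgeRouter(WithDefaultTier(TierFrontier))
	fmt.Println(def.defaultTier, failSafe.defaultTier) // prints "mid frontier"
}
```

Because defaults are applied before the options run, a caller only names the fields it wants to change.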

Built-in Classification Prompt

You are a task complexity classifier. Classify the following task into exactly one of three tiers. Respond with ONLY one word — no punctuation, no explanation:
- "cheap" — simple lookups, formatting, summarizing short text, straightforward code fixes
- "mid" — moderate analysis, code generation, multi-step reasoning
- "frontier" — complex architecture, novel problem-solving, long-form creative work
Task: %s

MaxTokens=10 and Temperature=0 keep the judge's reply to a single word: the token cap cuts off verbose or hallucinated output, and zero temperature makes the answer as repeatable as the provider allows.
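The judge's one-word reply still has to be normalised before it can be compared against the tier names. A minimal sketch of the TrimSpace + ToLower step attributed to parseTier follows; returning a `Tier` that is then checked with `Valid()` is an assumption about the real signature, as is the `TierCheap` constant name.

```go
package main

import (
	"fmt"
	"strings"
)

type Tier string

const (
	TierCheap    Tier = "cheap" // constant name is an assumption
	TierMid      Tier = "mid"
	TierFrontier Tier = "frontier"
)

// Valid reports whether t is one of the three known tiers.
func (t Tier) Valid() bool {
	return t == TierCheap || t == TierMid || t == TierFrontier
}

// parseTier trims surrounding whitespace and lower-cases the reply.
// Unknown or empty input yields a Tier for which Valid() is false.
func parseTier(raw string) Tier {
	return Tier(strings.ToLower(strings.TrimSpace(raw)))
}

func main() {
	for _, raw := range []string{"cheap", "CHEAP", "  frontier  ", "bananas", ""} {
		t := parseTier(raw)
		fmt.Printf("%q -> %q valid=%v\n", raw, t, t.Valid())
	}
}
```

Only the normalised copy is compared; the caller can still store the untouched reply in RawResponse, which is why the tests expect the original case and whitespace to be preserved there.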

04 Test Coverage

22 test cases across the 8 test functions listed below cover every code path in router.go. The tests are table-driven and use a mockCompleter that records the call count and the last request.

| Test function / case | What it tests | Key assertions | Lines |
|---|---|---|---|
| TestJudgeRouter_Classify | Table-driven main suite; the 11 sub-cases are listed in the rows below | | |
| override tier skips classification | Valid OverrideTier bypasses completer | tier=frontier, calls=0, reason contains "override", RawResponse="" | 37–49 |
| judge returns cheap | Normal cheap classification path | tier=cheap, calls=1, reason "classified as cheap", RawResponse="cheap" | 50–60 |
| judge returns mid | Normal mid classification path | tier=mid, calls=1, reason "classified as mid", RawResponse="mid" | 61–71 |
| judge returns frontier | Normal frontier classification path | tier=frontier, calls=1, reason "classified as frontier", RawResponse="frontier" | 72–82 |
| uppercase CHEAP | Case-insensitive parseTier via full Classify path | tier=cheap, RawResponse="CHEAP" (original case preserved) | 83–93 |
| garbage response fallback | Unrecognised response falls back to defaultTier | tier=mid, reason "unrecognised response" + "bananas", RawResponse="bananas" | 94–104 |
| completer error fallback | Network error falls back gracefully | no error returned, tier=mid, reason "network timeout", RawResponse="" | 105–115 |
| HTTP 500 error fallback | Provider 500 error falls back gracefully | no error returned, tier=mid, reason "HTTP 500 Internal Server Error" | 116–126 |
| empty string response | Empty content treated as unrecognised, no panic | tier=mid, reason "falling back to default tier", RawResponse="" | 127–137 |
| whitespace-padded frontier | TrimSpace normalisation in parseTier | tier=frontier, RawResponse="  frontier  " (original whitespace preserved) | 138–148 |
| override cheap on complex task | OverrideTier=cheap beats complexity of description | tier=cheap, calls=0, reason "override", RawResponse="" | 149–161 |
| TestJudgeRouter_ParseTierCaseInsensitive | Unit tests for parseTier directly (7 cases) | cheap/CHEAP/mid(padded)/Frontier/FRONTIER → correct tier; unknown/"" → invalid | 197–216 |
| TestJudgeRouter_NoCompleterUsesDefault | Nil completer path; no panic | tier=cheap (custom default), RawResponse="", reason "no completer configured" | 218–233 |
| TestJudgeRouter_MissingFormatVerb | Prompt with no %s rejected before completer call | tier=mid, calls=0, reason "exactly one %s verb", RawResponse="" | 235–264 |
| TestJudgeRouter_ExcessFormatVerbs | Prompt with two %s verbs rejected | tier=mid, calls=0, reason "exactly one %s verb and no other format verbs" | 266–295 |
| TestJudgeRouter_NonSFormatVerb | Prompt with %s + %d rejected (non-%s verb) | tier=mid, calls=0, reason "no other format verbs" | 297–326 |
| TestJudgeRouter_InvalidOverrideTier | Invalid OverrideTier ("bogus") warns and falls through to classification | tier=cheap (from judge), calls=1, warn log contains "invalid override tier ignored" + task_id + tier=bogus | 328–378 |
| TestJudgeRouter_CustomClassificationPrompt | WithClassificationPrompt overrides built-in prompt | completer receives prompt containing "Rate this: build a spaceship" | 380–406 |
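The `calls=N` assertions in the table rely on a mock that counts invocations and remembers the last request. A sketch of such a mock, driven by a small table, is shown below. The field names and request/response shapes are assumptions, and the loop runs in `main` rather than under `go test` only so the example is runnable on its own.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Request and response shapes are trimmed to what the mock needs (assumed fields).
type CompletionRequest struct {
	Model     string
	MaxTokens int
}

type CompletionResponse struct {
	Content string
}

// mockCompleter returns a canned reply or error and records how it was called.
type mockCompleter struct {
	resp    CompletionResponse
	err     error
	calls   int
	lastReq CompletionRequest
}

// Complete increments the call counter and captures the request.
func (m *mockCompleter) Complete(_ context.Context, req CompletionRequest) (CompletionResponse, error) {
	m.calls++
	m.lastReq = req
	return m.resp, m.err
}

func main() {
	cases := []struct {
		name string
		mock *mockCompleter
	}{
		{"judge returns cheap", &mockCompleter{resp: CompletionResponse{Content: "cheap"}}},
		{"completer error", &mockCompleter{err: errors.New("network timeout")}},
	}
	for _, tc := range cases {
		resp, err := tc.mock.Complete(context.Background(), CompletionRequest{MaxTokens: 10})
		fmt.Printf("%s: content=%q err=%v calls=%d maxTokens=%d\n",
			tc.name, resp.Content, err, tc.mock.calls, tc.mock.lastReq.MaxTokens)
	}
}
```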

05 Council Review History

Five adversarial council review rounds were run before the feature reached a FOR verdict. Each round consisted of ADVOCATE, CRITIC, QUESTIONER, and ARBITER roles per the council protocol.

Council 1 • CONDITIONAL • Initial implementation review

First review of the JudgeRouter implementation. Three required fixes identified before FOR could be issued.

  • Add prompt %s validation: reject prompts missing the format verb to prevent silent task description omission
  • Add TestJudgeRouter_MissingFormatVerb test case to cover the new guard
  • Slog housekeeping: ensure all log calls use log/slog structured logging; no log.Printf remnants

Council 2 • CONDITIONAL • Prompt validation strengthening

The Council 1 fix for %s validation was deemed insufficient — it only detected missing verbs but not wrong-type verbs.

  • Strengthen the %s guard: reject prompts containing non-%s format verbs (e.g., %d, %v) in addition to the missing-verb check
  • The validation must require exactly one %s and exactly one % character in total, both counted after stripping %% escape sequences

Council 3 • CONDITIONAL • Non-%s verb rejection + OverrideTier test

Implementation updated with strengthened validation. Two remaining conditions identified.

  • Implement and test the non-%s format verb rejection path (TestJudgeRouter_NonSFormatVerb)
  • Add TestJudgeRouter_InvalidOverrideTier test: invalid OverrideTier must warn and fall through to classification, not silently use default tier

Council 4 • CONDITIONAL • Warn-leg and RawResponse assertions

NonSFormatVerb test and InvalidOverrideTier test added. Final condition: existing tests lacked assertions on log output and RawResponse in warn paths.

  • Add warn-leg assertions to TestJudgeRouter_InvalidOverrideTier: verify log output contains "invalid override tier ignored", task_id=iot1, tier=bogus
  • Add RawResponse assertion in the InvalidOverrideTier test (expect "cheap" since judge classifies successfully after warn)

Council 5 • FOR • All conditions satisfied

The conditions from all four earlier rounds were satisfied and no new issues were found. Follow-on items were rolled into the implementation without blocking the FOR verdict.

  • Warn-leg assertions present and passing in TestJudgeRouter_InvalidOverrideTier
  • RawResponse assertions complete across all test cases
  • All 22 tests pass with go test ./internal/gateway/...
  • Compile-time Router interface assertion present: var _ Router = (*JudgeRouter)(nil)

06 Key Design Decisions

Four architectural choices that shaped this implementation, each with the rationale that was validated through the council review process.

Judge-then-work pattern

Motivation: 60–90% cost savings

Without intelligent routing, every task goes to a frontier model (Opus-class). A Haiku-class judge call costs a fraction of a cent and adds ~200ms p50 latency — negligible against the work model call. By classifying before routing, simple tasks (formatting, lookups, short summaries) never touch expensive models. The OverrideTier fast path eliminates judge overhead for callers that already know the tier, covering latency-critical paths.
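As a rough worked example of how routing could land in the 60–90% range: every number below is a made-up assumption chosen for easy arithmetic, not measured traffic or real pricing. Suppose 60% of tasks classify as cheap, 30% as mid, and 10% as frontier, with relative per-task costs of 1, 3, and 15 units, and ignore the small judge-call overhead.

```go
package main

import "fmt"

// tierMix describes one tier's share of traffic and its relative price.
// Both values are illustrative assumptions.
type tierMix struct {
	name       string
	sharePct   int // percentage of tasks routed to this tier
	priceUnits int // relative cost per task, in arbitrary units
}

// savingsPct returns the percentage saved compared with sending every task
// to a tier that costs baselineUnits per task. Judge overhead is ignored.
func savingsPct(mix []tierMix, baselineUnits int) int {
	blended := 0 // cost of 100 tasks under the routed mix
	for _, m := range mix {
		blended += m.sharePct * m.priceUnits
	}
	baseline := 100 * baselineUnits // cost of 100 tasks at the baseline tier
	return 100 - blended*100/baseline
}

func main() {
	mix := []tierMix{
		{"cheap", 60, 1},
		{"mid", 30, 3},
		{"frontier", 10, 15},
	}
	// 60*1 + 30*3 + 10*15 = 300 units versus 1500 units all-frontier.
	fmt.Printf("savings vs all-frontier: %d%%\n", savingsPct(mix, 15)) // prints "savings vs all-frontier: 80%"
}
```

A workload with a larger frontier share would save less, which is why the figure is stated as a target range rather than a guarantee.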

Functional options over config struct

Go idiom: accept interfaces, return structs

A flat config struct would require every caller to specify all fields or use a zero-value struct with surprising defaults. Functional options make defaults explicit and self-documenting at the call site. Adding a new option (e.g., WithTimeout) is additive and backwards-compatible; no existing callers break. The pattern also enables per-test configuration without global state mutation.

WithLogger over slog.SetDefault

Test isolation + production safety

Calling slog.SetDefault in a constructor mutates global state, breaking test parallelism and leaking log output across router instances. WithLogger scopes the logger to a single router instance. Tests can inject a slog.New(slog.NewTextHandler(&buf, ...)) to capture and assert on specific log messages without affecting other test cases or the default logger.
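A sketch of the capture technique described above, using only the standard library. The `router` type and its `warnInvalidOverride` method are hypothetical stand-ins for JudgeRouter and its warn path, and `logHas` is a helper invented for this example. The log message and attribute keys are taken from the test descriptions.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"strings"
)

// router is a stand-in for JudgeRouter that carries only the injected logger.
type router struct {
	logger *slog.Logger
}

// warnInvalidOverride mimics the warn leg for an invalid OverrideTier.
func (r *router) warnInvalidOverride(taskID, tier string) {
	r.logger.Warn("invalid override tier ignored", "task_id", taskID, "tier", tier)
}

// captureWarn builds a router whose logger writes into a private buffer,
// triggers the warning, and returns the captured text.
func captureWarn() string {
	var buf bytes.Buffer
	r := &router{logger: slog.New(slog.NewTextHandler(&buf, nil))}
	r.warnInvalidOverride("iot1", "bogus")
	return buf.String()
}

// logHas reports whether out contains every expected fragment.
func logHas(out string, parts ...string) bool {
	for _, p := range parts {
		if !strings.Contains(out, p) {
			return false
		}
	}
	return true
}

func main() {
	out := captureWarn()
	fmt.Println(logHas(out, "invalid override tier ignored", "task_id=iot1", "tier=bogus")) // prints "true"
}
```

Each router owns its buffer, so parallel tests cannot see each other's log lines and slog.Default() is never touched.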

%s format verb validation

Defence against operator misconfiguration

If an operator supplies a prompt via WithClassificationPrompt without a %s verb, fmt.Sprintf silently produces a prompt with no task description — the judge model classifies an empty context. The guard checks that the prompt contains exactly one %s and no other format verbs (after stripping %% escapes). This is validated before any completer call so bad configuration is caught immediately, not silently absorbed.
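One way to express this guard is to strip `%%` escapes first and then count. The sketch below is an assumption about how the check could be written, not the code in router.go, and `validPrompt` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// validPrompt reports whether tmpl contains exactly one %s verb and no other
// format verbs. Escaped percent signs (%%) are removed before counting.
func validPrompt(tmpl string) bool {
	stripped := strings.ReplaceAll(tmpl, "%%", "")
	return strings.Count(stripped, "%s") == 1 && strings.Count(stripped, "%") == 1
}

func main() {
	prompts := []string{
		"Rate this: %s",   // one %s: accepted
		"No verb here",    // missing %s: rejected
		"%s and %s",       // two %s verbs: rejected
		"%s has %d parts", // extra non-%s verb: rejected
		"100%% sure: %s",  // escaped percent plus one %s: accepted
	}
	for _, p := range prompts {
		fmt.Printf("%q valid=%v\n", p, validPrompt(p))
	}
}
```

Counting total `%` characters as well as `%s` occurrences is what catches verbs such as `%d` or `%v`, which a missing-verb check alone would let through.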

Data Flow: Classify Path with RawResponse Population

Classify(ctx, task)
  // Fast path
  if task.OverrideTier != "" && task.OverrideTier.Valid()
    slog.InfoContext → return {Tier: override, RawResponse: ""}
  elif task.OverrideTier != ""
    slog.WarnContext → warn: invalid tier, continue to classification

  // Guard rails
  if r.completer == nil
    slog.WarnContext → return {Tier: defaultTier, RawResponse: ""}
  if prompt %s count != 1 || total % count != 1
    slog.WarnContext → return {Tier: defaultTier, RawResponse: ""}

  // Judge call
  prompt := fmt.Sprintf(classificationPrompt, task.Description)
  resp, err := r.completer.Complete(ctx, {MaxTokens: 10, Temperature: 0})
  if err != nil
    slog.WarnContext → return {Tier: defaultTier, RawResponse: ""}

  // Parse + return
  tier := parseTier(resp.Content) // TrimSpace + ToLower
  if !tier.Valid()
    slog.WarnContext → return {Tier: defaultTier, RawResponse: resp.Content} // raw preserved
  slog.InfoContext → return {Tier: tier, Model: resp.Model, RawResponse: resp.Content}

  // Note: Classify NEVER returns non-nil error
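The data flow above can be turned into a compact runnable Go sketch. It is an approximation under stated assumptions, not the 166-line router.go: the struct fields, the `Prompt` request field, the `fallback`, `stubCompleter`, and `classify` helpers, the short "Task: %s" template, and the exact reason strings are all illustrative.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"strings"
)

type Tier string

const TierCheap, TierMid, TierFrontier Tier = "cheap", "mid", "frontier"

func (t Tier) Valid() bool { return t == TierCheap || t == TierMid || t == TierFrontier }

type (
	TaskSpec struct {
		Description  string
		OverrideTier Tier
	}
	RoutingDecision struct {
		Tier                       Tier
		Model, Reason, RawResponse string
	}
	CompletionRequest struct {
		Model, Prompt string // Prompt is an assumed field name
		MaxTokens     int
		Temperature   float64
	}
	CompletionResponse struct{ Model, Content string }
	Completer          interface {
		Complete(context.Context, CompletionRequest) (CompletionResponse, error)
	}
	JudgeRouter struct {
		completer          Completer
		judgeModel, prompt string
		defaultTier        Tier
		logger             *slog.Logger
	}
)

// fallback logs the reason and returns defaultTier; the error is always nil.
func (r *JudgeRouter) fallback(ctx context.Context, reason, raw string) (RoutingDecision, error) {
	r.logger.WarnContext(ctx, reason)
	return RoutingDecision{Tier: r.defaultTier, Reason: reason, RawResponse: raw}, nil
}

// Classify follows the data flow: fast path, guard rails, judge call, parse.
func (r *JudgeRouter) Classify(ctx context.Context, task TaskSpec) (RoutingDecision, error) {
	if task.OverrideTier.Valid() {
		r.logger.InfoContext(ctx, "override", "tier", string(task.OverrideTier))
		return RoutingDecision{Tier: task.OverrideTier, Reason: "override"}, nil
	}
	if task.OverrideTier != "" {
		r.logger.WarnContext(ctx, "invalid override tier ignored", "tier", string(task.OverrideTier))
	}
	if r.completer == nil {
		return r.fallback(ctx, "no completer configured", "")
	}
	stripped := strings.ReplaceAll(r.prompt, "%%", "")
	if strings.Count(stripped, "%s") != 1 || strings.Count(stripped, "%") != 1 {
		return r.fallback(ctx, "prompt needs exactly one %s verb and no other format verbs", "")
	}
	req := CompletionRequest{Model: r.judgeModel, MaxTokens: 10, Temperature: 0}
	req.Prompt = fmt.Sprintf(r.prompt, task.Description)
	resp, err := r.completer.Complete(ctx, req)
	if err != nil {
		return r.fallback(ctx, fmt.Sprintf("judge call failed (%v); falling back", err), "")
	}
	tier := Tier(strings.ToLower(strings.TrimSpace(resp.Content)))
	if !tier.Valid() {
		return r.fallback(ctx, fmt.Sprintf("unrecognised response %q; falling back", resp.Content), resp.Content)
	}
	r.logger.InfoContext(ctx, "classified", "tier", string(tier))
	return RoutingDecision{Tier: tier, Model: resp.Model, Reason: "judge classified as " + string(tier), RawResponse: resp.Content}, nil
}

// stubCompleter returns a canned reply or error in place of a real client.
type stubCompleter struct {
	content string
	err     error
}

func (s stubCompleter) Complete(context.Context, CompletionRequest) (CompletionResponse, error) {
	return CompletionResponse{Model: "stub-judge", Content: s.content}, s.err
}

// classify runs one task through a router built with the documented defaults.
func classify(c Completer, task TaskSpec) (RoutingDecision, error) {
	r := &JudgeRouter{completer: c, judgeModel: "claude-haiku-4-5-20251001",
		prompt: "Task: %s", defaultTier: TierMid, logger: slog.Default()}
	return r.Classify(context.Background(), task)
}

func main() {
	ok, _ := classify(stubCompleter{content: " Frontier "}, TaskSpec{Description: "Design a distributed cache"})
	bad, err := classify(stubCompleter{err: errors.New("network timeout")}, TaskSpec{Description: "do something"})
	fmt.Println(ok.Tier, bad.Tier, err) // prints "frontier mid <nil>"
}
```

Funnelling every failure through one helper makes the "never a non-nil error" rule hard to break by accident, since each fallback path returns through the same line.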