Judge-Router (T4): Task Complexity Classification

01 Architecture

The JudgeRouter wraps a Completer interface and classifies every incoming TaskSpec into one of three tiers. Functional options configure every field at construction time; the zero value is never used directly.

flowchart TD
  A([TaskSpec input]) --> B{OverrideTier set\nand valid?}
  B -- yes --> C[slog.Info\noverride]
  C --> RD1([RoutingDecision\nTier=override\nRawResponse=""])
  B -- invalid override --> WARN1[slog.Warn\ninvalid tier ignored]
  WARN1 --> B2{completer\n== nil?}
  B -- no override --> B2
  B2 -- yes --> C2[slog.Warn\nno completer]
  C2 --> RD2([RoutingDecision\nTier=defaultTier\nRawResponse=""])
  B2 -- no --> B3{prompt has\nexactly one %s\nno other verbs?}
  B3 -- no --> C3[slog.Warn\nbad prompt]
  C3 --> RD3([RoutingDecision\nTier=defaultTier\nRawResponse=""])
  B3 -- yes --> D[fmt.Sprintf\nprompt + task.Description]
  D --> E[completer.Complete\nMaxTokens=10\nTemperature=0]
  E -- error --> F[slog.Warn\njudge call failed]
  F --> RD4([RoutingDecision\nTier=defaultTier\nRawResponse=""])
  E -- ok --> G[parseTier\nTrimSpace+ToLower]
  G -- valid tier --> H[slog.Info\nclassified]
  H --> RD5([RoutingDecision\nTier=tier\nModel=resp.Model\nRawResponse=resp.Content])
  G -- unrecognised --> I[slog.Warn\nunrecognised response]
  I --> RD6([RoutingDecision\nTier=defaultTier\nRawResponse=resp.Content])
  subgraph opts [Functional Options]
    direction TB
    O1[WithJudgeModel]
    O2[WithDefaultTier]
    O3[WithCompleter]
    O4[WithClassificationPrompt]
    O5[WithLogger]
  end
  opts -.-> D
  opts -.-> B2
  style RD1 fill:#065f46,stroke:#059669
  style RD5 fill:#065f46,stroke:#059669
  style RD2 fill:#92400e,stroke:#d97706
  style RD3 fill:#92400e,stroke:#d97706
  style RD4 fill:#92400e,stroke:#d97706
  style RD6 fill:#92400e,stroke:#d97706
  style opts fill:#1e293b,stroke:#334155

Success paths (green)

OverrideTier valid → immediate return, zero completer calls
Judge classified successfully → RawResponse populated with raw model output

Fallback paths (amber)

Nil completer, bad prompt, judge error, unrecognised response
All fallbacks return defaultTier with a logged reason and never a non-nil error
RawResponse empty except on unrecognised response (raw garbage preserved for debugging)
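
The types named in the diagram can be sketched roughly as follows. The source names TaskSpec, RoutingDecision, Completer, CompletionRequest and CompletionResponse but does not show their definitions, so the field sets, the TaskSpec.ID field, the Prompt field and the Complete signature below are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// Tier is the routing tier. The three values match the tiers in this document.
type Tier string

const (
	TierCheap    Tier = "cheap"
	TierMid      Tier = "mid"
	TierFrontier Tier = "frontier"
)

// Valid reports whether t is one of the three known tiers.
func (t Tier) Valid() bool {
	switch t {
	case TierCheap, TierMid, TierFrontier:
		return true
	}
	return false
}

// TaskSpec describes one incoming task (field set assumed).
type TaskSpec struct {
	ID           string
	Description  string
	OverrideTier Tier
}

// RoutingDecision is returned on every path, success or fallback.
type RoutingDecision struct {
	Tier        Tier
	Model       string
	Reason      string
	RawResponse string
}

// CompletionRequest and CompletionResponse are assumed shapes for the judge call.
type CompletionRequest struct {
	Model       string
	Prompt      string
	MaxTokens   int
	Temperature float64
}

type CompletionResponse struct {
	Model   string
	Content string
}

// Completer is the interface the router wraps (signature assumed).
type Completer interface {
	Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error)
}

func main() {
	for _, t := range []Tier{"cheap", "frontier", "bogus", ""} {
		fmt.Printf("%q valid=%v\n", t, t.Valid())
	}
}
```

An empty or unknown Tier reports Valid() == false, which is what lets an unset or invalid OverrideTier fall through to classification.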

02 Routing Decision Flow

Sequence view showing the caller–router–judge interaction across the three routing outcomes. The completer is only called on the normal classification path.

sequenceDiagram
  autonumber
  participant C as Caller
  participant R as JudgeRouter
  participant J as Completer (judge)
  Note over C,R: Path A — OverrideTier set
  C->>R: Classify(ctx, TaskSpec{OverrideTier: frontier})
  R->>R: task.OverrideTier.Valid() == true
  R-->>C: RoutingDecision{Tier: frontier, Reason: "override: ...", RawResponse: ""}
  Note over C,R: Path B — Normal classification
  C->>R: Classify(ctx, TaskSpec{Description: "Design a distributed cache"})
  R->>R: no override, completer != nil, prompt valid
  R->>J: Complete(ctx, {Model: haiku, MaxTokens: 10, Temperature: 0})
  J-->>R: CompletionResponse{Content: "frontier"}
  R->>R: parseTier("frontier") == TierFrontier
  R-->>C: RoutingDecision{Tier: frontier, Model: haiku, Reason: "judge classified as frontier", RawResponse: "frontier"}
  Note over C,R: Path C — Classification failure fallback
  C->>R: Classify(ctx, TaskSpec{Description: "do something"})
  R->>J: Complete(ctx, judgeRequest)
  J-->>R: error("network timeout")
  R->>R: slog.Warn + fallback to defaultTier
  R-->>C: RoutingDecision{Tier: mid, Reason: "judge call failed (network timeout); falling back...", RawResponse: ""}

cheap

Haiku-class models / Claude Haiku 4.5

Simple lookups and formatting
Summarizing short text
Straightforward code fixes
~$0.0003 per 1K output tokens

mid

Sonnet-class models / Claude Sonnet 4.6

Moderate analysis tasks
Code generation
Multi-step reasoning
Default fallback tier

frontier

Opus-class models / Claude Opus 4

Complex architecture decisions
Novel problem-solving
Long-form creative work
Maximum reasoning depth
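
Once a tier has been chosen, the caller still has to pick a work model for it. A minimal sketch of that lookup follows, assuming a simple map from tier to model ID. The map, the modelFor helper and the mid and frontier IDs are hypothetical placeholders, and using the judge's default ID for the cheap tier is also an assumption.

```go
package main

import "fmt"

type Tier string

const (
	TierCheap    Tier = "cheap"
	TierMid      Tier = "mid"
	TierFrontier Tier = "frontier"
)

// workModels maps a tier to the model that performs the task.
// The mid and frontier IDs are placeholders, not real model names.
var workModels = map[Tier]string{
	TierCheap:    "claude-haiku-4-5-20251001",
	TierMid:      "sonnet-class-model-id",
	TierFrontier: "opus-class-model-id",
}

// modelFor returns the work model for a tier and falls back to the
// mid-tier model when the tier is unknown.
func modelFor(t Tier) string {
	if m, ok := workModels[t]; ok {
		return m
	}
	return workModels[TierMid]
}

func main() {
	fmt.Println(modelFor(TierFrontier))
	fmt.Println(modelFor("bogus"))
}
```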

03 Functional Options

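All JudgeRouter fields are configurable via the options pattern. NewJudgeRouter() with no arguments produces a router that is safe to call: every field has a default, although with the default nil completer each Classify call falls back to the default tier until WithCompleter is supplied.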

WithJudgeModel

router.go:38–40

Sets the model sent as Model in every classification CompletionRequest. The judge model should be fast and cheap: a Haiku-class model keeps the per-call overhead to a fraction of a cent.

Default: "claude-haiku-4-5-20251001"

WithDefaultTier

router.go:44–46

Sets the fallback tier returned on all failure paths: nil completer, bad prompt, judge error, or unrecognised response. Operators can set this to TierFrontier to fail safe on the high-quality side.

Default: TierMid

WithCompleter

router.go:49–51

Injects the Completer interface used for judge calls. Accepts any implementation: LiteLLM client, mock, or direct provider adapter. If nil at Classify time, the router falls back gracefully rather than panicking.

Default: nil (no-op fallback)

WithClassificationPrompt

router.go:55–57

Overrides the built-in prompt template. The template must contain exactly one %s verb and no other format verbs; escaped percent signs (%%) are allowed. The task description is injected via fmt.Sprintf. This allows the prompt to be tuned without modifying the router.

Default: built-in three-tier prompt with %s

WithLogger

router.go:61–63

Injects a *slog.Logger for all structured routing log output. Avoids mutating the global slog.Default() logger, making routers safe to use in parallel and in tests with captured log output.

Default: slog.Default()
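
The options above follow the usual Go functional-options shape, which can be sketched as below. This is a reduced illustration, not the contents of router.go: the unexported field names and the Option type name are assumptions, and only three of the five options are shown.

```go
package main

import (
	"fmt"
	"log/slog"
)

type Tier string

const (
	TierMid      Tier = "mid"
	TierFrontier Tier = "frontier"
)

// JudgeRouter holds the configured values (field names assumed).
type JudgeRouter struct {
	judgeModel  string
	defaultTier Tier
	logger      *slog.Logger
}

// Option mutates a router during construction.
type Option func(*JudgeRouter)

func WithJudgeModel(model string) Option {
	return func(r *JudgeRouter) { r.judgeModel = model }
}

func WithDefaultTier(t Tier) Option {
	return func(r *JudgeRouter) { r.defaultTier = t }
}

func WithLogger(l *slog.Logger) Option {
	return func(r *JudgeRouter) {
		if l != nil { // keep the default rather than store a nil logger
			r.logger = l
		}
	}
}

// NewJudgeRouter applies the documented defaults first, then each option in order.
func NewJudgeRouter(opts ...Option) *JudgeRouter {
	r := &JudgeRouter{
		judgeModel:  "claude-haiku-4-5-20251001",
		defaultTier: TierMid,
		logger:      slog.Default(),
	}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

func main() {
	plain := NewJudgeRouter()
	failSafe := NewJudgeRouter(WithDefaultTier(TierFrontier))
	fmt.Println(plain.judgeModel, plain.defaultTier, failSafe.defaultTier)
}
```

Because defaults are set before the options run, a caller only names the fields it wants to change.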

Built-in Classification Prompt

You are a task complexity classifier. Classify the following task into exactly one of three tiers.
Respond with ONLY one word — no punctuation, no explanation:
- "cheap" — simple lookups, formatting, summarizing short text, straightforward code fixes
- "mid" — moderate analysis, code generation, multi-step reasoning
- "frontier" — complex architecture, novel problem-solving, long-form creative work

Task: %s

MaxTokens=10, Temperature=0: the token cap leaves room for a single-word answer and truncates anything longer, and zero temperature keeps the classification as repeatable as the provider allows.

04 Test Coverage

The table below lists 8 test functions, with 11 sub-cases in the main Classify suite and 7 in the parseTier suite, covering every code path in router.go. The main suite is table-driven, and a mockCompleter captures call counts and the last request.

Test Function / Case	What it tests	Key assertions	Lines
`TestJudgeRouter_Classify`	Table-driven main suite: 11 sub-cases below
override tier skips classification	Valid OverrideTier bypasses completer	tier=frontier, calls=0, reason contains "override", RawResponse=""	37–49
judge returns cheap	Normal cheap classification path	tier=cheap, calls=1, reason "classified as cheap", RawResponse="cheap"	50–60
judge returns mid	Normal mid classification path	tier=mid, calls=1, reason "classified as mid", RawResponse="mid"	61–71
judge returns frontier	Normal frontier classification path	tier=frontier, calls=1, reason "classified as frontier", RawResponse="frontier"	72–82
uppercase CHEAP	Case-insensitive parseTier via full Classify path	tier=cheap, RawResponse="CHEAP" (original case preserved)	83–93
garbage response fallback	Unrecognised response falls back to defaultTier	tier=mid, reason "unrecognised response" + "bananas", RawResponse="bananas"	94–104
completer error fallback	Network error falls back gracefully	no error returned, tier=mid, reason "network timeout", RawResponse=""	105–115
HTTP 500 error fallback	Provider 500 error falls back gracefully	no error returned, tier=mid, reason "HTTP 500 Internal Server Error"	116–126
empty string response	Empty content treated as unrecognised, no panic	tier=mid, reason "falling back to default tier", RawResponse=""	127–137
whitespace-padded frontier	TrimSpace normalisation in parseTier	tier=frontier, RawResponse=" frontier " (original whitespace preserved)	138–148
override cheap on complex task	OverrideTier=cheap beats complexity of description	tier=cheap, calls=0, reason "override", RawResponse=""	149–161
`TestJudgeRouter_ParseTierCaseInsensitive`	Unit tests for parseTier directly — 7 cases	cheap/CHEAP/mid(padded)/Frontier/FRONTIER → correct tier; unknown/"" → invalid	197–216
`TestJudgeRouter_NoCompleterUsesDefault`	Nil completer path; no panic	tier=cheap (custom default), RawResponse="", reason "no completer configured"	218–233
`TestJudgeRouter_MissingFormatVerb`	Prompt with no %s rejected before completer call	tier=mid, calls=0, reason "exactly one %s verb", RawResponse=""	235–264
`TestJudgeRouter_ExcessFormatVerbs`	Prompt with two %s verbs rejected	tier=mid, calls=0, reason "exactly one %s verb and no other format verbs"	266–295
`TestJudgeRouter_NonSFormatVerb`	Prompt with %s + %d rejected (non-%s verb)	tier=mid, calls=0, reason "no other format verbs"	297–326
`TestJudgeRouter_InvalidOverrideTier`	Invalid OverrideTier ("bogus") warns and falls through to classification	tier=cheap (from judge), calls=1, warn log contains "invalid override tier ignored" + task_id + tier=bogus	328–378
`TestJudgeRouter_CustomClassificationPrompt`	WithClassificationPrompt overrides built-in prompt	completer receives prompt containing "Rate this: build a spaceship"	380–406
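
The parseTier behaviour exercised by the case-insensitivity and whitespace rows can be sketched as a small table-driven check. The body of parseTier is an assumption based on the "TrimSpace + ToLower" description, and returning an empty Tier for unrecognised input is one possible choice rather than the confirmed one.

```go
package main

import (
	"fmt"
	"strings"
)

type Tier string

const (
	TierCheap    Tier = "cheap"
	TierMid      Tier = "mid"
	TierFrontier Tier = "frontier"
)

func (t Tier) Valid() bool {
	return t == TierCheap || t == TierMid || t == TierFrontier
}

// parseTier trims and lower-cases the judge's reply. Anything that is not a
// known tier is returned as the empty (invalid) Tier.
func parseTier(s string) Tier {
	t := Tier(strings.ToLower(strings.TrimSpace(s)))
	if t.Valid() {
		return t
	}
	return ""
}

func main() {
	cases := []struct {
		in   string
		want Tier
	}{
		{"cheap", TierCheap},
		{"CHEAP", TierCheap},
		{" mid ", TierMid},
		{"Frontier", TierFrontier},
		{"FRONTIER", TierFrontier},
		{"bananas", ""},
		{"", ""},
	}
	for _, c := range cases {
		got := parseTier(c.in)
		fmt.Printf("%q -> %q (match=%v)\n", c.in, got, got == c.want)
	}
}
```

Normalisation applies only to the parsed tier; the router still stores the untouched reply in RawResponse, which is why the table expects "CHEAP" and " frontier " there.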

05 Council Review History

Five adversarial council review rounds were run before the feature reached a FOR verdict. Each round consisted of ADVOCATE, CRITIC, QUESTIONER, and ARBITER roles per the council protocol.

Council 1 (CONDITIONAL): Initial implementation review

First review of the JudgeRouter implementation. Three required fixes identified before FOR could be issued.

Add prompt %s validation: reject prompts missing the format verb to prevent silent task description omission
Add TestJudgeRouter_MissingFormatVerb test case to cover the new guard
Slog housekeeping: ensure all log calls use log/slog structured logging; no log.Printf remnants

Council 2 (CONDITIONAL): Prompt validation strengthening

The Council 1 fix for %s validation was deemed insufficient: it detected missing verbs but not verbs of the wrong type.

Strengthen the %s guard: reject prompts containing non-%s format verbs (e.g., %d, %v) in addition to the missing-verb check
The validation must count %s occurrences as exactly 1, and total % occurrences as exactly 1 (after stripping %% escape sequences)

Council 3 (CONDITIONAL): Non-%s verb rejection + OverrideTier test

Implementation updated with strengthened validation. Two remaining conditions identified.

Implement and test the non-%s format verb rejection path (TestJudgeRouter_NonSFormatVerb)
Add TestJudgeRouter_InvalidOverrideTier test: invalid OverrideTier must warn and fall through to classification, not silently use default tier

Council 4 (CONDITIONAL): Warn-leg and RawResponse assertions

NonSFormatVerb test and InvalidOverrideTier test added. Final condition: existing tests lacked assertions on log output and RawResponse in warn paths.

Add warn-leg assertions to TestJudgeRouter_InvalidOverrideTier: verify log output contains "invalid override tier ignored", task_id=iot1, tier=bogus
Add RawResponse assertion in the InvalidOverrideTier test (expect "cheap" since judge classifies successfully after warn)

Council 5 (FOR): All conditions satisfied

The conditions from all four earlier council rounds were satisfied. No new issues were found. Follow-on items were rolled into the implementation without blocking the FOR verdict.

Warn-leg assertions present and passing in TestJudgeRouter_InvalidOverrideTier
RawResponse assertions complete across all test cases
All 22 tests pass with go test ./internal/gateway/...
Compile-time Router interface assertion present: var _ Router = (*JudgeRouter)(nil)

06 Key Design Decisions

Four architectural choices that shaped this implementation, each with the rationale that was validated through the council review process.

Judge-then-work pattern

Motivation: 60–90% cost savings

Without intelligent routing, every task goes to a frontier model (Opus-class). A Haiku-class judge call costs a fraction of a cent and adds ~200ms p50 latency — negligible against the work model call. By classifying before routing, simple tasks (formatting, lookups, short summaries) never touch expensive models. The OverrideTier fast path eliminates judge overhead for callers that already know the tier, covering latency-critical paths.

Functional options over config struct

Go idiom: accept interfaces, return structs

A flat config struct would require every caller to specify all fields or use a zero-value struct with surprising defaults. Functional options make defaults explicit and self-documenting at the call site. Adding a new option (e.g., WithTimeout) is additive and backwards-compatible; no existing callers break. The pattern also enables per-test configuration without global state mutation.

WithLogger over slog.SetDefault

Test isolation + production safety

Calling slog.SetDefault in a constructor mutates global state, breaking test parallelism and leaking log output across router instances. WithLogger scopes the logger to a single router instance. Tests can inject a slog.New(slog.NewTextHandler(&buf, ...)) to capture and assert on specific log messages without affecting other test cases or the default logger.

%s format verb validation

Defence against operator misconfiguration

If an operator supplies a prompt via WithClassificationPrompt without a %s verb, fmt.Sprintf does not fail: it appends the task description as a %!(EXTRA string=...) error marker, so the judge model receives a malformed prompt instead of a clean task description. The guard checks that the prompt contains exactly one %s and no other format verbs (after stripping %% escapes). It runs before any completer call, so bad configuration is caught immediately, not silently absorbed.
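
A minimal sketch of such a guard follows, assuming it is implemented with simple string counting as the council notes describe. The validPrompt and render names are hypothetical. The sketch also shows what fmt.Sprintf produces when the verb is missing.

```go
package main

import (
	"fmt"
	"strings"
)

// validPrompt reports whether p contains exactly one %s and no other format
// verbs. Escaped percent signs (%%) are removed first, so every remaining '%'
// must belong to the single %s.
func validPrompt(p string) bool {
	stripped := strings.ReplaceAll(p, "%%", "")
	return strings.Count(stripped, "%s") == 1 && strings.Count(stripped, "%") == 1
}

// render injects the task description the same way the router does.
func render(prompt, description string) string {
	return fmt.Sprintf(prompt, description)
}

func main() {
	prompts := []string{
		"Rate this: %s",          // valid
		"Be 100%% sure. Task: %s", // valid, escape is ignored
		"Classify this task.",    // missing %s
		"Task: %s, again: %s",    // two %s verbs
		"Task %d: %s",            // extra non-%s verb
	}
	for _, p := range prompts {
		fmt.Printf("%-28q valid=%v\n", p, validPrompt(p))
	}

	// Without the guard, a prompt with no verb is not rejected by fmt.
	fmt.Println(render("Classify this task.", "build a spaceship"))
}
```

The count is deliberately strict: a width or flag variant such as %5s is rejected as well, which keeps the check simple at the cost of a few legal templates.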

Data Flow: Classify Path with RawResponse Population

Classify(ctx, task)
  // Fast path
  if task.OverrideTier != "" && task.OverrideTier.Valid()
    slog.InfoContext  → return {Tier: override, RawResponse: ""}
  else if task.OverrideTier != ""
    slog.WarnContext  → warn: invalid tier, continue to classification

  // Guard rails
  if r.completer == nil
    slog.WarnContext  → return {Tier: defaultTier, RawResponse: ""}
  if prompt %s count != 1 || total % count != 1
    slog.WarnContext  → return {Tier: defaultTier, RawResponse: ""}

  // Judge call
  prompt := fmt.Sprintf(classificationPrompt, task.Description)
  resp, err := r.completer.Complete(ctx, {MaxTokens: 10, Temperature: 0})
  if err != nil
    slog.WarnContext  → return {Tier: defaultTier, RawResponse: ""}

  // Parse + return
  tier := parseTier(resp.Content)  // TrimSpace + ToLower
  if !tier.Valid()
    slog.WarnContext  → return {Tier: defaultTier, RawResponse: resp.Content}  // raw preserved
  slog.InfoContext    → return {Tier: tier, Model: resp.Model, RawResponse: resp.Content}

  // Note: Classify never returns a non-nil error