Task Complexity Classification — Model Gateway Feature • PR #47 • Issue #34
The JudgeRouter wraps a Completer interface and classifies every incoming TaskSpec into one of three tiers. Functional options configure every field at construction time; the zero value is never used directly.
defaultTier with logged reason, never a non-nil error
Sequence view showing the caller–router–judge interaction across the three routing outcomes. The completer is only called on the normal classification path.
Haiku-class models / Claude Haiku 4.5
Sonnet-class models / Claude Sonnet 4.6
Opus-class models / Claude Opus 4
All JudgeRouter fields are configurable via the options pattern. NewJudgeRouter() with no
arguments produces a fully usable router with sensible defaults.
router.go:38–40
Sets the model sent as Model in every classification CompletionRequest. The judge model
should be fast and cheap, for example a Haiku-class model that keeps per-call overhead to a fraction of a cent.
"claude-haiku-4-5-20251001"
router.go:44–46
Sets the fallback tier returned on all failure paths: nil completer, bad prompt, judge error, or
unrecognised response. Operators can set this to TierFrontier to fail safe on the
high-quality side.
TierMid
router.go:49–51
Injects the Completer interface used for judge calls. Accepts any implementation:
LiteLLM client, mock, or direct provider adapter. If nil at Classify time, the router falls back
gracefully rather than panicking.
nil (no-op fallback)
router.go:55–57
Overrides the built-in prompt template. Must contain exactly one %s verb and no other
format verbs. Task description is injected via fmt.Sprintf. Enables prompt tuning without
recompiling.
%s
router.go:61–63
Injects a *slog.Logger for all structured routing log output. Avoids mutating the
global slog.Default() logger, making routers safe to use in parallel and in tests
with captured log output.
slog.Default()
MaxTokens=10, Temperature=0: single-word response, deterministic, caps output tightly to prevent verbose hallucination
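The options and request parameters above suggest a control flow roughly like the following. This is a minimal sketch, not the code in router.go: the `router` struct, the `classify` signature, the `Completer` method shape, the `stubCompleter` helper, and the reason strings are all assumptions made for illustration, and prompt validation is left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Tier names follow the text; the underlying string values are assumed.
type Tier string

const (
	TierCheap    Tier = "cheap"
	TierMid      Tier = "mid"
	TierFrontier Tier = "frontier"
)

// CompletionRequest is a hypothetical shape; only the field names Model,
// MaxTokens and Temperature appear in the text.
type CompletionRequest struct {
	Model       string
	Prompt      string
	MaxTokens   int
	Temperature float64
}

// Completer is an assumed single-method interface for the judge call.
type Completer interface {
	Complete(req CompletionRequest) (string, error)
}

type router struct {
	completer   Completer
	judgeModel  string
	defaultTier Tier
	prompt      string // assumed to be pre-validated: exactly one %s
}

func validTier(t Tier) bool {
	return t == TierCheap || t == TierMid || t == TierFrontier
}

// classify returns a tier plus a human-readable reason. Failures fall back
// to defaultTier instead of surfacing an error to the caller.
func (r *router) classify(override Tier, description string) (Tier, string) {
	if validTier(override) {
		return override, "override" // fast path: the completer is not called
	}
	if r.completer == nil {
		return r.defaultTier, "no completer configured"
	}
	raw, err := r.completer.Complete(CompletionRequest{
		Model:       r.judgeModel,
		Prompt:      fmt.Sprintf(r.prompt, description),
		MaxTokens:   10, // a single word is enough
		Temperature: 0,  // deterministic output
	})
	if err != nil {
		return r.defaultTier, "judge error: " + err.Error()
	}
	tier := Tier(strings.ToLower(strings.TrimSpace(raw)))
	if !validTier(tier) {
		return r.defaultTier, fmt.Sprintf("unrecognised response %q", raw)
	}
	return tier, "classified as " + string(tier)
}

// stubCompleter returns a canned answer or error, standing in for the judge.
type stubCompleter struct {
	out string
	err error
}

func (s stubCompleter) Complete(CompletionRequest) (string, error) { return s.out, s.err }

var errTimeout = errors.New("network timeout")

func main() {
	base := router{judgeModel: "claude-haiku-4-5-20251001", defaultTier: TierMid, prompt: "Classify: %s"}

	ok, failing, unset := base, base, base
	ok.completer = stubCompleter{out: "cheap"}
	failing.completer = stubCompleter{err: errTimeout}

	fmt.Println(ok.classify("", "rename a variable"))           // normal path
	fmt.Println(ok.classify(TierFrontier, "rename a variable")) // override path
	fmt.Println(failing.classify("", "rename a variable"))      // judge error fallback
	fmt.Println(unset.classify("", "rename a variable"))        // nil completer fallback
}
```

Returning a reason string rather than an error keeps the caller's routing code free of error branches, which matches the "never a non-nil error" behaviour described for the failure paths.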
25 test cases across 8 test functions covering every code path in router.go.
All tests use table-driven style with a mockCompleter that captures call counts and
the last request.
| Test Function / Case | What it tests | Key assertions | Lines |
|---|---|---|---|
| TestJudgeRouter_Classify | Table-driven main suite, 12 sub-cases below | | |
| override tier skips classification | Valid OverrideTier bypasses completer | tier=frontier, calls=0, reason contains "override", RawResponse="" | 37–49 |
| judge returns cheap | Normal cheap classification path | tier=cheap, calls=1, reason "classified as cheap", RawResponse="cheap" | 50–60 |
| judge returns mid | Normal mid classification path | tier=mid, calls=1, reason "classified as mid", RawResponse="mid" | 61–71 |
| judge returns frontier | Normal frontier classification path | tier=frontier, calls=1, reason "classified as frontier", RawResponse="frontier" | 72–82 |
| uppercase CHEAP | Case-insensitive parseTier via full Classify path | tier=cheap, RawResponse="CHEAP" (original case preserved) | 83–93 |
| garbage response fallback | Unrecognised response falls back to defaultTier | tier=mid, reason "unrecognised response" + "bananas", RawResponse="bananas" | 94–104 |
| completer error fallback | Network error falls back gracefully | no error returned, tier=mid, reason "network timeout", RawResponse="" | 105–115 |
| HTTP 500 error fallback | Provider 500 error falls back gracefully | no error returned, tier=mid, reason "HTTP 500 Internal Server Error" | 116–126 |
| empty string response | Empty content treated as unrecognised, no panic | tier=mid, reason "falling back to default tier", RawResponse="" | 127–137 |
| whitespace-padded frontier | TrimSpace normalisation in parseTier | tier=frontier, RawResponse=" frontier " (original whitespace preserved) | 138–148 |
| override cheap on complex task | OverrideTier=cheap beats complexity of description | tier=cheap, calls=0, reason "override", RawResponse="" | 149–161 |
| TestJudgeRouter_ParseTierCaseInsensitive | Unit tests for parseTier directly, 7 cases | cheap/CHEAP/mid(padded)/Frontier/FRONTIER → correct tier; unknown/"" → invalid | 197–216 |
| TestJudgeRouter_NoCompleterUsesDefault | Nil completer path; no panic | tier=cheap (custom default), RawResponse="", reason "no completer configured" | 218–233 |
| TestJudgeRouter_MissingFormatVerb | Prompt with no %s rejected before completer call | tier=mid, calls=0, reason "exactly one %s verb", RawResponse="" | 235–264 |
| TestJudgeRouter_ExcessFormatVerbs | Prompt with two %s verbs rejected | tier=mid, calls=0, reason "exactly one %s verb and no other format verbs" | 266–295 |
| TestJudgeRouter_NonSFormatVerb | Prompt with %s + %d rejected (non-%s verb) | tier=mid, calls=0, reason "no other format verbs" | 297–326 |
| TestJudgeRouter_InvalidOverrideTier | Invalid OverrideTier ("bogus") warns and falls through to classification | tier=cheap (from judge), calls=1, warn log contains "invalid override tier ignored" + task_id + tier=bogus | 328–378 |
| TestJudgeRouter_CustomClassificationPrompt | WithClassificationPrompt overrides built-in prompt | completer receives prompt containing "Rate this: build a spaceship" | 380–406 |
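
Several rows above exercise parseTier's normalisation (case-insensitive, whitespace-trimmed, unknown values rejected). A minimal sketch of that behaviour follows; the return signature and the `Tier` string values are assumptions, since the real parseTier in router.go is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// Tier names follow the document; the string values are assumed.
type Tier string

const (
	TierCheap    Tier = "cheap"
	TierMid      Tier = "mid"
	TierFrontier Tier = "frontier"
)

// parseTier trims surrounding whitespace and lower-cases the judge's
// response before matching it against the known tiers. The boolean is
// false for anything unrecognised, including the empty string.
func parseTier(raw string) (Tier, bool) {
	switch t := Tier(strings.ToLower(strings.TrimSpace(raw))); t {
	case TierCheap, TierMid, TierFrontier:
		return t, true
	default:
		return "", false
	}
}

func main() {
	for _, raw := range []string{"cheap", "CHEAP", " mid ", "Frontier", "bananas", ""} {
		tier, ok := parseTier(raw)
		fmt.Printf("%q -> tier=%q ok=%v\n", raw, tier, ok)
	}
}
```

Because normalisation happens on a copy, the caller can still record the judge's original text as RawResponse, which is what the "original case preserved" and "original whitespace preserved" assertions rely on.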
Five adversarial council review rounds were run before the feature reached a FOR verdict. Each round consisted of ADVOCATE, CRITIC, QUESTIONER, and ARBITER roles per the council protocol.
First review of the JudgeRouter implementation. Three required fixes identified before FOR could be issued.
- %s validation: reject prompts missing the format verb to prevent silent task description omission
- TestJudgeRouter_MissingFormatVerb test case to cover the new guard
- log/slog structured logging; no log.Printf remnants

The Council 1 fix for %s validation was deemed insufficient: it only detected missing verbs but not wrong-type verbs.
- %s guard: reject prompts containing non-%s format verbs (e.g., %d, %v) in addition to the missing-verb check
- %s occurrences as exactly 1, and total % occurrences as exactly 1 (after stripping %% escape sequences)

Implementation updated with strengthened validation. Two remaining conditions identified.
- %s format verb rejection path (TestJudgeRouter_NonSFormatVerb)
- TestJudgeRouter_InvalidOverrideTier test: invalid OverrideTier must warn and fall through to classification, not silently use default tier

NonSFormatVerb test and InvalidOverrideTier test added. Final condition: existing tests lacked assertions on log output and RawResponse in warn paths.
- TestJudgeRouter_InvalidOverrideTier: verify log output contains "invalid override tier ignored", task_id=iot1, tier=bogus
- RawResponse assertion in the InvalidOverrideTier test (expect "cheap" since judge classifies successfully after warn)

All four council rounds’ conditions were satisfied. No new issues found. Follow-on items rolled in to the implementation without blocking the FOR verdict.
- TestJudgeRouter_InvalidOverrideTier
- go test ./internal/gateway/...
- var _ Router = (*JudgeRouter)(nil)

Four architectural choices that shaped this implementation, each with the rationale that was validated through the council review process.
Motivation: 60–90% cost savings
Without intelligent routing, every task goes to a frontier model (Opus-class). A Haiku-class judge
call costs a fraction of a cent and adds ~200ms p50 latency — negligible against the work model
call. By classifying before routing, simple tasks (formatting, lookups, short summaries) never touch
expensive models. The OverrideTier fast path eliminates judge overhead for callers that
already know the tier, covering latency-critical paths.
Go idiom: accept interfaces, return structs
A flat config struct would require every caller to specify all fields or use a zero-value struct
with surprising defaults. Functional options make defaults explicit and self-documenting at the call
site. Adding a new option (e.g., WithTimeout) is additive and backwards-compatible;
no existing callers break. The pattern also enables per-test configuration without global state
mutation.
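For readers less familiar with the pattern, a minimal sketch of functional options is given here. The names `withJudgeModel`, `withDefaultTier`, `newRouter`, and the `router` fields are hypothetical stand-ins, not the identifiers used in router.go; the default values are the ones quoted in the option descriptions earlier.

```go
package main

import "fmt"

// router is a stand-in for JudgeRouter with two illustrative fields.
type router struct {
	judgeModel  string
	defaultTier string
}

// Option mutates a router during construction.
type Option func(*router)

// withJudgeModel and withDefaultTier are hypothetical option names.
func withJudgeModel(model string) Option {
	return func(r *router) { r.judgeModel = model }
}

func withDefaultTier(tier string) Option {
	return func(r *router) { r.defaultTier = tier }
}

// newRouter applies defaults first, then each option in order, so a call
// with no arguments still yields a usable value.
func newRouter(opts ...Option) *router {
	r := &router{
		judgeModel:  "claude-haiku-4-5-20251001",
		defaultTier: "mid",
	}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

func main() {
	fmt.Printf("%+v\n", *newRouter())
	fmt.Printf("%+v\n", *newRouter(withDefaultTier("frontier")))
}
```

Adding a further option later only means adding one more function of type `Option`; existing `newRouter(...)` call sites compile unchanged.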
Test isolation + production safety
Calling slog.SetDefault in a constructor mutates global state, breaking test
parallelism and leaking log output across router instances. WithLogger scopes the
logger to a single router instance. Tests can inject a slog.New(slog.NewTextHandler(&buf, ...))
to capture and assert on specific log messages without affecting other test cases or the default
logger.
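The capture technique described above can be sketched as follows, assuming Go 1.21+ for log/slog. The `router` type, `newRouter`, `warnInvalidOverride`, `captureWarn`, and `hasAll` are hypothetical helpers for illustration; only the log message and attribute keys are taken from the test table.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"strings"
)

// router is a stand-in holding an instance-scoped logger.
type router struct{ logger *slog.Logger }

// newRouter falls back to slog.Default() but never calls slog.SetDefault,
// so constructing a router leaves global state untouched.
func newRouter(l *slog.Logger) *router {
	if l == nil {
		l = slog.Default()
	}
	return &router{logger: l}
}

// warnInvalidOverride is a hypothetical method emitting the warn line.
func (r *router) warnInvalidOverride(taskID, tier string) {
	r.logger.Warn("invalid override tier ignored", "task_id", taskID, "tier", tier)
}

// captureWarn routes one router's log output into a private buffer.
func captureWarn(taskID, tier string) string {
	var buf bytes.Buffer
	r := newRouter(slog.New(slog.NewTextHandler(&buf, nil)))
	r.warnInvalidOverride(taskID, tier)
	return buf.String()
}

// hasAll reports whether out contains every wanted substring.
func hasAll(out string, wants ...string) bool {
	for _, w := range wants {
		if !strings.Contains(out, w) {
			return false
		}
	}
	return true
}

func main() {
	out := captureWarn("iot1", "bogus")
	fmt.Print(out)
	fmt.Println(hasAll(out, "invalid override tier ignored", "task_id=iot1", "tier=bogus"))
}
```

Each test owns its own buffer, so parallel tests cannot see each other's log lines.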
Defence against operator misconfiguration
If an operator supplies a prompt via WithClassificationPrompt without a %s
verb, fmt.Sprintf returns no error: it appends a %!(EXTRA string=...) marker instead of placing
the task description where the template intended, so the judge model classifies a malformed
prompt. The guard checks that the prompt contains exactly one %s
and no other format verbs (after stripping %% escapes). This is validated before any
completer call so bad configuration is caught immediately, not silently absorbed.
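One way to express this guard is sketched below. It follows the counting rule stated in the council notes (exactly one %s, and exactly one % in total once %% escapes are stripped); the function name `validPrompt` is hypothetical and the real check in router.go may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// validPrompt reports whether tmpl contains exactly one %s verb and no
// other format verbs. Literal percent signs written as %% are removed
// first so they do not count as verbs.
func validPrompt(tmpl string) bool {
	stripped := strings.ReplaceAll(tmpl, "%%", "")
	return strings.Count(stripped, "%s") == 1 && strings.Count(stripped, "%") == 1
}

func main() {
	templates := []string{
		"Rate this: %s",         // one %s
		"Rate this task",        // missing verb
		"Compare %s with %s",    // too many %s
		"Task %s, priority %d",  // non-%s verb
		"Be 100%% certain: %s",  // escaped percent plus one %s
	}
	for _, tmpl := range templates {
		fmt.Printf("%-24q valid=%v\n", tmpl, validPrompt(tmpl))
	}
}
```

A simple count like this is stricter than fmt's own parser: it also rejects flagged forms such as %10s, which seems acceptable for a template that only ever needs a plain %s.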