Eval Harness

Defensible cost/quality eval comparing 9 LLMs (5 open-weight on DGX Spark via Ollama, 4 closed-weight via API) across 4 real production tasks from Sift. Cross-vendor judging, hardware-amortized cost, verifiable held-out lock. Routing-decision framework, not a benchmark.

  • 9 models · 4 tasks · planned n=870 across eval sets
  • Cross-vendor judging · Bradley-Terry MM ranking
  • Verifiable held-out lock (SHA-256 + git)
  • Hardware-amortized cost on DGX Spark
Translation artifact
2026 · Solo build
01 Problem
Situation: Public LLM leaderboards (MMLU, HumanEval, BIG-bench) measure general capability on synthetic tasks. They tell an applied team whether a model is competent in the abstract.
Complication: They do not answer the question that actually drives a production routing decision: for this specific pipeline stage, on this specific corpus, can this specific open-weight model replace the frontier API I'm paying for, and at what cost-per-quality-point trade-off? The only honest way to answer that is to run the actual workload through both and measure with a methodology that holds up under reviewer pressure.
Question: What does an eval need to look like for an applied team to defensibly route LLM calls between local open-weight and frontier APIs — and what does the methodology have to do to survive scrutiny when the conclusions land?
02 Requirements
  • Applied team making a routing decision

    Per-task cost/quality framing, not a single composite score. The decision is per pipeline stage — categorization may route differently from summarization.

    Four tasks (categorization, summarization, extraction, RAG), each scored on its own quality + cost frontier; modules A/B implemented, C/D specified and pending

  • Skeptical reviewer

    LLM-as-judge architecture that controls for self-preference bias — Sonnet scoring Sonnet's outputs is unfalsifiable.

    Cross-vendor judging: Sonnet 4.6 judges non-Anthropic-containing pairs (21 of 36); GPT-4o judges Anthropic-containing pairs (15 of 36). 50-pair overlap with inter-judge Cohen's κ reported.

  • Reproducibility-first reader

    Held-out discipline that survives the obvious 'how do I know you didn't peek' question.

    Runner enforces held-out access behind an explicit --include-held-out flag and verifies the set against a committed SHA-256 manifest (implemented in eval-harness); the real 20% Set-1 lock is committed to git before any prompt iteration, at corpus pull. Proof is in the commit history.

  • Procurement / cost-side reader

    Hardware-amortized cost methodology comparable to published API rates — not a hand-wave.

    DGX Spark capex amortized over a 3-year useful life, plus measured power draw × wall-clock × FL kWh rate. All assumptions stated; dual production-scale view scoped for v0.3.

03 Decision

Production-workload eval with cross-vendor judging architecture, verifiable held-out lock, hardware-amortized cost methodology, Bradley-Terry MM pairwise ranking, and a v0.2 spec critique round before any code was written

chosen
  • meets criterion: Methodological defensibility under reviewer pressure
  • meets criterion: Per-task actionability for routing decisions
  • meets criterion: Reproducibility from commit history alone
  • meets criterion: Reusability across other production systems

The methodology IS the deliverable; the leaderboard is the worked example. Cross-vendor judging eliminates self-preference bias on the pairwise summarization task — the single most common LLM-eval methodology failure. Bradley-Terry MM (Hunter 2004) over all 36 model pairs yields a global strength ranking rather than the asymmetric everyone-vs-Haiku design, which would leave Haiku itself unrankable. Hardware-amortized cost lets local compute be compared to API token pricing on a single axis. The held-out lock — enforced in the runner, which refuses held-out access without an explicit flag and verifies the set against a committed SHA-256 manifest — moves 'I held out 20%' from a vibes claim to a verifiable one. The v0.2 critique round caught nine real methodology issues (judge contamination, scoring conflation on JSON, sample-size power, 70B-on-Task-A throughput infeasibility) before any number was computed, applied them as a tracked diff, and deferred three to v0.3 as post-data-collection decisions.

Extend a public benchmark suite (MMLU + HumanEval + BIG-bench) with cost-per-1M-tokens columns

  • partially meets criterion: Methodological defensibility under reviewer pressure
  • does not meet criterion: Per-task actionability for routing decisions
  • partially meets criterion: Reproducibility from commit history alone
  • partially meets criterion: Reusability across other production systems

Run Sift's pipeline through each model and report aggregate accuracy / cost without cross-vendor judging or held-out controls

  • does not meet criterion: Methodological defensibility under reviewer pressure
  • partially meets criterion: Per-task actionability for routing decisions
  • does not meet criterion: Reproducibility from commit history alone
  • partially meets criterion: Reusability across other production systems
04 Solution

A reusable harness (adapter Protocol + task modules + runner) plus a publication-quality methodology page plus a routing-decision framework — with the leaderboard as the worked example, not the primary artifact.

Cross-vendor judging on pairwise summarization
Sonnet 4.6 judges 21 of 36 model pairs (non-Anthropic-containing); GPT-4o judges 15 of 36 (Anthropic-containing). A 50-pair overlap subset is judged by both with inter-judge Cohen's κ reported (caveat triggered if κ < 0.6). Bradley-Terry MM fits a global strength ranking from the full pairwise matrix.
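The 21/15 split follows directly from the model list, and a few lines of Python make that checkable. This is a minimal sketch, not the harness's own code: the model identifiers, the vendor tags, and the `assign_judge` helper are illustrative assumptions.

```python
from itertools import combinations

# Model names come from the design section; the identifiers, vendor tags,
# and helper below are illustrative assumptions, not the harness's code.
MODELS = {
    "llama-3.1-8b": "meta",
    "qwen-2.5-7b": "alibaba",
    "qwen-2.5-14b": "alibaba",
    "deepseek-v2-lite": "deepseek",
    "llama-3.1-70b-q4": "meta",
    "claude-haiku-4.5": "anthropic",
    "claude-sonnet-4.6": "anthropic",
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
}


def assign_judge(model_a: str, model_b: str) -> str:
    """Send a pair to GPT-4o if either candidate is an Anthropic model, else to Sonnet 4.6."""
    vendors = {MODELS[model_a], MODELS[model_b]}
    return "gpt-4o" if "anthropic" in vendors else "claude-sonnet-4.6"


# Every unordered pair of the nine candidates: C(9, 2) = 36.
pairs = list(combinations(MODELS, 2))
assignments = {pair: assign_judge(*pair) for pair in pairs}
sonnet_pairs = [p for p, judge in assignments.items() if judge == "claude-sonnet-4.6"]
gpt4o_pairs = [p for p, judge in assignments.items() if judge == "gpt-4o"]

print(len(pairs), len(sonnet_pairs), len(gpt4o_pairs))  # 36 21 15
```

The seven non-Anthropic candidates form C(7, 2) = 21 pairs; the remaining 15 each contain Haiku or Sonnet.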
Verifiable held-out lock
Held-out items live in data/holdout/ separately from data/dev/. The runner refuses held-out access without an explicit --include-held-out flag and verifies the set against a committed SHA-256 manifest before scoring, with tamper-detection tests (implemented in eval-harness). The real Set-1 holdout.sha256 is committed before any prompt tuning begins, at corpus pull; the final-scoring run then lets any reviewer verify the hash never moved and that the held-out flag appears only on the final run.
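A gate of this kind can be sketched in stdlib Python. The sketch below is an assumption about shape, not the eval-harness implementation: the function names, the `include_held_out` parameter standing in for the CLI flag, and the sha256sum-style manifest format (`<hexdigest>  <relative path>` per line) are all hypothetical.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path) -> str:
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(holdout_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to holdout_dir) to its digest."""
    return {
        p.relative_to(holdout_dir).as_posix(): hash_file(p)
        for p in sorted(holdout_dir.rglob("*"))
        if p.is_file()
    }


def parse_manifest(text: str) -> dict[str, str]:
    """Parse '<hexdigest>  <relative path>' lines (assumed manifest format)."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            digest, name = line.split(maxsplit=1)
            entries[name.strip()] = digest
    return entries


def load_holdout(holdout_dir: Path, manifest_path: Path,
                 include_held_out: bool = False) -> list[Path]:
    """Return held-out files only if the flag is set and every hash matches."""
    if not include_held_out:
        raise PermissionError("held-out access requires --include-held-out")
    expected = parse_manifest(manifest_path.read_text())
    actual = build_manifest(holdout_dir)
    if actual != expected:
        # Covers edited, added, and removed files alike.
        raise ValueError("held-out set does not match the committed manifest")
    return sorted(holdout_dir / name for name in actual)
```

Comparing the full path-to-digest mapping, rather than checking files one by one, means an added or deleted item fails the gate just as an edited one does. The manifest is assumed to live outside the held-out directory so it does not hash itself.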
Hardware-amortized cost methodology
Local compute cost = DGX Spark capex / (3 years × 365 × 24 hours) × wall-clock hours + measured power draw (kW) × wall-clock hours × FL residential kWh rate. API models priced at posted token rates as of run date. Per-task tier split: 70B Q4 reported as quality ceiling but excluded from the deployment cost view because expected DGX Spark throughput is too low for Sift's daily article volume.
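As a worked sketch of the local-cost arithmetic: the amortization term charges each wall-clock hour its share of capex over the useful life, and the energy term multiplies power draw by wall-clock hours so both terms come out in dollars. The class and function names are hypothetical, and the numbers in the example are placeholders chosen for easy arithmetic, not the actual DGX Spark price, power draw, or FL rate.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 365 * 24


@dataclass(frozen=True)
class LocalCostAssumptions:
    capex_usd: float          # purchase price of the machine
    useful_life_years: float  # 3 in the methodology described above
    power_draw_kw: float      # measured average draw during the run
    usd_per_kwh: float        # residential electricity rate


def local_run_cost(a: LocalCostAssumptions, wall_clock_hours: float) -> float:
    """Amortized hardware cost plus energy cost for one run, in USD."""
    amortized = a.capex_usd / (a.useful_life_years * HOURS_PER_YEAR) * wall_clock_hours
    energy = a.power_draw_kw * wall_clock_hours * a.usd_per_kwh
    return amortized + energy


# Placeholder values only (NOT the real capex, draw, or rate):
# 26,280 USD over 3 years is exactly 1 USD per hour of amortization.
example = LocalCostAssumptions(
    capex_usd=26_280.0,
    useful_life_years=3,
    power_draw_kw=0.2,
    usd_per_kwh=0.10,
)
print(round(local_run_cost(example, wall_clock_hours=2.0), 2))
```

Dividing the result by the tokens processed in the same run would put it on the same per-token axis as posted API rates.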
Reusable adapter + task abstractions
ModelAdapter is a Protocol with one method (complete(prompt, params) → Completion). Tasks are self-contained modules exporting a prompt template, a parser, and a scorer. Swapping the tasks/ directory and pointing at a new dataset is what makes the harness reusable for GridPulse, Tarazu, or any other ML product without re-engineering the runner.
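A one-method Protocol of this shape might look like the sketch below. Only `ModelAdapter`, `complete(prompt, params)`, and `Completion` are named in the text; the token and latency fields on `Completion`, the `MockAdapter` internals, and the `run_task` helper are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Protocol, runtime_checkable


@dataclass(frozen=True)
class Completion:
    text: str
    # Fields beyond `text` are assumptions about what a cost/latency eval needs.
    input_tokens: int = 0
    output_tokens: int = 0
    latency_s: float = 0.0


@runtime_checkable
class ModelAdapter(Protocol):
    def complete(self, prompt: str, params: Mapping[str, Any]) -> Completion: ...


class MockAdapter:
    """Deterministic stand-in for tests; satisfies ModelAdapter structurally."""

    def __init__(self, canned: str) -> None:
        self.canned = canned

    def complete(self, prompt: str, params: Mapping[str, Any]) -> Completion:
        return Completion(
            text=self.canned,
            input_tokens=len(prompt.split()),       # crude whitespace token count
            output_tokens=len(self.canned.split()),
        )


def run_task(adapter: ModelAdapter, template: str,
             parse: Callable[[str], Any], score: Callable[[Any, Any], float],
             item: Mapping[str, Any]) -> float:
    """Hypothetical runner step: fill the template, call the model, parse, score."""
    prompt = template.format(**item["inputs"])
    completion = adapter.complete(prompt, {"temperature": 0.0})
    return score(parse(completion.text), item["label"])
```

Because the Protocol is structural, an Ollama or API adapter needs no shared base class; the runner depends only on the `complete` signature, which is what keeps task modules swappable.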
05 Outcome
  • Methodology page

    Shipped publication-quality at evals.kristenmartino.ai/methodology

    Cross-vendor judging, hardware-amortized cost, contamination acknowledgement, 10 sections, 9 cited refs (incl. Hunter 2004 BT MM, Panickssery 2024 self-preference)

  • Harness infrastructure

    End-to-end with 47 tests passing

    Adapter Protocol (Ollama + Anthropic + OpenAI + Mock) · task modules (categorization + summarization) · runner with JSONL reproducibility headers + resumability + enforced held-out gate + CLI · Bradley-Terry MM

  • Pre-flight scripts

    5 stdlib-only

    Timing benchmark · length-stratified sampler · category distribution check · API cost estimator · annotation validator

  • v0.2 spec critique

    9 of 11 items applied

    Cross-judge calibration overlap, JSON-validity vs F1 split, 70B tier split, held-out lock mechanism, sample-size power statement; 3 items deferred to v0.3 as post-Task-A decisions

  • Projected v0.2 API spend

    $99.96

    4 closed-weight × 4 tasks + safety + cross-judge overlap, ~3 hours wall-clock at 50 RPM rate limit; Sonnet 4.6 accounts for 69% of the spend (it is both a candidate and the primary judge)

  • Phase 1 leaderboard numbers

    Pending

    Execution begins once Sift corpus is pulled + 70B timing benchmark runs on DGX Spark

Overview

Most leaderboards for open-weight vs frontier LLMs answer a question one level too abstract for an applied team — "how does Llama 3.1 8B score on MMLU vs Sonnet 4.6" — and stop there. The question that determines whether a team actually switches is the next one down: for this specific pipeline stage, on this specific corpus, can this specific open-weight model replace the frontier API I'm paying for, and at what cost-per-quality-point trade-off?

This harness is built to answer that question on a real production workload — Sift's news-pipeline stages (categorization, summarization, structured extraction, grounded RAG) — across nine LLMs (five open-weight on a local DGX Spark via Ollama; four closed-weight via API). The deliverable is a routing-decision framework, not a benchmark leaderboard. The leaderboard is the worked example.

Role: Strategy, experimental design, methodology, harness engineering
Year: 2026
Domain: Applied LLM evaluation · production routing
Stack: Python · Ollama · Anthropic + OpenAI APIs · Bradley-Terry MM · Next.js (site)
Hardware: NVIDIA DGX Spark · 128 GB unified memory · Ollama local inference
Status: Methodology shipped · Phase 1 execution in progress

Problem framing

Three observations shape the design:

  1. Public benchmarks measure general capability; production teams need workload-specific answers. MMLU and HumanEval tell you whether a model is competent in the abstract. They don't tell you whether Qwen 2.5 7B can replace Claude Haiku in your article-categorization step without quality regression. The only way to know that is to run your actual workload through both and measure.
  2. LLM-as-judge methodology is one bad choice away from being unfalsifiable. Pairwise preference judged by Sonnet, on a candidate set that includes Sonnet, produces a number you cannot falsify — judges measurably favor their own outputs (Panickssery et al. 2024). The fix isn't to drop the methodology; it's to architect around it. Cross-vendor judging is the lever.
  3. "I held out 20% for final scoring" is a vibes claim unless the lock is verifiable. Anyone can say they held out the data and didn't peek during prompt iteration. The version that survives scrutiny is the one where the held-out set is SHA-256 hashed and committed to git before any tuning happens, and the runner enforces the gate. Then the proof is in the commit history, not the author's word.

Design

Nine models, three tiers.

  • Open-weight, deployment-feasible (eligible for hybrid-routing recommendations) — Llama 3.1 8B Instruct, Qwen 2.5 7B Instruct, Qwen 2.5 14B Instruct, DeepSeek V2 Lite.
  • Open-weight, quality-ceiling reference — Llama 3.1 70B Q4. Included as an upper bound on what open-weight can achieve on Sift's tasks, but excluded from the deployment cost view because expected DGX Spark throughput (~10–15 tok/s on 3–4K-token articles) is infeasible at Sift's daily volume. A pre-flight timing benchmark confirms or rejects this split before the full eval runs.
  • Closed-weight reference — Claude Haiku 4.5 (Sift's current production model; the bar to beat), Claude Sonnet 4.6 (quality ceiling; also primary judge), GPT-4o (cross-vendor judge), GPT-4o mini.

Four tasks chosen because they're the actual pipeline. Task A is single-label multiclass categorization (the highest-volume step, biggest cost lever). Task B is summarization for the UI (user-facing quality matters most). Task C is structured entity extraction (tests whether open-weight can hold schemas). Task D is grounded RAG with citations (the agentic capability the rest of the industry actually cares about). Plus a 50-prompt safety smoke test for deployment-blocking regressions.

Identical prompt content across models; native chat templates per model. A foreign chat template (Qwen prompts wrapped in Llama 3's role tokens) can degrade performance for tokenizer reasons unrelated to capability. So the content is shared; the framing is per-model.

Methodology decisions worth flagging

The credibility of a leaderboard sits in its methodology, not its numbers. Five decisions earn the most reviewer trust:

  1. Cross-vendor judging architecture. Sonnet 4.6 judges the 21 of 36 model pairs that don't include an Anthropic candidate; GPT-4o judges the 15 of 36 pairs that do. A 50-pair overlap subset is judged by both, with inter-judge Cohen's κ reported (caveat triggered if κ < 0.6). Eliminates the case where Sonnet's outputs are judged by Sonnet — which would otherwise make Sonnet's score unfalsifiable.
  2. Bradley-Terry MM ranking across all 36 pairs, not pairwise-vs-reference. Asymmetric pairwise (everyone-vs-Haiku) makes Haiku unrankable as a candidate. The BT MM algorithm (Hunter 2004) fits a global strength ranking from the full pairwise matrix and lets every model be evaluated against every other.
  3. Hardware-amortized cost methodology, reported alongside published API rates. Local compute cost = DGX Spark capex amortized over a 3-year useful life + measured power draw × wall-clock × FL residential kWh rate. All assumptions pinned and stated; a dual view (individual-developer vs production-scale) is deferred to v0.3 for the procurement audience.
  4. Per-task split between schema-adherence and extraction quality. Task C reports two metrics, not one: JSON validity rate (capability A) and entity F1 conditional on validity (capability B). Splitting prevents conflating "weak at JSON" with "weak at extraction" — which the leaderboard would otherwise do silently, because invalid JSON gates everything else.
  5. A v0.2 spec critique round before any code was written. Eleven methodology items surfaced in a structured peer-review pass — judge contamination, pairwise design ambiguity, the 70B-on-Task-A throughput problem, scoring conflation, sample-size power, contamination acknowledgement, and others. Nine applied directly to the spec; three deferred to v0.3 as post-data-collection decisions. The spec critique is the artifact most worth pointing senior reviewers at; it shows the methodology bend-tested before any number was computed.
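For readers unfamiliar with the ranking step in decision 2, here is a minimal pure-Python sketch of the Bradley-Terry MM iteration from Hunter (2004): each strength is updated to a model's total wins divided by the sum, over its opponents, of comparisons played over the two current strengths, then renormalized. The function name and the toy win counts are invented for illustration; ties are not handled, and the sketch assumes every model has at least one win and one loss.

```python
def bradley_terry_mm(wins, n_iter=1000, tol=1e-10):
    """Fit Bradley-Terry strengths by the MM update.

    wins[i][j] is the number of times model i beat model j (diagonal zero).
    Returns strengths normalized to sum to 1. Assumes every model has at
    least one win and one loss; ties are not modeled.
    """
    n = len(wins)
    p = [1.0 / n] * n
    total_wins = [sum(row) for row in wins]
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            # Sum over opponents of (comparisons between i and j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins[i] / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


# Toy counts for three hypothetical models, 10 comparisons per pair.
toy_wins = [
    [0, 7, 8],
    [3, 0, 6],
    [2, 4, 0],
]
strengths = bradley_terry_mm(toy_wins)
ranking = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
print(ranking)  # [0, 1, 2]
```

Because every pair contributes to the fit, the reference model gets a strength like any other candidate, which is the property an everyone-vs-Haiku design lacks.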

Pre-flight and the held-out lock

The held-out discipline is what separates a benchmark from a defensible eval, so the mechanism is worth describing concretely:

  • Held-out items live in data/holdout/, separate from data/dev/.
  • The runner refuses held-out access without an explicit --include-held-out flag and verifies the set against a committed holdout.sha256 manifest before scoring, rejecting any set that changed since it was locked. This gate is implemented and enforced today, with tamper-detection tests (eval-harness #1).
  • The real Set-1 holdout.sha256 is committed to git before any prompt iteration begins, at corpus pull (Phase 1); a sample lock ships now to demonstrate the mechanism.
  • The final-scoring run then commits results alongside the unchanged hash file, so any reviewer can verify (a) the hash hasn't changed since the pre-iteration commit, (b) the runner invocation logs include the held-out flag only on the final run.

The proof is in the commit history, not the author's word.

Pre-flight scripts (five, stdlib-only) close the rest of the loop before Phase 1 runs: a timing benchmark on the DGX Spark to confirm the 70B-Task-A tier split, a length-stratified sampler for Task A, a category distribution check that drops any category with fewer than 20 articles (dropping rather than upsampling, because upsampling biases macro-F1), an API cost estimator that reports projected spend to the dollar and wall-clock hours under the rate limit, and an annotation validator that catches schema violations before they enter the eval set.
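The category distribution check is the simplest of these to illustrate. A stdlib-only sketch, assuming each article is a dict with a `category` field (the field name and function name are hypothetical; the threshold of 20 is from the text):

```python
from collections import Counter

MIN_ARTICLES_PER_CATEGORY = 20  # threshold stated in the text


def filter_rare_categories(items, min_count=MIN_ARTICLES_PER_CATEGORY,
                           label_key="category"):
    """Drop every item whose category has fewer than min_count examples.

    Returns (kept_items, dropped_counts) so the dropped categories can be
    reported alongside the results instead of disappearing silently.
    """
    counts = Counter(item[label_key] for item in items)
    dropped = {cat: n for cat, n in counts.items() if n < min_count}
    kept = [item for item in items if item[label_key] not in dropped]
    return kept, dropped


# Synthetic example: "science" falls one article short of the threshold.
articles = (
    [{"category": "politics"}] * 25
    + [{"category": "science"}] * 19
    + [{"category": "sports"}] * 20
)
kept, dropped = filter_rare_categories(articles)
print(len(kept), dropped)  # 45 {'science': 19}
```

Returning the dropped counts keeps the exclusion auditable, in the same spirit as the rest of the pre-flight checks.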

What the routing-decision framework will answer

The leaderboard alone is a snapshot — a model-by-model table at a single point in time. The artifact worth shipping is a routing-decision framework that the leaderboard feeds: for each pipeline stage, the framework will name the model to route to, the conditions under which to do it, the escalation rule when the model's confidence drops, and the failure mode that disqualifies a model regardless of its headline score. Those specific calls — which model, what threshold, what escalation, what kill condition — get committed only after Phase 1 runs and the held-out numbers come back. Until then, the methodology is what's defensible; the decisions live downstream of the data.

Reflections

  • The methodology page is the deliverable; the leaderboard numbers are the worked example. A leaderboard with weak methodology is forgettable. A methodology that holds up under reviewer pressure is the part that gets cited. The site at evals.kristenmartino.ai leads with methodology and treats the leaderboard as the supporting evidence — not the other way around.
  • Pre-execution scoping has been the highest-leverage time spent. The v0.2 critique round caught nine real methodology issues that would have invalidated specific claims if shipped. The cost of catching them after Phase 1 ran would have been a complete rerun. Rigor up-front is cheaper than rework downstream.
  • The harness is reusable by design — adapters and tasks are clean modules. A ModelAdapter is a Protocol with one method (complete(prompt, params) → Completion); a task is a self-contained module that exports a prompt template, a parser, and a scorer. Swapping tasks/ and pointing at a new dataset is what makes this harness work for GridPulse, Tarazu, or any other ML product without re-engineering the runner.
  • Cost methodology is the most-probed section in any technical review pass. Anyone with infrastructure intuition asks first about cost amortization and second about anything else. Pinning DGX Spark capex, FL kWh rate, measured power draw, and useful-life assumption — and stating each as an explicit number rather than a hand-wave — is what makes the cost view defensible.

Closing observation

The framing question isn't "how does open-weight compare to frontier on a benchmark" — that's published already, in many forms, none of them actionable. The framing question is "for this real production system, where does open-weight already meet the bar, and where does it not." The eval-harness is the apparatus that answers that question for Sift specifically — and the methodology, adapter abstractions, and decision-framework structure are what make the same apparatus reusable for the next production system that needs the same answer.

The deliverable a hiring manager remembers is the methodology. The numbers come from running the apparatus the methodology describes.