Skip to content
← Back to work
Case Study / Quantization Study

Quantization Study

Controlled experiment on Llama 3.1 8B at FP16 / Q8_0 / Q4_K_M — paired design across MMLU and CoNLL-2003 NER (n=2,400). Decision framework, not a leaderboard.

3 arms × 800 examples (n=2,400)Paired bootstrap + Holm correctionQ8_0 ≡ FP16 (TOST, ±1pp margin)Q4_K_M: −3.2pp micro-F1 on NER (p_adj=0.032)Source ↗
Translation artifact
2026Solo build
01Problem
Situation: Model quantization is a routine production decision: ship at FP16 and pay for the GPU memory, or ship at Q4 and absorb some quality loss for cheaper, faster inference. Most teams pick by feel.
Complication: Published quantization evaluations report whole-benchmark scores ("Q4 retains 96% of FP16's MMLU") — not actionable for a PM choosing what precision to ship for a specific workload. The published intuition that "Q4 is fine for pattern-matching, hurts on reasoning" turned out to be wrong in this study's data.
Question: Can a single-model, paired-design experiment produce a decision framework that picks the right precision per workload — and characterize how quantization fails, not just whether it does?
02Requirements
  • PM choosing a quantization level

    Cost/quality framing per workload, not a leaderboard. "Ship Q8_0 unless you have a specific reason not to" beats a four-decimal benchmark score.

    The deliverable is a decision framework with explicit pick-Q8 / pick-Q4 / stay-at-FP16 conditions.

  • Skeptical reviewer

    Paired CIs, multiple-comparison correction, and real effect sizes — not point estimates that hide the variance.

    Wilson + paired bootstrap + McNemar + Holm-Bonferroni; CIs reported on every effect size.

  • Anyone reproducing the result

    Deterministic sampling, fixed seed, single-machine reruns; per-example raw outputs shipped alongside summary stats.

    n=2,400 raw JSONL outputs committed; temperature=0; round-robin schedule debiases tok/sec.

  • Failure-mode characterization

    Not just "Q4 is worse" but how it fails — so downstream teams know what to guard against.

    Q4 emits well-formed JSON at a slightly higher parse rate than FP16, but over-extracts — the NER hit is a precision loss (−4.1pp), not recall, concentrated on entity-free sentences.

03Decision

Paired design across two contrasting tasks, paired bootstrap + Holm-corrected pairwise effects, cluster-bootstrap on subjects, decision framework as deliverable

chosen
  • meets criterion: Statistical rigor on paired data
  • meets criterion: Decision-making utility for PMs
  • meets criterion: Reproducibility on commodity hardware
  • meets criterion: Scope honesty

Pairing same items across arms exploits the within-example correlation that independent-sample tests discard — n=499 paired with σ_diff=0.30 detects ≥3.8pp at 80% power, materially tighter than the same n unpaired. Two tasks (MMLU = reasoning/classification; CoNLL-2003 NER = structured extraction) catch the failure mode that a single benchmark hides: Q4's NER regression isn't visible in MMLU. Holm-Bonferroni per task (3 pairwise tests each) controls family-wise error without the over-correction of cross-task Bonferroni. Cluster-bootstrap on subjects for the MMLU overall CI accounts for within-subject correlation that the iid bootstrap underestimates. The decision framework — pick Q8 by default, pick Q4 under specified constraints, stay at FP16 in other specified cases — is what a PM actually needs from this kind of study; leaderboard numbers are not.

Single benchmark (MMLU only), one summary number per arm

  • partially meets criterion: Statistical rigor on paired data
  • does not meet criterion: Decision-making utility for PMs
  • meets criterion: Reproducibility on commodity hardware
  • partially meets criterion: Scope honesty

Two benchmarks, independent-sample CIs across arms

  • partially meets criterion: Statistical rigor on paired data
  • partially meets criterion: Decision-making utility for PMs
  • meets criterion: Reproducibility on commodity hardware
  • meets criterion: Scope honesty
04Solution

Three-arm paired controlled experiment with round-robin scheduling, statistical methodology built for paired data, and a decision framework — not a benchmark leaderboard — as the deliverable.

Paired design with round-robin scheduling
Every example scored across all three arms before moving to the next. Warmups for all arms precede the timed loop; arm order interleaves per example. Debiases tok/sec against thermal and daemon drift that sequential-arm execution silently bakes in. Same items, same prompts, same seed (42).
Two tasks chosen for contrast
MMLU (knowledge/reasoning, four-way multiple choice, stratified across 10 subjects, mechanically scored from the first A–D character) and CoNLL-2003 NER (structured extraction, span-F1 with type-match required, malformed JSON scored as 0). The contrast is the point: one classification task and one structured-extraction task surface different failure modes.
Statistical methodology built for the paired design
Wilson 95% CI for MMLU accuracy; corpus micro-F1 (the canonical CoNLL metric) as the primary NER number, with per-sentence macro-F1 reported alongside as a brittleness view. Pairwise differences via paired bootstrap; p-values via McNemar (binary) or paired bootstrap centered under H0. Holm-Bonferroni per task. Cluster-bootstrap on subjects for the MMLU overall CI. Equivalence (Q8≈FP16) stated as a TOST result against a specified ±1pp practical-equivalence margin, not as the absence of a significant difference.
Decision framework as deliverable
The output is not "Q8 = 0.587 accuracy." The output is "pick Q8_0 as the default; pick Q4_K_M when memory or latency is binding and the workload doesn't include structured information extraction; stay at FP16 when accuracy is the binding constraint." Each pick condition tied to a specific finding with a CI and a corrected p-value.
05Outcome
  • Sample size

    n=2,400 raw outputs

    3 arms × (500 MMLU + 300 NER) examples · committed alongside summary stats

  • Q8_0 vs FP16

    Practically equivalent (TOST, ±1pp)

    MMLU p_TOST=0.019 · NER macro p_TOST=0.002 · a positive equivalence claim, not non-significance

  • Q4_K_M vs FP16 on NER

    −3.2pp micro-F1 (significant)

    canonical CoNLL micro-F1, 95% CI [0.7, 5.7pp], p_adj=0.032 · −5.0pp on per-sentence macro-F1 (p_adj=0.003)

  • Throughput / memory at Q8_0

    1.8× FP16 / 0.53× memory

    Q4_K_M: 2.5× throughput / 0.31× memory · MacBook M4 Max

Overview

Most quantization evaluations report whole-benchmark scores — "Llama 3.1 8B at Q4 retains 96% of FP16's MMLU" — and stop there. That's a leaderboard number. A PM choosing what precision to ship for a feature needs the next layer of question: for which workloads is that 4% loss concentrated, and which workloads pass through unaffected?

This study runs three quantization arms (FP16, Q8_0, Q4_K_M) of Llama 3.1 8B Instruct across two contrasting tasks — MMLU and CoNLL-2003 NER — under a paired design, with the same items, prompts, and seed across every arm. The deliverable is a decision framework, not a leaderboard.

RoleStrategy, experimental design, statistical analysis, writing
Year2026
DomainApplied LLM evaluation
StackPython · Ollama · llama.cpp Metal backend · NumPy · SciPy · HuggingFace datasets
HardwareMacBook Pro · Apple M4 Max · 36 GB unified memory
StatusShipped

Problem framing

Three observations shape the design:

  1. Quantization is a real production decision, and most teams pick by feel. Ship at FP16 and pay for the GPU memory; ship at Q4 and absorb some quality loss for cheaper, faster inference. The cost side is concrete and easy to measure. The quality side is where the hand-waving happens.
  2. Published evaluations are leaderboards, not decision frameworks. "Q4 retains 96% of FP16's MMLU" is one number per arm and doesn't separate the workloads where the loss is concentrated from the workloads where it isn't.
  3. Independent-sample comparisons throw away variance reduction the design is sitting on. Same items run through every arm means the per-example difference is paired data — exploit that with paired CIs and you get materially tighter effects at the same sample size.

Design

Three arms. FP16 (control), Q8_0, Q4_K_M. All pulled from Ollama: llama3.1:8b-instruct-{fp16|q8_0|q4_K_M}.

Two tasks chosen for contrast.

  • MMLU — knowledge/reasoning, four-way multiple choice. Stratified subset of 500 questions across 10 subjects (STEM, humanities, professional). Mechanically scored: parse the first A–D character from the output.
  • CoNLL-2003 NER — structured information extraction. 300 sentences from the test split, model returns a JSON array of (entity, type) pairs, scored by exact span match with type agreement. Reported two ways: corpus micro-F1 (the canonical CoNLL/conlleval metric — pool TP/FP/FN, then F1) as the headline, and per-sentence macro-F1 (mean of per-example F1) as a brittleness view. The gap between them turns out to localize the failure mode.

The contrast is the point. MMLU is closer to "did the model retain the knowledge"; NER is closer to "did the model produce the structured artifact the downstream system expects." Different failure modes surface in different tasks.

Paired design with round-robin scheduling. Every example scored across all three arms before moving on. Warmups for all three arms precede the timed loop, so cold-start latency doesn't bias whichever arm runs first. Without round-robin, an hour of one arm's runtime collects thermal and macOS-daemon-load drift that distorts the tok/sec column.

Deterministic sampling. Temperature 0, fixed seed (42), identical prompts and system messages, max-tokens task-appropriate. Held constant across arms.

Headline result

Quality vs throughput — paired Pareto across three quantization arms
Pareto plot of three quantization arms: FP16 and Q8_0 sit at near-identical quality, with Q8 at roughly 1.8× the throughput. Q4_K_M extends the frontier on throughput but loses quality on NER.

Q8_0 is practically equivalent to FP16 — a positive claim, not just "not significant." Against a specified ±1pp practical-equivalence margin, a TOST clears equivalence on MMLU accuracy (Δ = −0.2pp, p_TOST = 0.019) and on NER per-sentence F1 (Δ = +0.0003, p_TOST = 0.002), and the canonical NER micro-F1 95% CI ([−0.9, +0.2]pp) sits entirely inside ±1pp. All this at ~1.8× the throughput and half the memory footprint.

Q4_K_M loses 3.2pp on canonical corpus micro-F1 (95% CI: [0.7, 5.7pp]; Holm-adjusted p = 0.032), and 5.0pp on per-sentence macro-F1 (CI: [2.1, 8.1pp]; p = 0.003) — significant under both estimands, so the degradation is robust to metric choice. You buy that for ~40% more throughput and an ~3× smaller footprint. The MMLU regression (1.6pp) does not reach significance.

The failure mode

The gap between the two F1 metrics is the tell. Q4's NER hit is over-extraction, not missed entities — decomposing the corpus counts, it loses 4.1pp of precision but only 0.9pp of recall versus FP16. It invents entities rather than dropping them.

That over-extraction concentrates on entity-free sentences. Of the 72 sentences (24% of the corpus) whose correct answer is the empty set, FP16 and Q8_0 correctly return [] ~60% of the time; Q4 only 37.5% — it hallucinates a spurious entity on the rest. This is exactly why macro-F1 (5.0pp) exceeds micro-F1 (3.2pp): per-sentence averaging gives each empty sentence full weight, so one hallucination tanks the whole sentence to zero, while corpus micro-F1 dilutes the same spurious entities across the entity pool.

It matters downstream because it's structurally invisible: Q4 has the highest JSON parse rate of the three arms (99.7%), so schema validation passes — the content is just wrong. Whatever checks the entities has to check the entities; there's no cheaper proxy.

The published intuition that "Q4 is fine for pattern-matching, hurts on reasoning" doesn't hold here. NER is a structured pattern-extraction task, and it's where Q4 visibly degrades; MMLU (more reasoning-heavy) shows a numerically smaller Q4 regression that doesn't reach significance. In this run the degradation reads as a precision/calibration failure — Q4 over-commits to entities, especially where there are none — rather than a reasoning-difficulty one. That's an observed pattern in this data, not a claim about the model's internals.

Statistical methodology

The paired design earns its keep in the analysis:

  • Per-arm CIs. Wilson 95% interval for MMLU accuracy; bootstrap CI on the canonical corpus micro-F1 for NER (resample examples, re-pool TP/FP/FN), with per-sentence macro-F1 reported alongside.
  • Pairwise differences. Paired bootstrap on per-example deltas (macro / accuracy) and on the pooled micro-F1 difference. Same items in both arms means the difference is a real number whose variance is reduced by the pairing.
  • p-values. McNemar's test for binary outcomes (exact binomial for ≤25 discordants; Yates continuity-corrected χ² otherwise); paired bootstrap centered under H0 for the F1 estimands.
  • Equivalence ≠ non-significance. "Q8_0 ≈ FP16" is a TOST (two one-sided tests) against a specified ±1pp practical-equivalence margin — equivalently, the 90% CI lying inside ±0.01. A non-significant difference test would not establish equivalence; the TOST does.
  • Multiple-comparison correction. Holm-Bonferroni applied per task across the three pairwise tests. Per-task framing matches the reader's per-section interpretation; Holm is uniformly more powerful than Bonferroni and doesn't require the BH PRDS assumption (which paired tests within a task violate).
  • Cluster-bootstrap on subjects for the MMLU overall CI. Examples within an MMLU subject are correlated — questions in professional_medicine look like other professional_medicine questions. The iid bootstrap underestimates variance; cluster-bootstrap resamples subject IDs with replacement and pulls all examples from chosen subjects, giving the correct width.
  • Provenance. A committed experiment_manifest.json records the per-arm weight-blob SHA256s, modelfiles, Ollama version, and host — so "only quantization varied across arms" is backed by recorded artifact identity rather than assumed.

Power. At n=499 aligned per arm on MMLU, the design detects ≥3.8pp accuracy differences at 80% power, α=0.05 (paired diff SD = 0.30 observed). At n=300 per arm on NER, ≥4.3pp per-sentence F1 (paired diff SD = 0.27 observed). Both post-hoc verified on the actual data.

Where the effect concentrates

The Q4-vs-FP16 gap on MMLU is concentrated in miscellaneous (−6.0pp) and professional_medicine (−6.0pp), with smaller regressions in moral_scenarios, abstract_algebra, and machine_learning (each −4.0pp). Several subjects move the other way — professional_law (+4.0pp), college_computer_science (+2.0pp), high_school_us_history (+2.0pp) — within the noise band for n=50/subject.

None of the per-subject pairwise differences survives Holm correction within the per-subject family of three tests. This is suggestive heterogeneity worth flagging, not replicable evidence of subject-specific quantization sensitivity. A properly powered subject-level study would need ~3× the per-subject sample size. The pattern is consistent with the published intuition that quantization hits broad-knowledge recall harder than narrow reasoning, but the data here can only suggest it.

A decision framework

What a PM actually needs from a study like this:

Pick Q8_0 — the default.

  • Practically equivalent to FP16 on both tasks (TOST, ±1pp), at ~1.8× throughput. The strongest evidence in the study is for Q8_0 ≡ FP16.
  • Right call when a single model serves heterogeneous workloads.
  • Right call when reasoning depth matters (multi-step, legal, medical) — MMLU shows Q8 holds.

Pick Q4_K_M only when:

  • Throughput is the binding constraint and a measurable F1 hit on structured extraction is acceptable downstream (human review, retry logic, low-stakes outputs).
  • Memory is the binding constraint: Q4_K_M's ~5 GB footprint vs FP16's ~16 GB matters on memory-limited machines (laptops, edge, multi-tenant serving).

Avoid Q4_K_M for structured information extraction. The NER gap is significant under both F1 estimands (3.2pp micro / 5.0pp macro), and it comes from over-extraction — Q4 emits well-formed JSON that adds spurious entities, harder to catch downstream than malformed output: schema validation passes, the content is wrong.

Stay at FP16 when:

  • Accuracy is the binding constraint (regulated outputs, safety-critical, eval ground truth).
  • The throughput delta from Q8 (~1.8×) doesn't move unit economics — inference cost isn't the constraint.

Reflections

  • The single most consequential design choice was pairing. Same items, prompts, and seed across all three arms means the per-example difference is a real number whose variance is reduced by the pairing. Published Q4 evaluations that report unpaired benchmark numbers throw away the variance reduction the design hands them for free. Pairing is what let the NER effect clear significance under the canonical metric (3.2pp micro-F1, p_adj=0.032) at n=300.
  • Round-robin scheduling matters more than it looks. The first draft of the harness ran arms sequentially (all of FP16, then all of Q8, then all of Q4). A senior review pass surfaced the failure mode: thermal drift and macOS daemon load on a laptop are real, and an hour of one arm's runtime collects enough drift to distort the tok/sec column by single-digit percentage points. Round-robin interleaves arms per example; the throughput numbers in the writeup are the cleaner ones.
  • Multiple-comparison correction is where rigor pays. Three pairwise tests per task, two tasks — six tests total. Without correction, one of the marginal MMLU pairs would have been written up as suggestive. With Holm, it's correctly not significant. The corrected p-values are the ones in the README; the uncorrected ones aren't shown.
  • The failure mode is the lede, not the headline number. "Q4 loses 3.2pp on NER" is the number. "Q4 over-extracts — it hallucinates entities on entity-free sentences, so schema validation downstream won't catch it" is the finding. The second is what changes how a team using Q4 should think about their validation pipeline. Putting it in the TL;DR was the right call.
  • A second senior review caught a real metric error — and fixing it made the result stronger, not weaker. I had labeled the NER score "canonical CoNLL metric," but the code computed per-sentence (macro) F1; the canonical conlleval metric is corpus micro-F1. Recomputed from the committed outputs, the Q4 drop is 3.2pp micro (not 5.0pp), still significant — and reporting both metrics is what surfaced the actual mechanism: the macro/micro gap is the over-extraction-on-empty-sentences signal. I also turned "statistically indistinguishable" into a real TOST equivalence claim, added a provenance manifest (model digests) to back the causal framing, and verified one flagged "bug" was a non-issue (the MMLU parser misfire would have corrupted the numbers — checked all 1,500 outputs first). Every corrected number was recomputed from the committed data, with a CI job that now re-derives the summary so the writeup can't silently drift from the evidence.

Closing observation

The unit of value isn't the benchmark score. The unit of value is the decision framework with specific findings tied to specific pick-conditions. A PM reading the TL;DR can pick Q8_0 confidently for most workloads, knows the one workload class to avoid Q4 on, and has a concrete description of how Q4 fails when it does. That's a more useful artifact than a four-decimal benchmark table — and it's the format the work was actually trying to produce.