
Valuate

AI-augmented DCF agent — extracts financial line items from SEC 10-Ks via XBRL and Claude, then runs a Monte Carlo valuation with cell-level source attribution.

10 S&P 500 tickers · Two-track extraction (XBRL + LLM) · Source quotes + HITL review · 10,000 Monte Carlo iterations
2026 · Solo build
01 · Problem
Situation: Automated valuation systems claim to read 10-Ks and produce DCF models. The math is trivial — fifty lines of code — but reliable line-item extraction from filings written for human readers is the actual bottleneck.
Complication: Most automated-valuation demos work on clean industrial mid-caps without saying so. Filer inconsistency in XBRL tagging is the rule, not the exception — Apple uses one revenue concept, Caterpillar another; some filers don't tag operating income at all. Black-box extraction at this scale produces fair-value estimates a finance reviewer cannot verify against the underlying document.
Question: Can an extraction agent acknowledge its scope and surface its uncertainty — rather than hiding both?
02 · Requirements
  • Finance reviewer

    Per-cell source attribution back to the filing — verbatim quotes for each Claude-extracted line item, no bare numbers.

HITL review surface makes verification one click

  • Coverage breadth

Handle the ~30% of clean-reporting S&P 500 filers that have at least one untagged or non-canonical line item — without failing the request.

    3 of 10 hand-picked tickers needed Track B or derivation

  • Audit trail

    Persist user overrides as first-class data with source=USER_OVERRIDE so a corrected valuation traces back to who corrected what.

  • Modeling latency

Sliders adjusting growth, margin, terminal growth, and WACC must recompute the Monte Carlo run and sensitivity grid faster than human reaction time.

    Full recompute in under 200 ms

03 · Decision

Two-track extraction with deterministic derivation backstop and HITL review surface

Chosen
  • meets criterion: Coverage on clean filers
  • meets criterion: Verifiability per cell
  • meets criterion: Failure transparency
  • partially meets criterion: Build effort

XBRL queries first (Track A), Claude over the 10-K HTML where Track A leaves gaps (Track B), accounting-identity derivation as the fallback (operating income from income_before_tax + interest_expense; total liabilities from total_assets − shareholders_equity). Every value carries provenance — XBRL concept, verbatim quote, or formula — and the HITL surface makes verification one click instead of a manual hunt. The 3× build cost is justified by the demand: a fair-value estimate without provenance is not actionable for the finance reviewer who has to defend the number.

XBRL-only extraction (canonical concepts, fail on missing)

  • partially meets criterion: Coverage on clean filers
  • meets criterion: Verifiability per cell
  • does not meet criterion: Failure transparency
  • meets criterion: Build effort

LLM-only over the full 10-K HTML

  • partially meets criterion: Coverage on clean filers
  • meets criterion: Verifiability per cell
  • partially meets criterion: Failure transparency
  • meets criterion: Build effort
04Solution

An extraction agent that runs XBRL first, Claude over the 10-K HTML for gaps, and accounting-identity derivation as the backstop — with cell-level source attribution and a one-click HITL verification surface.

Track A — XBRL company facts
SEC pre-tagged JSON queried for canonical us-gaap concepts. Restatements deduplicated by period-end date (the gotcha: filing-fiscal-year groups three years of comparatives into one slot). Returns whatever it can find and hands the gaps to Track B.
Track B — Claude over 10-K HTML
Item 8 (Financial Statements) sliced by anchor pattern, sent with a confidence-calibrated prompt, parsed into the same LineItem schema. Every value carries a verbatim 5–30 word source quote. Static prefix marked for prompt caching.
Derivation backstop
Accounting identities for fields neither track tagged (operating income for filers like JNJ/NKE; total liabilities for filers like NKE/KO). source=DERIVED carries a synthetic quote describing the formula — provenance survives the inference.
HITL review + override persistence
Low-confidence items (<0.80) and balance-sheet identity violations (>50bps) flagged in the surface. Overrides persist as first-class LineItem entries with source=USER_OVERRIDE and re-trigger validation on each write.
05 · Outcome
  • Universe

    10 S&P 500 tickers

    Clean industrial filers; banks/insurers/E&P intentionally V2

  • Track-B + derivation

    3 of 10 needed

    Even hand-picked filers leave gaps Track A can't fill

  • Recompute

    <200 ms

10,000 Monte Carlo iterations + 7×7 sensitivity grid

  • Provenance

    Per-cell, every value

    XBRL concept, verbatim quote, or formula

Overview

Most "AI reads financial statements" demos quietly limit themselves to clean industrial mid-caps with standard reporting — without saying so. The hard part of automated valuation is not the math, it's getting reliable structured data out of filings written for human readers. Valuate makes that scope choice explicit and builds verification into the agent flow rather than hiding extraction errors.

The product extracts line items from a company's most recent 10-K, lets the user adjust forward-looking assumptions, and produces a Monte Carlo DCF valuation with cell-level source attribution back to the filing.

RoleStrategy, design, and engineering (frontend + backend)
Year2026
DomainAI-assisted financial analysis
StackNext.js · FastAPI · LangGraph · Claude (claude-sonnet-4-6) · SEC EDGAR
StatusShipped

Problem framing

Three observations shape the design:

  1. The extraction problem is the bottleneck, not the modeling. A textbook DCF takes ~50 lines of code. Producing it from a real 10-K requires reliably mapping each filer's idiosyncratic XBRL tags or HTML structure to a canonical schema — and that's where most automated-valuation systems quietly fail.
  2. Black-box extraction is unverifiable. A fair-value estimate that arrives without source attribution is not actionable. Whether the number is reasonable depends on whether each line item it rests on came from where the user expected.
  3. The universe of edge cases is the universe. Banks, insurers, REITs, and energy E&P companies all break the standard mid-cap-industrial template — sometimes by reporting on fundamentally different line items (banks have interest-spread P&Ls, REITs have real-estate-at-cost balance sheets), sometimes by needing a different valuation entirely (E&P reserves deplete, so a Gordon-growth terminal is conceptually wrong even though the line items match). Pretending one extraction-and-valuation pipeline works for all of them is the standard demo's failure mode.

Solution

The agent extracts each line item through one of two tracks, with confidence calibration and a deterministic derivation step for fields that neither track can fill. At ingest time the filer's SIC code routes the rest of the pipeline through one of five industry paths — industrials/tech, banks, insurers, REITs, and energy E&P — each with its own valuation flavor and (for the first four) its own Pydantic schema variant on the wire. Industrials and tech share the canonical schema and a 5-year FCFF DCF; the other four are described below.

Track A — XBRL company facts

SEC filers tag financial statements with us-gaap concepts (Revenues, OperatingIncomeLoss, NetIncomeLoss, ...). Track A queries SEC's pre-tagged company-facts JSON, deduplicates restatements by matching on period-end date, and returns whatever it can find. Filer inconsistency is the rule, not the exception — Apple uses RevenueFromContractWithCustomerExcludingAssessedTax, Caterpillar uses ProfitLoss for net income, Google reports only Depreciation rather than a combined D&A tag. The canonical-concept map carries multiple alternates per logical line item, and per-industry maps cover the divergent vocabulary (banks use InterestIncomeExpenseNet and the post-CECL FinancingReceivableExcludingAccruedInterestAfterAllowanceForCreditLoss instead of revenue / total loans).

Track B — Claude over the 10-K HTML

Where Track A leaves a field unfilled, Track B runs. The 10-K's Item 8 (Financial Statements) section is sliced out by anchor pattern, sent to Claude with a confidence-calibrated extraction prompt, and parsed into the same LineItem schema. Every value Claude returns carries a verbatim 5–30 word source quote from the filing — visible in the human-in-the-loop review surface for one-click verification. The static system prompt is marked for prompt caching so subsequent extractions amortize the prefix cost.
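A minimal sketch of what a `LineItem` carrying per-cell provenance could look like — field names are assumed, and a stdlib dataclass stands in for the project's Pydantic model:

```python
# Illustrative LineItem shape (field names are assumptions): every value
# carries its source and a verbatim quote so the HITL surface can render
# one-click verification per cell.
from dataclasses import dataclass
from enum import Enum

class Source(str, Enum):
    XBRL = "XBRL"                    # Track A: pre-tagged company facts
    LLM = "LLM"                      # Track B: Claude over Item 8 HTML
    DERIVED = "DERIVED"              # accounting-identity backstop
    USER_OVERRIDE = "USER_OVERRIDE"  # persisted reviewer correction

@dataclass
class LineItem:
    field: str
    value: float
    source: Source
    source_quote: str        # verbatim 5-30 word quote, or formula text
    confidence: float = 1.0  # Track B items carry calibrated confidence

item = LineItem(
    field="operating_income",
    value=123_216e6,
    source=Source.LLM,
    source_quote="Operating income was $123,216 million in fiscal 2025",
    confidence=0.93,
)
```

Keeping overrides and derivations in the same shape as extracted values is what lets "provenance survives the inference" hold everywhere downstream.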

Derivation backstop

A small set of accounting-identity fallbacks runs between Track B and composition. Some filers don't tag a field at all — JNJ and NKE don't report a separate operating income line; NKE and KO don't expose a total-liabilities tag. Rather than fail the request, derive: operating income from income_before_tax + interest_expense; total liabilities from total_assets − shareholders_equity. Both write source=DERIVED with a synthetic source quote describing the formula, so provenance survives the inference and the HITL surface can flag them for review.
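The two identities above, as a hedged sketch (function and key names assumed):

```python
# Sketch of the accounting-identity backstop. Each derivation carries a
# synthetic quote describing the formula, standing in for source=DERIVED
# provenance. Names are illustrative, not the project's code.
def derive_missing(items: dict[str, float]) -> dict[str, tuple[float, str]]:
    derived = {}
    if ("operating_income" not in items
            and {"income_before_tax", "interest_expense"} <= items.keys()):
        value = items["income_before_tax"] + items["interest_expense"]
        derived["operating_income"] = (
            value, "derived: income_before_tax + interest_expense")
    if ("total_liabilities" not in items
            and {"total_assets", "shareholders_equity"} <= items.keys()):
        value = items["total_assets"] - items["shareholders_equity"]
        derived["total_liabilities"] = (
            value, "derived: total_assets - shareholders_equity")
    return derived

# NKE/KO-style gap: no total-liabilities tag, but the identity closes it
print(derive_missing({"total_assets": 37_143e6, "shareholders_equity": 14_430e6}))
```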

Composition and validation

The Company is composed at the end, after both tracks plus derivation have run. Required fields that even derivation can't reach raise an explicit error. Validation flags low-confidence items (less than 0.80) and balance-sheet identity violations (over 50bps). Overrides are persisted as LineItem entries with source=USER_OVERRIDE and re-trigger validation on each write.
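The two validation checks might look like this (names, thresholds as stated above, and the return shape are illustrative):

```python
# Hedged sketch of the validation pass: flag Track B items below the 0.80
# confidence floor, and flag the balance sheet when assets differ from
# liabilities + equity by more than 50 bps of total assets.
def validation_flags(items: dict[str, float],
                     confidences: dict[str, float]) -> list[str]:
    flags = [f"low_confidence:{field}"
             for field, conf in confidences.items() if conf < 0.80]
    gap = abs(items["total_assets"]
              - (items["total_liabilities"] + items["shareholders_equity"]))
    if gap / items["total_assets"] > 0.005:  # 50 bps
        flags.append("balance_sheet_identity")
    return flags

print(validation_flags(
    {"total_assets": 100.0, "total_liabilities": 60.0, "shareholders_equity": 39.0},
    {"revenue": 0.95, "operating_income": 0.70},
))
```

Because overrides re-trigger this pass on each write, a corrected item clears (or re-raises) its flags immediately.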

Modeling and Monte Carlo

Once the line items are in place, the user adjusts assumptions on sliders — revenue growth, operating margin, terminal growth, WACC — and a 5-year three-statement projection, 10,000 Monte Carlo iterations, and a 7×7 sensitivity grid recompute in under 200 ms. The Monte Carlo distribution and sensitivity heatmap are surfaced as Recharts visualizations alongside the per-share fair value. That FCFF flow fits industrials and tech.
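As an illustration of why the sub-200 ms budget is plausible, a vectorized Monte Carlo over the two headline rate assumptions — the distributions, parameter values, and single-stage simplification here are all assumptions, not the project's model:

```python
# Illustrative sketch only: sample WACC and terminal growth, value the
# whole firm with a single-stage Gordon FCFF for brevity, and return the
# per-share distribution. The real pipeline projects 5 years first.
import numpy as np

def monte_carlo_fair_value(fcff0: float, wacc_mu: float, g_mu: float,
                           shares: float, n: int = 10_000,
                           seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    wacc = rng.normal(wacc_mu, 0.01, n)
    g = rng.normal(g_mu, 0.005, n)
    g = np.minimum(g, wacc - 1e-3)      # keep the Gordon terminal finite
    ev = fcff0 * (1 + g) / (wacc - g)   # enterprise value per draw
    return ev / shares                  # per-share fair-value distribution

dist = monte_carlo_fair_value(fcff0=10e9, wacc_mu=0.09, g_mu=0.025, shares=1.5e9)
print(round(float(np.median(dist)), 2))
```

One vectorized pass over 10,000 draws runs in single-digit milliseconds, which leaves the rest of the 200 ms budget for the projection and the 7×7 grid.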

Industry-specific valuation flavors

The other four industries each get their own valuation, dispatched on period.industry in compute_projection. The same Assumptions shape is reused across all of them — the frontend relabels the sliders to match the formula's variables, so users see "Cost of equity" instead of "WACC" on the bank workspace, etc.

  • Banks — Gordon dividend-discount model, P = D₀(1 + g) / (r − g). Banks have no "operating margin" in the industrial sense; the economic story is interest spread net of credit costs. The cost-of-equity and dividend-growth sliders replace WACC and terminal growth.
  • Insurers — justified price-to-book, fair_value/share = book_value/share × (ROE − g) / (r − g). Reserves and the general-account investment portfolio dominate the balance sheet, so book value is the economic anchor.
  • REITs — FFO-multiple Gordon, fair_value/share = FFO/share × (1 + g) / (r − g), where FFO = net income + D&A. GAAP depreciation overstates economic depreciation for well-maintained real estate, so FFO, not GAAP net income, is the conventional REIT earnings measure.
  • Energy E&P — 10-year reserve-life-capped FCFF with no terminal value. Reserves deplete; Gordon-growth-to-infinity is conceptually wrong for an asset that will run out. The revenue-growth slider is relabeled "production growth/decline."

The first three each carry a schema variant on the wire — banks tag net interest income and loans/deposits, insurers tag premiums and reserves, REITs tag a real-estate-at-cost / accumulated-depreciation / real-estate-net trio. E&P is the exception. Revenue, operating income, capex, and D&A are all standard us-gaap concepts even for an E&P filer — the line-item set isn't different, the valuation is — so the architecture supports a "dispatch-only" variant: no schema split, all the variant logic lives in dcf.py and a slider-relabel on the frontend. Sensitivity is hidden client-side for banks / insurers / REITs because their formulas don't read the rev-growth × op-margin axes; for E&P the heatmap stays on, since the FCFF math still uses both.
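The dispatch-only pattern above can be sketched as a single branch on industry. The formulas are the ones quoted in the bullets; the function and input names are assumed:

```python
# Sketch of per-industry valuation dispatch (names are assumptions).
# E&P is omitted here because it reuses the FCFF path with a 10-year
# reserve-life cap rather than a closed-form Gordon variant.
def fair_value_per_share(industry: str, inputs: dict) -> float:
    if industry == "bank":
        # Gordon DDM: P = D0 * (1 + g) / (r - g)
        return inputs["dps"] * (1 + inputs["g"]) / (inputs["r"] - inputs["g"])
    if industry == "insurer":
        # justified P/B: BVPS * (ROE - g) / (r - g)
        return (inputs["bvps"] * (inputs["roe"] - inputs["g"])
                / (inputs["r"] - inputs["g"]))
    if industry == "reit":
        # FFO-multiple Gordon: FFO/share * (1 + g) / (r - g)
        return inputs["ffops"] * (1 + inputs["g"]) / (inputs["r"] - inputs["g"])
    raise ValueError(f"unsupported industry: {industry}")

print(round(fair_value_per_share("bank", {"dps": 4.0, "g": 0.04, "r": 0.10}), 2))
```

Because all three closed forms read the same few scalar inputs, the frontend only has to relabel sliders, not restructure the assumptions payload.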

Implementation considerations

The hardest design problem was making Track A non-fatal. The first version raised an exception whenever any one required field was missing, which meant Track B never got a chance for filers that didn't tag operating income (NKE, JNJ) or didn't tag total liabilities (NKE, KO). The architectural fix was to refactor Track A to return a partial dict and let Track B fill required gaps too. Composition happens at the end, not at Track A's exit.

XBRL restatement handling has a gotcha. Each XBRL data point carries an fy field for the filing's fiscal year — but a 10-K filed for FY2025 reports comparative income statements for FY2024 and FY2023, all tagged fy=2025. Grouping by fy collides three years of data into one slot. Grouping by end date instead is the correct key. This bug would have produced subtly wrong numbers without any visible error, which is the worst kind.
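A minimal sketch of the fix — the `end`, `fy`, `val`, and `filed` keys follow SEC's company-facts JSON, while the function name is illustrative:

```python
# Group XBRL data points by their "end" (period-end) date, not "fy":
# a 10-K filed for FY2025 tags three comparative years all as fy=2025.
# Keeping the most recently filed value per period absorbs restatements.
def dedupe_by_period_end(datapoints: list[dict]) -> dict[str, dict]:
    latest: dict[str, dict] = {}
    for dp in datapoints:
        key = dp["end"]
        if key not in latest or dp["filed"] > latest[key]["filed"]:
            latest[key] = dp
    return latest

points = [
    # FY2025 10-K reports FY2023 comparatives, restated, all tagged fy=2025
    {"end": "2023-12-31", "fy": 2025, "val": 100, "filed": "2026-02-01"},
    {"end": "2024-12-31", "fy": 2025, "val": 110, "filed": "2026-02-01"},
    # the original FY2023 figure from the earlier filing
    {"end": "2023-12-31", "fy": 2024, "val": 99,  "filed": "2025-02-01"},
]
result = dedupe_by_period_end(points)
print(sorted((k, v["val"]) for k, v in result.items()))
```

Grouping on `fy` would have collapsed all three points into one slot; grouping on `end` keeps one value per fiscal period and prefers the restated figure.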

Source attribution is the design move that makes this credible. Every Claude-extracted value carries a verbatim quote from the filing. This makes the HITL review one click, not a manual hunt — and turns the system from a black box into something a finance reviewer can verify against the underlying document.

Reflections

  • Schema-variant industries and dispatch-only ones live in different places in the codebase — and the second category turned out to matter. Banks, insurers, and REITs each carry their own discriminated union per statement (kind = "bank", "insurer", "reit") plus a per-industry XBRL concept map, because the line items they tag are structurally different (no "operating margin" on a bank's P&L; a REIT's balance sheet is dominated by real-estate-at-cost less accumulated depreciation; an insurer's largest line is policy reserves). Energy E&P, by contrast, reports on standard us-gaap — revenue, op income, capex, and D&A all map to the industrial schema — so the variant lives entirely in dcf.py's dispatch and a slider-relabel on the frontend, with no schema split on the wire. The original universe was 10 industrial / tech tickers; all four additional industries landed without a parallel codebase, and each shipped in roughly the same effort because the variant always landed in the right place — schema variant when the data shape differed, dcf.py dispatch when only the valuation math did.
  • Three of the original ten tickers needed Track B or derivation to compose; the four variant tickers all extracted cleanly through Track A alone. XBRL tagging consistency turned out to be worse than the universe size suggests — even among hand-picked clean-reporting filers, ~30% have at least one required line item that's untagged or under a non-canonical concept. JPM, PRU, PLD, and EOG all extracted cleanly via XBRL because their per-industry tags (post-CECL bank tags, life-insurer reserve tags, REIT real-estate tags, and the E&P-specific oil-and-gas-property and depletion tags) are well-standardized within their own taxonomy. The two-track-plus-derivation architecture earns its keep on industrials; the per-industry XBRL maps are why the variant filers compose without needing Claude at all.
  • Persistence shipped most recently. Overrides used to live in process memory; a Railway redeploy wiped them. The repo layer now picks Postgres automatically when DATABASE_URL is present (via Railway's Postgres plugin) and falls back to in-memory locally — same JSONB blob per ticker, same override semantics, but durable.
  • Scope honesty lives in the UI copy, not just the docs. Below the curated 14-ticker grid is a free-text search that accepts any SEC-filed company; its caveat names the same five-industry coverage as the README, but at the point where the user is about to choose rather than buried in a doc. AMZN extracts and values via the standard path — it works, but Amazon's reinvestment profile breaks the steady-state Gordon-terminal assumption, so the fair value reads conservatively. The escape hatch exists; the choice was to make the caveat unmissable, not the hatch unreachable.

Closing observation

The most useful principle: make extraction failures visible rather than hide them. A valuation that comes with a flagged-items list and source quotes per line item is more honest and more useful than one that arrives with full confidence and no provenance. The verification surface is what turns this from a demo into something a finance reviewer would actually trust.