Overview
Multifamily operators turn lease PDFs into structured data at portfolio scale — during acquisitions, audits, and renewals. A language model can read a lease and emit the fields; that part is close to free. The hard part is making a reviewer trust the output, which means showing — for every rent amount, deposit, and term date — exactly where in the document it came from.
Tenancy extracts nine structured sections from a residential lease, flags anything ambiguous into a human-review queue, and on every field offers a click-to-highlight back to the source page. The design bet is narrow and load-bearing: the highlight has to be anchored to the document, because a highlight in the wrong place erodes a reviewer's trust faster than no highlight at all.
Problem framing
Three observations shape the design, and all three are about the highlight, not the extraction:
- The extraction is the easy part; the highlight is the product. Claude reads a Texas Apartment Association lease and returns the nine sections — parties, property, term, rent, deposits, utilities, pets, special clauses, compliance — without much coaxing. What a property-management reviewer actually needs is to click a field and see the source line light up in the PDF. That link is where the system repeatedly broke.
- A wrong-place highlight is worse than none. The first approach was text matching: find the extracted snippet in the PDF's text layer and highlight it. Twelve-plus iterations of increasingly elaborate fuzzy matching still produced confident highlights on the wrong line — and a reviewer who catches the tool highlighting the wrong place once stops trusting every highlight after. The honest fallback (highlight nothing when you're not sure, and say so) beats a clever wrong guess.
- Letting the model emit coordinates fails exactly where it matters. The obvious fix — ask Claude's vision to return a bounding box per field — drifted three to eight percent on filled values and boxed entire section headers when the field was a blank template placeholder. The blank and scanned fields are precisely the ones a reviewer most needs to verify, so a method that degrades there is a method that fails the job.
Solution
The pipeline ingests and OCRs the PDF, extracts the nine sections grounded on page images, derives highlight coordinates from OCR positions rather than model estimation, validates the result into an exception queue, and answers grounded questions over the stored extraction. The same backend is exposed as an MCP server.
Page-image-grounded extraction
Each section is extracted by Claude Sonnet over the OCR'd text plus a rendered PNG of each page. The split is explicit in the prompt: the image is ground truth for visual fields — checkboxes, signatures, hand-fill — and the OCR text is ground truth for dense prose. This fixed a class of false-positive checkbox reads where Tesseract had read scan noise as a stray mark and the model trusted the text.
The highlight pivot — OCR-anchored bounding boxes
The model's response contract carries no coordinates. It returns {value, snippet, match_type, section_label} and nothing geometric. The backend then aligns each snippet against pdfplumber's word-level positions in the OCR'd PDF using rapidfuzz partial-ratio matching, groups the matched words into lines by their vertical coordinate, and emits one bounding box per line — the PDF specification's QuadPoints highlight model, the same shape Adobe and Mendeley render. The frontend draws absolutely-positioned overlays scaled to the rendered page width. Filled values land tight on the value; multi-line values render as stacked rectangles; blank template fields anchor on the labeled blank line rather than the section header above it.
Exception queue with real resolve semantics
Validation rules — required-field presence, date ordering, confidence thresholds — flag exceptions as blocking or warning. The three resolve actions are genuinely distinct: approve accepts the current value and clears the blocker; edit walks the extraction JSON to the exception's field_path, rewrites the leaf value, and bumps confidence to 1.0 (a human said so); reject closes the row for audit but keeps the blocking flag material. A derived ready_to_proceed boolean on every lease is true only when the lease is complete and no blocking exception is left unresolved-or-rejected.
Grounded Q&A and the MCP surface
Claude Haiku answers natural-language questions over the stored extraction — never the raw PDF — so every answer cites a field_path, page, and snippet, and the citation clicks through to the same highlight overlay. The whole agent is also a six-tool MCP server (list_leases, get_lease, extract_lease, query_lease, list_exceptions, resolve_exception), so it runs inside Claude Desktop against the same Railway backend the SaaS UI uses.
Implementation considerations
The strict matcher was the right retreat, not the answer. After the fuzzy-matching iterations, the interim ship was exact-normalized-match-only: highlight only on an exact text-layer hit, and fail silently otherwise. That made the failure mode honest — no more wrong-place highlights — but it couldn't highlight a scanned page or a blank field at all, because there was no text-layer string to match. It bought time to build the real thing without shipping a dishonest one.
The model must not emit coordinates. The pattern I found across the OCR and document-layout tooling I looked at was consistent: let OCR own geometry and let the model own semantics. Coordinate generation from a raster is exactly where model-emitted boxes drift — it is the part to take away from the model, not tune. The pivot was to stop asking Claude for boxes and start deriving them from OCR positions the model never sees.
A many-image API limit bit late, on a rotated scan. Page renders are capped at 1950px on the longest side because Anthropic's many-image requests cap each image at 2000px — and every extraction sends nine section calls, each a many-image request. Letter at 150 DPI is fine; a legal-size or rotated-and-OCR'd scan would 400 the whole extraction with an image-dimensions error until the per-page scale was clamped.
The renderer is bbox-source-agnostic by construction. The frontend consumes bboxes: BoundingBox[] and is blind to where the boxes came from. The vision-to-OCR pivot touched zero renderer code, and the eventual production-grade upgrade — AWS Textract SELECTION_ELEMENT for checkbox and signature geometry — is the same clean backend swap.
Reflections
- The saga was the lesson. Twelve-plus iterations of fuzzy text matching, then a strict-match retreat that failed silently, then model-emitted boxes that drifted, then OCR-anchored boxes that held. The answer turned out to be an architecture — OCR for geometry, model for semantics — not a better heuristic. The detour lives in the commit history and the status log; I left it there, because the wrong turns are the part of the story that's actually instructive.
- Verified end-to-end, not measured. On a fresh lease I checked field by field: filled values land tight, a multi-line owner address renders as two stacked rectangles, and a blank monthly-rent field anchors on the labeled blank line in the rent section rather than the section header. There is no formal eval set yet, so I describe the highlight quality from direct verification — not a single accuracy percentage I couldn't defend under questioning.
- The coverage gap is stated where the user sees it, not just in the docs. Checkboxes and signatures aren't highlighted; that geometry is deferred to a Textract follow-up. Rather than guess a box, the app navigates to the right page and shows no overlay — the same silent-miss-over-wrong-place rule that governs the whole feature. A reviewer is told the truth at the moment of the click.
- Two surfaces, one backend. The SaaS frontend (Next.js on Vercel) and the six-tool MCP server both hit the same FastAPI + LangGraph backend (Railway). The MCP turns "show me what's flagged on this lease and fix the start date to January 1st" into a multi-tool exchange in Claude Desktop —
list_exceptions, thenget_leasefor context, thenresolve_exceptionwith the correction — against live production data. - What's deferred is deferred honestly. No multi-tenant auth, no eval set, no closed-loop feedback yet. The exception queue already stores human corrections as first-class data; feeding those corrections back into the extraction prompt is the obvious next step and is explicitly not built. Calling that out is cheaper than discovering later that the demo implied it.
Closing observation
The most useful principle carried straight over from the rest of this portfolio's work:
Make the failure visible. A lease abstraction that highlights nothing when it isn't sure — and says so — is more trustworthy than one that confidently boxes the wrong line.
The highlight is the part a reviewer actually checks. Getting it honest mattered more than getting it everywhere — and that is the decision the whole system is organized around.