E5: Corpus Ingestion Scale-Up — Epic Design Doc
Status: 🔄 In Refinement (Step 0) Authors: Dan Hannah & Clay Created: 2026-04-24 Parent: QuoteAI Project Design Doc
Overview
What Is This Epic?
Scale the ingestion pipeline from the curated demo subset (5 quotes, 32 line items, 3 products as of 2026-04-24) to the modern Brehob corpus (~5K files). The existing pipeline (ingestion/orchestrator.ts) is architecturally correct but was built and validated for the demo — E5 adapts it for corpus-scale throughput, pre-flight routing, and post-ingestion index tuning.
Motivation: the Andy demo and early Brehob pilot will produce retrieval gaps ("Sized at 100 HP, 125 PSI, 460/3/60 — needs verification") on every unfamiliar equipment line the catalog doesn't cover. At demo-subset scale, nearly every line has a gap. Ingesting the modern corpus closes the gap for ~95%+ of what real salespeople will quote.
Goals
- Ingest the modern-era Brehob corpus (~5K files) end-to-end
- Add a pre-flight classifier gate so junk / pricing-only / template files skip extraction
- Add parsed-text hash dedupe so files that are byte-different but text-identical (same quote saved to 3 folders) skip extraction
- Retune the `ivfflat` indexes (lists, probes) after the corpus lands — current params are sized for 5 quotes
- Capture an operational run record so re-ingestion 6 months from now doesn't require re-deriving the plan
Non-Goals
- Legacy `.doc` ingestion (pre-2010 era) — ~8,600 files, mostly historical. Would require `libreoffice --convert-to` or `antiword` in the pipeline, parse reliability is poor, and phrasing / product mix don't match what John quotes today. Revisit only if demo retrieval gaps surface pre-2010 equipment.
- Template A (~2005) `.xls` files — 19% of the readable corpus, same era-mismatch argument.
- Sonnet batch-extraction (N files per call) — evaluated and rejected; see E5-D4.
- New extractor prompts or schema changes — E5 is a scale-up of the existing pipeline, not a rewrite. Prompt / schema iteration stays in the extractor module under its existing versioning.
- Semantic line-item dedupe across quotes — ivfflat handles near-duplicates at query time via ranking. Storage dedupe at the `content_hash` grain is sufficient.
Problem Statement
Current DB state (2026-04-24): 5 past_quotes, 32 line_items, 3 products. This is the demo subset. The ingestion/cache/ directory has 17,319 files totaling 3.8 GB, representing 22 years of Brehob quote history plus product spec sheets.
For Andy's pilot and real-world usage, retrieval gaps will dominate the UX — the AI will flag nearly every line item as "needs verification" because nothing matches in the catalog. The fix is ingesting the full modern corpus.
The naive approach ("shovel everything in") is wrong on three counts:
- Half the corpus is legacy `.doc` (pre-2010). High parse cost, low-signal phrasing, product mix doesn't match modern quotes.
- Duplicate line items crush retrieval value. Brehob's quotes are template-heavy — the same line-item description appears verbatim across hundreds of files. ivfflat ranking handles this at query time; storage-level dedupe (existing `content_hash`) handles it at write time.
- The pipeline currently runs the same two Haiku calls on every file regardless of content, including pricing-only worksheets and template boilerplate. A pre-flight classifier gate saves 20-30% of extractor calls.
Context
Current State
- `ingestion/orchestrator.ts` walks a directory, infers type from folder path (`Customer Quotes` → quote, else product), runs parse → Haiku extract → OpenAI embed → Postgres load.
- Hash-based skip at the `source_hash` grain — re-runs of unchanged files are free.
- `--parallel N` flag (default 1) controls concurrency across the worker pool.
- Extraction runs via the `claude` CLI in print mode, not the Anthropic SDK. This uses Dan's Max 20x plan — dollar cost is effectively zero; rate limits are the real constraint.
- Loader-level `content_hash` dedupe on verbatim line-item descriptions — duplicates write once.
Dependencies
- E1 (Ingestion + Vector DB) — extends the pipeline from this epic.
- E2 (MCP Servers) — unchanged; retrieval consumes the expanded DB through the same MCP surface.
- No new infrastructure.
Affected Systems
| Layer | How affected |
|---|---|
| `ingestion/orchestrator.ts` | New pre-flight classifier gate, parsed-text hash skip, per-run log output |
| `ingestion/extractor/` | Unchanged — same extractors, same versions |
| `ingestion/loader/` | Unchanged — `content_hash` already handles line-item dedupe |
| Postgres schema | No DDL; only new rows |
| ivfflat indexes | Rebuilt once post-ingestion with retuned lists and probes |
Design
Corpus Scope
From the 17,319 files in ingestion/cache/ (counted 2026-04-24):
| Type | Count | Ingest? | Notes |
|---|---|---|---|
| `.xlsx` | 2,992 | ✅ | Modern Excel, parses cleanly |
| `.xls` | 4,780 | ✅ (~50% readable) | Template B + C era; ~50% encrypted with a legacy default password, which SheetJS-community can't decrypt |
| `.pdf` | 768 | ✅ | Product spec sheets, datasheets |
| `.docx` | 184 | ✅ | Modern Word quotes, proposals |
| `.doc` | 8,595 | ❌ | Legacy Word (pre-2010); skip for MVP |
Estimated eligible after encryption + skip filters: ~5,000-5,500 files.
Pipeline Extensions
1. Parsed-text hash skip (before extract)
Today: orchestrator.ts skips on source_hash (file bytes) match. If two files have identical content but different bytes (e.g., resaved via Excel with updated metadata) or the same quote exists in multiple folders, we run the full extractor on both.
Change: after parseFile(), compute sha256(parsed.text) and check a new DB index (parsed_text_hash). If match → skip extract, log as skipped_text_dupe. Updates past_quotes / products to record the duplicate source_file pointer for operational visibility.
Trade-off: a second source_file row per duplicate adds some noise in past_quotes, but preserves the audit trail ("this equipment was quoted for 3 different customers").
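A minimal sketch of the skip check, assuming an in-memory stand-in for the `parsed_text_hash` DB index — `shouldExtract` and `seenTextHashes` are illustrative names, not the orchestrator's actual API:

```typescript
import { createHash } from "node:crypto";

// Hash the parsed text, not the file bytes, so byte-different but
// text-identical files collapse to one extraction.
function parsedTextHash(text: string): string {
  // Normalize line endings so a CRLF resave doesn't defeat the dedupe.
  return createHash("sha256").update(text.replace(/\r\n/g, "\n")).digest("hex");
}

// Returns true if this text is new and should go to the extractor;
// false means log as skipped_text_dupe and record the duplicate pointer.
function shouldExtract(text: string, seenTextHashes: Set<string>): boolean {
  const h = parsedTextHash(text);
  if (seenTextHashes.has(h)) return false;
  seenTextHashes.add(h);
  return true;
}
```

In the real pipeline the `Set` would be a DB index lookup, but the control flow is the same: hash after `parseFile()`, check, and bail before the extract branch.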
2. Pre-flight classifier gate
Before the extract branch, run one LLM call (Haiku via CLI) that classifies the parsed text into:
- `customer_quote` — full customer quote with line items. Route to `ingestQuote`.
- `product_spec` — datasheet / catalog entry. Route to `ingestProduct`.
- `pricing_only` — worksheet with numbers but no equipment descriptions. Skip.
- `template` — empty template / boilerplate scaffold. Skip.
- `junk` — unparseable content, corrupted data, etc. Skip.
Replaces the current inferType() folder-path heuristic (Customer Quotes marker), which is brittle (product specs saved in quote folders, quotes saved in catalog folders).
Cost: 1 extra Haiku call per file. Savings: ~20-30% of extraction calls avoided on junk / pricing-only / template files. Net win on throughput and correctness.
Start with Haiku for throughput; escalate to Sonnet only if observed misclassification rate exceeds ~5%.
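The routing for the five verdicts can be sketched as a plain dispatch — `routeFor` is an illustrative name, and the Haiku CLI call that produces the verdict is omitted:

```typescript
// The five classifier verdicts described above, and where each routes.
type Verdict = "customer_quote" | "product_spec" | "pricing_only" | "template" | "junk";
type Route = "ingestQuote" | "ingestProduct" | "skip";

function routeFor(verdict: Verdict): Route {
  switch (verdict) {
    case "customer_quote":
      return "ingestQuote";
    case "product_spec":
      return "ingestProduct";
    default:
      return "skip"; // pricing_only, template, junk all bypass extraction
  }
}
```

Keeping the routing as a pure function makes the `--no-classifier` escape hatch trivial: fall back to the folder-path verdict and feed it through the same dispatch.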
3. Concurrency
--parallel 10 is the target. Beyond that, we hit Max-plan rate limits (Haiku), not CPU. Parser parallelism (especially PDF + .xls) is the secondary ceiling — Node is single-threaded for parsing but the worker pool uses await-based concurrency so I/O-bound phases (LLM, embed, DB) overlap cleanly.
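A minimal bounded-concurrency runner in the spirit of the await-based pool described above — `runPool` and its signature are a sketch, not the orchestrator's actual API:

```typescript
// N "lanes" each pull the next item index until the list is exhausted.
// JS is single-threaded, so the counter increment is race-free; the
// concurrency win comes from awaits overlapping I/O-bound phases.
async function runPool<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  parallel: number,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const lane = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(parallel, items.length) }, lane),
  );
  return results;
}
```

With `--parallel 10`, ten lanes keep ten files in flight; slow PDFs don't block the lane pulling the next `.xlsx`.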
4. Run log
Write per-run operational summary to ingestion/RUNS.md (new file). One entry per production run:
## 2026-04-24 — Full modern corpus ingestion
- Scope: .xlsx + .xls + .pdf + .docx (skipped .doc)
- Stats: Scanned X / Eligible Y / Processed Z / Failed F / Empty E / TextDupe D
- Wallclock: Xh Ym (--parallel 10)
- DB growth: past_quotes A→B, quote_line_items C→D, products E→F
- ivfflat retune: lists X→Y, probes Z
- Notable failures: (top categories)
Operational history, not design rationale. Design rationale lives in this epic + decisions.md.
Post-Ingestion Index Retune
Current: ivfflat indexes on past_quotes.embedding, quote_line_items.embedding, products.embedding. Default lists = 20, per-session probes = 20.
Why retune: lists = sqrt(N) is the pgvector rule of thumb. At N = 5 we overshot; at N = 25,000+ line items we'll undershoot. Without retune, query recall degrades gracefully (not catastrophically), but latency grows. Better to rebuild once after ingestion lands than drift.
Target after corpus lands:
- `quote_line_items`: `lists = 160` (sqrt(25000) ≈ 158)
- `past_quotes`: `lists = 70` (sqrt(5000) ≈ 71)
- `products`: `lists = 25` (sqrt(500) ≈ 22)
- `probes`: keep `20` for now — the recall/latency trade can be tuned later against real queries
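The sqrt(N) rule of thumb behind these targets, as a one-liner (illustrative helper, not part of the pipeline — the targets above round the results to friendlier values like 160 and 25):

```typescript
// pgvector rule of thumb from the section above: lists ≈ sqrt(row count).
function ivfflatLists(rowCount: number): number {
  return Math.max(1, Math.round(Math.sqrt(rowCount)));
}
```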
Retune is a one-time index rebuild after the big run completes — drop and re-create each index `CONCURRENTLY` with the new `lists` value (a plain `REINDEX` rebuilds with the existing parameters, so it can't change `lists`). Not part of the per-run cost.
What We Explicitly Rejected
Sonnet batch-extraction (N files per call). Tempting throughput win on paper (~5x fewer calls). In practice:
- Quality degrades on multi-item extraction — middle items in a batch get less attention (the "lost in the middle" effect).
- One file parse-weird corrupts the whole batch's JSON output; partial-recovery orchestration isn't worth it.
- Shared-prompt savings are small (~15% of payload) for our long-file case.
- Parallelism already delivers the throughput we need.
See E5-D4 for full rationale.
Haiku-queries-DB-for-dedupe. Adds latency (DB + LLM reasoning) to every extraction. "Is this equipment already known" is literally what pgvector cosine similarity is for. Three-layer hash (source / parsed-text / content) covers the correct dedupe grains.
See E5-D2 for full rationale.
API / Interface Changes
No new external APIs. Orchestrator CLI gains:
| Flag | Purpose |
|---|---|
| `--skip-legacy-doc` | Skip `.doc` files (defaults to on for MVP scope) |
| `--no-classifier` | Bypass the classifier gate; fall back to folder-path inference (escape hatch) |
| `--run-label <name>` | Tag the RUNS.md entry with a descriptive name |
Existing flags (--dir, --limit, --parallel, --errors-log) unchanged.
Data Model
No DDL beyond an optional `parsed_text_hash` index on `past_quotes` and `products` for the text-hash skip path. Schema otherwise unchanged.
Testing / Verification Strategy
- Classifier smoke — hand-label 50 files across the 5 categories; verify the classifier matches Dan's ground truth ≥95%. One-shot test, not a continuous suite.
- Text-hash skip smoke — pick 3 files with known content duplicates (same quote in multiple folders); verify the first ingests and subsequent copies skip as `text_dupe`.
- Dry-run mode — `--parallel 1 --limit 50` on a random sample before the full run. Inspect classifier verdicts and extraction outputs for 50 files; catch systematic errors before the 2-hour run.
- Post-run QA sweep — Sonnet reads original parsed text + extracted fields side-by-side for a random 30-quote sample, flags systematic errors. One-time validation, not continuous.
- Golden test — existing `ingestion/golden-test.ts` should still pass after the corpus lands (the "100HP oilless compressor for food-grade plant" query → top-5 includes Groeb Farms, 4M Industries, Powerex SEQ1007). If this regresses, the ingestion added noise that's pushing good hits out of the top 5.
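The top-5 membership check behind the golden test can be expressed as a small helper — a sketch only; `goldenTopK` and its inputs are illustrative, not the actual contents of `golden-test.ts`:

```typescript
// Given a ranked result list, verify every expected hit lands in the top k.
function goldenTopK(ranked: string[], expected: string[], k = 5): boolean {
  const top = new Set(ranked.slice(0, k));
  return expected.every((hit) => top.has(hit));
}
```

A regression here means the expected hits still exist in the DB but newly ingested near-matches outrank them — which is why the mitigation is a `probes` retune, not a data rollback.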
Stories
| # | Summary | Status |
|---|---|---|
| S0 | Parsed-text hash dedupe in orchestrator | 🔄 Pending |
| S1 | Classifier gate — Haiku pre-flight routing | 🔄 Pending |
| S2 | --skip-legacy-doc + --run-label CLI flags + RUNS.md writer | 🔄 Pending |
| S3 | Dry-run on 100-file sample (filesystem-order); validate classifier + extractor | 🔄 Pending |
| S4 | Full modern-corpus ingestion run (--parallel 10) | 🔄 Pending |
| S5 | Post-run ivfflat retune (lists + probes) | 🔄 Pending |
| S6 | Post-run Sonnet QA sweep on 30-quote sample | 🔄 Pending |
| S7 | Golden-test pass verification against new corpus | 🔄 Pending |
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Classifier misclassifies real quotes as junk | Low | 🔴 High | Dry-run smoke + ground-truth check before full run; --no-classifier escape hatch |
| Max-plan Haiku rate limit hit mid-run | Medium | 🟡 Medium | --parallel 10 is conservative; orchestrator already continues-on-error so we can resume on hash-skip |
| Parse failures on encrypted .xls flood the error log | High | 🟢 Low | Known behavior — expected for ~50% of legacy .xls. Categorize in RUNS.md, don't treat as bugs |
| Golden test regresses after corpus lands | Medium | 🟡 Medium | If top-5 slips, retune probes or investigate whether expected hits are still the right baseline |
| Ingested data drifts from current schema (extractor version bumps) | Low | 🟡 Medium | extractor_version skip already handles this — bumped version re-ingests naturally |
Decisions Log
Active decisions live in quoteai/decisions.md until graduated.
| ID | Title | Date | Status |
|---|---|---|---|
| E5-D1 | Modern-corpus scope; skip legacy .doc | 2026-04-24 | Active |
| E5-D2 | Three-layer hash dedupe; no LLM-DB-query | 2026-04-24 | Active |
| E5-D3 | Pre-flight classifier gate (Haiku) | 2026-04-24 | Active |
| E5-D4 | No Sonnet batch-extraction; parallelism instead | 2026-04-24 | Active |