E5: Corpus Ingestion Scale-Up — Epic Design Doc
Status: 🔄 In Refinement (Step 0) Authors: Dan Hannah & Clay Created: 2026-04-24 Parent: QuoteAI Project Design Doc
Overview
What Is This Epic?
Scale the ingestion pipeline from the curated demo subset (5 quotes, 32 line items, 3 products as of 2026-04-24) to the modern Brehob corpus (~5K files). The existing pipeline (ingestion/orchestrator.ts) is architecturally correct but was built and validated for the demo — E5 adapts it for corpus-scale throughput, pre-flight routing, and post-ingestion index tuning.
Motivation: the Andy demo and early Brehob pilot will produce retrieval gaps ("Sized at 100 HP, 125 PSI, 460/3/60 — needs verification") on every unfamiliar equipment line the catalog doesn't cover. At demo-subset scale, nearly every line has a gap. Ingesting the modern corpus closes the gap for ~95%+ of what real salespeople will quote.
Goals
- Ingest the modern-era Brehob corpus (~5K files) end-to-end
- Add a pre-flight classifier gate so junk / pricing-only / template files skip extraction
- Add parsed-text hash dedupe so files that are byte-different but text-identical (same quote saved to 3 folders) skip extraction
- Retune the `ivfflat` indexes (lists, probes) after the corpus lands — current params are sized for 5 quotes
- Capture an operational run record so re-ingestion 6 months from now doesn't require re-deriving the plan
Non-Goals
- Legacy `.doc` ingestion (pre-2010 era) — ~8,600 files, mostly historical. Would require `libreoffice --convert-to` or `antiword` in the pipeline, parse reliability is poor, and phrasing / product mix don't match what John quotes today. Revisit only if demo retrieval gaps surface pre-2010 equipment.
- Template A (~2005) `.xls` files — 19% of the readable corpus, same era-mismatch argument.
- Sonnet batch-extraction (N files per call) — evaluated and rejected; see E5-D4.
- New extractor prompts or schema changes — E5 is a scale-up of the existing pipeline, not a rewrite. Prompt / schema iteration stays in the extractor module under its existing versioning.
- Semantic line-item dedupe across quotes — ivfflat handles near-duplicates at query time via ranking. Storage dedupe at the `content_hash` grain is sufficient.
Problem Statement
Current DB state (2026-04-24): 5 past_quotes, 32 line_items, 3 products. This is the demo subset. The ingestion/cache/ directory has 17,319 files totaling 3.8 GB, representing 22 years of Brehob quote history plus product spec sheets.
For Andy's pilot and real-world usage, retrieval gaps will dominate the UX — the AI will flag nearly every line item as "needs verification" because nothing matches in the catalog. The fix is ingesting the full modern corpus.
The naive approach ("shovel everything in") is wrong on three counts:
- Half the corpus is legacy `.doc` (pre-2010). High parse cost, low-signal phrasing, product mix doesn't match modern quotes.
- Duplicate line items crush retrieval value. Brehob's quotes are template-heavy — the same line-item description appears verbatim across hundreds of files. ivfflat ranking handles this at query time; storage-level dedupe (existing `content_hash`) handles it at write time.
- The pipeline currently runs the same two Haiku calls on every file regardless of content, including pricing-only worksheets and template boilerplate. A pre-flight classifier gate saves 20-30% of extractor calls.
Context
Current State
- `ingestion/orchestrator.ts` walks a directory, infers type from folder path (`Customer Quotes` → quote, else product), runs parse → Haiku extract → OpenAI embed → Postgres load.
- Hash-based skip at the `source_hash` grain — re-runs of unchanged files are free.
- `--parallel N` flag (default 1) controls concurrency across the worker pool.
- Extraction runs via the `claude` CLI in print mode, not the Anthropic SDK. This uses Dan's Max 20x plan — dollar cost is effectively zero; rate limits are the real constraint.
- Loader-level `content_hash` dedupe on verbatim line-item descriptions — duplicates write once.
Dependencies
- E1 (Ingestion + Vector DB) — extends the pipeline from this epic.
- E2 (MCP Servers) — unchanged; retrieval consumes the expanded DB through the same MCP surface.
- No new infrastructure.
Affected Systems
| Layer | How affected |
|---|---|
| `ingestion/orchestrator.ts` | New pre-flight classifier gate, parsed-text hash skip, per-run log output |
| `ingestion/extractor/` | Unchanged — same extractors, same versions |
| `ingestion/loader/` | Unchanged — `content_hash` already handles line-item dedupe |
| Postgres schema | No DDL; only new rows |
| ivfflat indexes | Rebuilt once post-ingestion with retuned lists and probes |
Design
Corpus Scope
From the 17,319 files in ingestion/cache/ (counted 2026-04-24):
| Type | Count | Ingest? | Notes |
|---|---|---|---|
| `.xlsx` | 2,992 | ✅ | Modern Excel, parses cleanly |
| `.xls` | 4,780 | ✅ (~50% readable) | Template B + C era; ~50% encrypted with a legacy default password, which SheetJS-community can't decrypt |
| `.pdf` | 768 | ✅ | Product spec sheets, datasheets |
| `.docx` | 184 | ✅ | Modern Word quotes, proposals |
| `.doc` | 8,595 | ❌ | Legacy Word (pre-2010); skip for MVP |
Estimated eligible after encryption + skip filters: ~5,000-5,500 files.
Pipeline Extensions
1. Parsed-text hash skip (before extract)
Today: orchestrator.ts skips on source_hash (file bytes) match. If two files have identical content but different bytes (e.g., resaved via Excel with updated metadata) or the same quote exists in multiple folders, we run the full extractor on both.
Change: after parseFile(), compute sha256(parsed.text) and check a new DB index (parsed_text_hash). If match → skip extract, log as skipped_text_dupe. Updates past_quotes / products to record the duplicate source_file pointer for operational visibility.
Trade-off: a second source_file row per duplicate adds some noise in past_quotes, but preserves the audit trail ("this equipment was quoted for 3 different customers").
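A minimal sketch of the skip check, assuming an in-memory stand-in for the `parsed_text_hash` DB index — `shouldExtract` and `seenTextHashes` are illustrative names, not the orchestrator's actual API:

```typescript
import { createHash } from "node:crypto";

// Hash the parsed text, not the file bytes, so byte-different but
// text-identical files collapse to one extraction.
function parsedTextHash(text: string): string {
  // Normalize line endings so a CRLF resave doesn't defeat the dedupe.
  return createHash("sha256").update(text.replace(/\r\n/g, "\n")).digest("hex");
}

// Returns true if this text is new and should go to the extractor;
// false means log as skipped_text_dupe and record the duplicate pointer.
function shouldExtract(text: string, seenTextHashes: Set<string>): boolean {
  const h = parsedTextHash(text);
  if (seenTextHashes.has(h)) return false;
  seenTextHashes.add(h);
  return true;
}
```

In the real pipeline the `Set` would be a DB index lookup, but the control flow is the same: hash after `parseFile()`, check, and bail before the extract branch.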
2. Pre-flight classifier gate
Before the extract branch, run one LLM call (Haiku via CLI) that classifies the parsed text into:
- `customer_quote` — full customer quote with line items. Route to `ingestQuote`.
- `product_spec` — datasheet / catalog entry. Route to `ingestProduct`.
- `pricing_only` — worksheet with numbers but no equipment descriptions. Skip.
- `template` — empty template / boilerplate scaffold. Skip.
- `junk` — unparseable content, corrupted data, etc. Skip.
Replaces the current inferType() folder-path heuristic (Customer Quotes marker), which is brittle (product specs saved in quote folders, quotes saved in catalog folders).
Cost: 1 extra Haiku call per file. Savings: ~20-30% of extraction calls avoided on junk / pricing-only / template files. Net win on throughput and correctness.
Start with Haiku for throughput; escalate to Sonnet only if observed misclassification rate exceeds ~5%.
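The routing for the five verdicts can be sketched as a plain dispatch — `routeFor` is an illustrative name, and the Haiku CLI call that produces the verdict is omitted:

```typescript
// The five classifier verdicts described above, and where each routes.
type Verdict = "customer_quote" | "product_spec" | "pricing_only" | "template" | "junk";
type Route = "ingestQuote" | "ingestProduct" | "skip";

function routeFor(verdict: Verdict): Route {
  switch (verdict) {
    case "customer_quote":
      return "ingestQuote";
    case "product_spec":
      return "ingestProduct";
    default:
      return "skip"; // pricing_only, template, junk all bypass extraction
  }
}
```

Keeping the routing as a pure function makes the `--no-classifier` escape hatch trivial: fall back to the folder-path verdict and feed it through the same dispatch.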
3. Concurrency
--parallel 10 is the target. Beyond that, we hit Max-plan rate limits (Haiku), not CPU. Parser parallelism (especially PDF + .xls) is the secondary ceiling — Node is single-threaded for parsing but the worker pool uses await-based concurrency so I/O-bound phases (LLM, embed, DB) overlap cleanly.
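A minimal bounded-concurrency runner in the spirit of the await-based pool described above — `runPool` and its signature are a sketch, not the orchestrator's actual API:

```typescript
// N "lanes" each pull the next item index until the list is exhausted.
// JS is single-threaded, so the counter increment is race-free; the
// concurrency win comes from awaits overlapping I/O-bound phases.
async function runPool<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  parallel: number,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const lane = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(parallel, items.length) }, lane),
  );
  return results;
}
```

With `--parallel 10`, ten lanes keep ten files in flight; slow PDFs don't block the lane pulling the next `.xlsx`.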
4. Run log
Write per-run operational summary to ingestion/RUNS.md (new file). One entry per production run:
## 2026-04-24 — Full modern corpus ingestion
- Scope: .xlsx + .xls + .pdf + .docx (skipped .doc)
- Stats: Scanned X / Eligible Y / Processed Z / Failed F / Empty E / TextDupe D
- Wallclock: Xh Ym (--parallel 10)
- DB growth: past_quotes A→B, quote_line_items C→D, products E→F
- ivfflat retune: lists X→Y, probes Z
- Notable failures: (top categories)
Operational history, not design rationale. Design rationale lives in this epic + decisions.md.
Post-Ingestion Index Retune
Current: ivfflat indexes on past_quotes.embedding, quote_line_items.embedding, products.embedding. Default lists = 20, per-session probes = 20.
Why retune: lists = sqrt(N) is the pgvector rule of thumb. At N = 5 we overshot; at N = 25,000+ line items we'll undershoot. Without retune, query recall degrades gracefully (not catastrophically), but latency grows. Better to rebuild once after ingestion lands than drift.
Target after corpus lands:
- `quote_line_items`: `lists = 160` (sqrt(25000) ≈ 158)
- `past_quotes`: `lists = 70` (sqrt(5000) ≈ 71)
- `products`: `lists = 25` (sqrt(500) ≈ 22)
- `probes`: keep `20` for now — the recall/latency trade can be tuned later against real queries
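The sqrt(N) rule of thumb behind these targets, as a one-liner (illustrative helper, not part of the pipeline — the targets above round the results to friendlier values like 160 and 25):

```typescript
// pgvector rule of thumb from the section above: lists ≈ sqrt(row count).
function ivfflatLists(rowCount: number): number {
  return Math.max(1, Math.round(Math.sqrt(rowCount)));
}
```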
Retune is a one-time index rebuild after the big run completes — drop and re-create each index `CONCURRENTLY` with the new `lists` value (a plain `REINDEX` rebuilds with the existing parameters, so it can't change `lists`). Not part of the per-run cost.
What We Explicitly Rejected
Sonnet batch-extraction (N files per call). Tempting throughput win on paper (~5x fewer calls). In practice:
- Quality degrades on multi-item extraction — middle items in a batch get less attention (the "lost in the middle" effect).
- One file parse-weird corrupts the whole batch's JSON output; partial-recovery orchestration isn't worth it.
- Shared-prompt savings are small (~15% of payload) for our long-file case.
- Parallelism already delivers the throughput we need.
See E5-D4 for full rationale.
Haiku-queries-DB-for-dedupe. Adds latency (DB + LLM reasoning) to every extraction. "Is this equipment already known" is literally what pgvector cosine similarity is for. Three-layer hash (source / parsed-text / content) covers the correct dedupe grains.
See E5-D2 for full rationale.
API / Interface Changes
No new external APIs. Orchestrator CLI gains:
| Flag | Purpose |
|---|---|
| `--skip-legacy-doc` | Skip `.doc` files (defaults to on for MVP scope) |
| `--no-classifier` | Bypass the classifier gate; fall back to folder-path inference (escape hatch) |
| `--run-label <name>` | Tag the RUNS.md entry with a descriptive name |
Existing flags (--dir, --limit, --parallel, --errors-log) unchanged.
Data Model
No DDL beyond an optional `parsed_text_hash` index on `past_quotes` and `products` for the text-hash skip path. Schema otherwise unchanged.
Testing / Verification Strategy
- Classifier smoke — hand-label 50 files across the 5 categories; verify the classifier matches Dan's ground truth ≥95%. One-shot test, not a continuous suite.
- Text-hash skip smoke — pick 3 files with known content duplicates (same quote in multiple folders); verify the first ingests and subsequent copies skip as `text_dupe`.
- Dry-run mode — `--parallel 1 --limit 50` on a random sample before the full run. Inspect classifier verdicts and extraction outputs for 50 files; catch systematic errors before the 2-hour run.
- Post-run QA sweep — Sonnet reads original parsed text + extracted fields side-by-side for a random 30-quote sample, flags systematic errors. One-time validation, not continuous.
- Golden test — existing `ingestion/golden-test.ts` should still pass after the corpus lands (the "100HP oilless compressor for food-grade plant" query → top-5 includes Groeb Farms, 4M Industries, Powerex SEQ1007). If this regresses, the ingestion added noise that's pushing good hits out of the top 5.
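The top-5 membership check behind the golden test can be expressed as a small helper — a sketch only; `goldenTopK` and its inputs are illustrative, not the actual contents of `golden-test.ts`:

```typescript
// Given a ranked result list, verify every expected hit lands in the top k.
function goldenTopK(ranked: string[], expected: string[], k = 5): boolean {
  const top = new Set(ranked.slice(0, k));
  return expected.every((hit) => top.has(hit));
}
```

A regression here means the expected hits still exist in the DB but newly ingested near-matches outrank them — which is why the mitigation is a `probes` retune, not a data rollback.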
Stories
| # | Summary | Status |
|---|---|---|
| S0 | Parsed-text hash dedupe in orchestrator | 🔄 Pending |
| S1 | Classifier gate — Haiku pre-flight routing | 🔄 Pending |
| S2 | --skip-legacy-doc + --run-label CLI flags + RUNS.md writer | 🔄 Pending |
| S3 | Dry-run on 100-file sample (filesystem-order); validate classifier + extractor | 🔄 Pending |
| S4 | Full modern-corpus ingestion run (--parallel 10) | 🔄 Pending |
| S5 | Post-run ivfflat retune (lists + probes) | 🔄 Pending |
| S6 | Post-run Sonnet QA sweep on 30-quote sample | 🔄 Pending |
| S7 | Golden-test pass verification against new corpus | 🔄 Pending |
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Classifier misclassifies real quotes as junk | Low | 🔴 High | Dry-run smoke + ground-truth check before full run; --no-classifier escape hatch |
| Max-plan Haiku rate limit hit mid-run | Medium | 🟡 Medium | --parallel 10 is conservative; orchestrator already continues-on-error so we can resume on hash-skip |
| Parse failures on encrypted .xls flood the error log | High | 🟢 Low | Known behavior — expected for ~50% of legacy .xls. Categorize in RUNS.md, don't treat as bugs |
| Golden test regresses after corpus lands | Medium | 🟡 Medium | If top-5 slips, retune probes or investigate whether expected hits are still the right baseline |
| Ingested data drifts from current schema (extractor version bumps) | Low | 🟡 Medium | extractor_version skip already handles this — bumped version re-ingests naturally |
Decisions Log
Active decisions live in quoteai/decisions.md until graduated.
| ID | Title | Date | Status |
|---|---|---|---|
| E5-D1 | Modern-corpus scope; skip legacy .doc | 2026-04-24 | Active |
| E5-D2 | Three-layer hash dedupe; no LLM-DB-query | 2026-04-24 | Active |
| E5-D3 | Pre-flight classifier gate (Haiku) | 2026-04-24 | Active |
| E5-D4 | No Sonnet batch-extraction; parallelism instead | 2026-04-24 | Active |