E1: Ingestion + Vector DB — Epic Design Doc
Status: 🔄 In Refinement (Step 0) · Authors: Dan Hannah & Clay · Created: 2026-04-18 · Parent: QuoteAI Project Design Doc
Overview
Goals & Non-Goals
Goals:
- Pull a curated subset of Brehob source docs from Google Drive
- Convert `.doc` / `.xlsx` / `.pdf` → clean structured data using Billet-shaped parser modules
- Extract structured fields for products, quotes, and line items (the atomic retrieval unit)
- Generate embeddings for products, quotes, and individual line-item descriptions
- Load everything into the pgvector schema created in E0
- Provide a reproducible ingestion CLI (`pnpm ingest`) so we can rebuild the DB from scratch
Non-Goals:
- No full-corpus ingestion — curated ~20 quotes + ~20 product specs for demo
- No real-time sync with Drive — one-shot ingestion triggered manually
- No admin UI — CLI only
- No vendor quote ingestion — manual upload deferred to Full MVP
- No pricing sheet ingestion (except as text reference — not for AI pricing, per design doc)
Problem Statement
QuoteAI's core value is retrieval quality on John's 22-year description library. Without real data in the DB, the MCP tools (E2) have nothing to query and the UI (E3) has nothing to assemble. Worse: bad retrieval = bad demo = no customer. The ingestion pipeline IS the product's moat.
The data is messy (.doc files from 2008-2025, inconsistent spreadsheet layouts, varying quote structures), lives in Google Drive, and needs structured field extraction plus embedding generation before it's useful.
What Is This Epic?
A reproducible ingestion pipeline that takes raw Drive documents and produces a populated pgvector database with:
- Product catalog — structured specs from `.doc` spec sheets
- Past quotes — text + embeddings for semantic search over full quotes
- Line items — individually embedded descriptions, the atomic retrieval unit per the design doc
- Vendor quotes — text + embeddings (reference material)
Parsers are Billet-shaped (per the main design doc's Billet/Anvil decision) so they can be extracted into Billet v0 later.
Context
Dependents
- E2 (MCP Servers) — every MCP tool queries data loaded by E1
- Demo experience — retrieval quality is the demo's single biggest make-or-break
Dependencies
- E0 (Foundation) — schema must exist, DB must be runnable
- Google Drive MCP — already connected and working (confirmed in this session)
- OpenAI API key — for `text-embedding-3-small` (per design doc)
Current State
Source data lives in the "QuoteAI - Brehob Quote History" Google Drive folder (owner: johnhannah0624@gmail.com, shared with Dan). No local copy, no DB rows, no embeddings.
Affected Systems
| System / Layer | How It's Affected |
|---|---|
| `ingestion/` directory | Fully populated with parsers, embedder, loader |
| Postgres | All tables populated with the curated subset |
| OpenAI API | Billed for embedding generation (expect <$1 for ~40 docs × a few hundred line items) |
| Google Drive | Read-only access via MCP |
Design
Module Structure (Billet-Shaped)
ingestion/
├── download.ts # Pull curated file list from Drive
├── parsers/ # Format conversion ONLY (Billet-shaped)
│ ├── word.ts # textutil / mammoth — .doc / .docx → clean text
│ ├── excel.ts # SheetJS — .xlsx → markdown tables
│ ├── pdf.ts # pdf-parse → text
│ └── types.ts # shared ParsedDoc interface
├── extractor/ # Haiku-based structured extraction (QuoteAI-specific)
│ ├── extractor.ts # invoke Haiku with text + schema, validate output
│ ├── schemas.ts # Zod schemas for product / quote / line-item
│ ├── prompts.ts # extraction prompts (versioned, hand-tuned)
│ └── evals.ts # hand-labeled fixture runner
├── embed.ts # OpenAI text-embedding-3-small
├── load.ts # pgvector inserts with UPSERT
├── cli.ts # `pnpm ingest --subset=demo` entry point
└── fixtures/
├── demo-subset.json # curated file IDs for demo
└── eval-labeled/ # hand-labeled extraction targets
Separation of concerns (post-Haiku-extractor decision):
| Layer | Job | Scope |
|---|---|---|
| Parser | Format bytes → clean text | Billet-shaped. Extractable to Billet v0 later. |
| Extractor | Text → structured JSON (Zod-validated) | QuoteAI-specific. Uses Haiku. |
| Embedder | Structured fields + description blocks → vectors | OpenAI text-embedding-3-small. |
| Loader | Rows → Postgres with UPSERT | Straightforward CRUD. |
Why this shape: Parsers stay dumb and reusable — exactly what Billet's adapter pattern wants. The intelligence (knowing what a quote_line_item looks like, what a product row contains) lives in QuoteAI's extractor, where it belongs. When Billet ships, parsers/ extracts cleanly. The extractor stays here since it's schema-specific to QuoteAI.
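To make the parser/extractor boundary concrete, here is a minimal sketch of the shared contract in `parsers/types.ts`. The field names are assumptions for illustration, not the final interface:

```ts
// parsers/types.ts: minimal sketch of the Billet-shaped parser contract.
// Field names are illustrative assumptions, not the final interface.
export interface ParsedDoc {
  sourceFileId: string;        // Drive file ID (or local path for cached files)
  sourceFilename: string;
  format: "doc" | "docx" | "xlsx" | "pdf";
  text: string;                // clean text / markdown tables, no interpretation
  parsedAt: string;            // ISO timestamp
}

// Every parser is a dumb pure function: bytes in, ParsedDoc out.
export type Parser = (bytes: Buffer, filename: string) => Promise<ParsedDoc>;
```

Keeping this interface free of any quote- or product-specific fields is what lets `parsers/` lift cleanly into Billet later.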
Curated Subset Selection
~20 quotes + ~20 product specs, covering:
Product specs (by manufacturer):
- Quincy: QMB30, QSI-100, QGV-100, QSV-430, QT-10, QOF-400
- Zeks: 500HSF, ZFC filters, refrigerated dryer
- Hankison: 505BC, refrigerated dryer
- Powerex: oilless scroll, vacuum
- Dekker: vacuum
- CRP: oil-water separator
Past quotes (by complexity tier):
- Simple (5): Single-product + price — e.g., Henry Ford refurbished dryer
- Standard (10): Multi-line system — e.g., Bowen Engineering 15hp duplex, Corwell Health Royal Oak, Purforms 50hp
- Complex (5): Multi-option, controls, installation — e.g., 4M Industries (our template reference), Slate Trucks 6-Turbo, Munson Hospital
Criteria: Coverage of complexity tiers, representative manufacturers, at least 2 multi-option quotes, the 4M reference for template extraction, John's language across ≥10 years.
Stored in `ingestion/fixtures/demo-subset.json` with Drive file IDs and metadata.
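A plausible shape for the entries in `demo-subset.json`, sketched as a TypeScript interface. The exact fields are an assumption; only a Drive file ID plus enough metadata to route each file to the right parser and extractor is strictly required:

```ts
// Shape of an entry in ingestion/fixtures/demo-subset.json: illustrative assumption only.
interface DemoSubsetEntry {
  driveFileId: string;                                   // Google Drive file ID
  filename: string;                                      // name under ingestion/cache/
  docType: "product_spec" | "past_quote" | "vendor_quote";
  manufacturer?: string;                                 // product specs only
  complexityTier?: "simple" | "standard" | "complex";    // past quotes only
}
```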
Key Algorithms / Logic
Line-Item Description Extraction (Critical)
Regex-based extraction of line-item blocks from 22-year-old .doc files is a losing game — formatting drift, multi-option variations, and inconsistent CHARACTERISTICS blocks all break brittle parsers. We use a Haiku agent with structured output instead.
Pipeline per document:
- Parser converts `.doc` / `.xlsx` / `.pdf` → clean text (Billet-shaped)
- Extractor invokes Haiku with a document-type-specific prompt + Zod input schema (via tool-use mode; see the sketch after this list)
- Haiku returns structured JSON matching the schema:
- Product specs: manufacturer, series/model, cooling, pressure, CFM, HP, voltage, drive, dimensions, weight, noise_dba, raw description block
- Past quotes: customer info, date, quote number, line items (nested), totals, scope of work (if present)
- Line items: category (`ROTARY SCREW AIR COMPRESSOR` etc.), verbatim description block, qty, specs, unit price, delivery text
- Zod validates the response; schema violations fail loud with the raw response logged
- On validation failure: retry once with an error-repair prompt; if still failing, write to `errors.log` and move on (continue-on-error for demo)
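A minimal sketch of the extractor step, assuming the Anthropic TypeScript SDK's tool-use mode and `zod-to-json-schema` for the schema conversion. The model ID, tool name, and field list are placeholders for what `schemas.ts` and `prompts.ts` will actually define:

```ts
// extractor/extractor.ts: sketch only. Real schemas live in schemas.ts and
// prompts in prompts.ts; field names and the model ID below are assumptions.
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const lineItemsSchema = z.object({
  line_items: z.array(
    z.object({
      category: z.string(),        // e.g. "ROTARY SCREW AIR COMPRESSOR"
      description: z.string(),     // verbatim block copied from the source text
      qty: z.number().nullable(),
      unit_price: z.number().nullable(),
      delivery: z.string().nullable(),
    })
  ),
});

const client = new Anthropic(); // reads ANTHROPIC_API_KEY

export async function extractLineItems(docText: string, prompt: string) {
  const response = await client.messages.create({
    model: "claude-3-5-haiku-latest",        // placeholder: pin the exact model ID
    max_tokens: 4096,
    tools: [{
      name: "record_line_items",
      description: "Record every quote line item found in the document.",
      input_schema: zodToJsonSchema(lineItemsSchema) as any,
    }],
    tool_choice: { type: "tool", name: "record_line_items" }, // force structured output
    messages: [{ role: "user", content: `${prompt}\n\n${docText}` }],
  });

  const toolUse = response.content.find((b) => b.type === "tool_use");
  const parsed = lineItemsSchema.safeParse(
    toolUse && "input" in toolUse ? toolUse.input : undefined
  );
  if (!parsed.success) {
    // Caller handles the retry-with-repair prompt and errors.log (continue-on-error).
    throw new Error(`Extraction schema violation: ${parsed.error.message}`);
  }
  return parsed.data;
}
```

Forcing `tool_choice` to the extraction tool guarantees Haiku returns tool input rather than prose, which is what makes the Zod gate before DB writes meaningful.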
Why Haiku specifically:
- Cheap: ~$0.01/doc × 40 docs = ~$0.40 for the full demo subset
- Structured output via tool-use is reliable for this pattern
- Handles format drift gracefully where regex would break
- Dan's Claude Max plan absorbs dev iteration cost
What gets embedded (critical):
- Embed verbatim line-item description blocks from the source — NOT Haiku's paraphrase
- John's exact phrasing is the moat. Haiku's job is to identify block boundaries and structured fields, not to rewrite descriptions
- Each `quote_line_items.embedding` captures the source language that will eventually be reused by Claude Code when assembling future drafts
Prompt versioning:
- Prompts live in `extractor/prompts.ts` with version numbers (see the sketch below)
- Every extraction stores `extractor_version` in the row metadata (on `products`, `past_quotes`, `quote_line_items`; per the Decisions Log these columns were folded into E0's initial schema, so no mid-epic migration is needed)
- Enables re-extraction with a new prompt and A/B comparison against baseline
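A sketch of what a versioned prompt entry might look like; the export name and version format are assumptions:

```ts
// extractor/prompts.ts: versioning sketch. Names and version format are assumptions.
export const LINE_ITEM_EXTRACTION_PROMPT = {
  version: "2026-04-18.1",   // written to extractor_version on every row this run produces
  text: `Extract every line item from this Brehob quote.
Copy each description block VERBATIM. Do not paraphrase, summarize, or reformat it.`,
};
```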
Embedding Strategy
| Entity | Embedded? | Source Text | Purpose |
|---|---|---|---|
| `products.embedding` | Yes | Full spec sheet text | "What products match these specs?" |
| `past_quotes.embedding` | Yes | Full quote text + customer + summary | "Find quotes similar to this overall project" |
| `quote_line_items.embedding` | Yes | Individual line-item description block | The atomic retrieval unit — "Find the best description language for this config" |
| `vendor_quotes.embedding` | Yes | Full vendor quote text | Reference retrieval |
Batching: OpenAI allows ~2048 strings per request. Line items for 20 quotes ≈ 100-200 items total — one batch call.
Re-embed detection: Hash the source text; store hash in DB. On re-ingest, only re-embed changed hashes. (Not strictly needed for demo but cheap to add.)
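A sketch of the embedder with batching plus the source-text hash, assuming the official `openai` Node SDK. Helper names are assumptions; the comparison against stored hashes on re-ingest happens in the loader:

```ts
// embed.ts: batching + hash sketch, assuming the official openai SDK.
// Model and batch limit come from the design doc; helper names are assumptions.
import OpenAI from "openai";
import { createHash } from "node:crypto";

const openai = new OpenAI();              // reads OPENAI_API_KEY
const MODEL = "text-embedding-3-small";   // 1536-dim, matches the E0 schema
const MAX_BATCH = 2048;                   // per-request input limit

// Hash of the exact text that was embedded; the loader stores it for re-embed detection.
export const contentHash = (text: string) =>
  createHash("sha256").update(text).digest("hex");

export async function embedAll(texts: string[]): Promise<number[][]> {
  const out: number[][] = [];
  for (let i = 0; i < texts.length; i += MAX_BATCH) {
    const batch = texts.slice(i, i + MAX_BATCH);   // one call covers the whole demo subset
    const res = await openai.embeddings.create({ model: MODEL, input: batch });
    out.push(...res.data.map((d) => d.embedding));
  }
  return out;
}
```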
Data Model Changes
No schema changes — E0 created everything. E1 populates:
`products`, `past_quotes`, `quote_line_items`, `vendor_quotes`
Leaves untouched:
`quote_log`, `quote_log_items`, `feedback` (Full MVP)
Edge Cases & Gotchas
| Scenario | Expected Behavior | Why It's Tricky |
|---|---|---|
| `.doc` files (Office 97-2003) | Parse via `textutil` on macOS, fallback to `antiword` or cloud convert | mammoth.js handles `.docx` but not `.doc`. Most Brehob files are `.doc`. |
| Mixed `.doc` + `.docx` + `.pdf` in quotes | Each routes to the correct parser by extension | Need robust extension detection; some files lie about type |
| Inconsistent CHARACTERISTICS blocks | Parse what's there; store raw block for embedding anyway | Real data is messy — key:value alignment varies by file |
| Price field variations | Parse `Net $X Each`, `Net $X`, `Net: $X`, `$X Each` | John's formatting drift over 22 years |
| Multi-option quotes (~50% of quotes) | Parse each option as separate quote grouped by quote_number | Complex quotes have "Option 1" / "Option 2" tabs or sections |
| Empty template fields (Pressure: [blank]) | Store empty string; don't fail parse | Product spec templates have blanks filled per-quote |
| Duplicate customer quote files (rev1, rev2) | Keep latest modifiedTime; log others | Drive has -pricing quote 1.xls, -pricing quote 1.xls, etc. |
| Embedding API failure mid-batch | Retry with exponential backoff; resume from last successful row | OpenAI rate limits + transient errors |
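For the extension-routing rows above, a sketch of how the CLI might dispatch files to parsers. The parser export names are assumptions, and sniffing magic bytes for files that lie about their type is left as a follow-up:

```ts
// Extension-based parser routing sketch. Parser export names are assumptions;
// detecting files that lie about their type (magic-byte sniffing) is a TODO.
import path from "node:path";
import { parseWord } from "./parsers/word";
import { parseExcel } from "./parsers/excel";
import { parsePdf } from "./parsers/pdf";
import type { ParsedDoc } from "./parsers/types";

export async function parseFile(bytes: Buffer, filename: string): Promise<ParsedDoc> {
  const ext = path.extname(filename).toLowerCase();
  switch (ext) {
    case ".doc":
    case ".docx":
      return parseWord(bytes, filename);
    case ".xls":       // older pricing sheets in Drive use .xls
    case ".xlsx":
      return parseExcel(bytes, filename);
    case ".pdf":
      return parsePdf(bytes, filename);
    default:
      throw new Error(`No parser for ${ext} (${filename})`);
  }
}
```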
Testing Strategy
Test Layers
| Layer | Applies? | Notes |
|---|---|---|
| Unit tests | Yes | Parsers are pure functions — test each format against fixture files |
| Integration tests | Yes | End-to-end: seed subset → run ingest → query DB → verify row counts + sample rows |
| Retrieval quality tests | Yes | The golden test. Given a known input scenario, do we retrieve the expected line items? |
| Visual tests | No | No UI in E1 |
| E2E / manual | Yes | Eyeball 5 random line-item descriptions post-ingest; do they match the source doc? |
Required Fixtures
| Fixture Name | What It Tests | Priority |
|---|---|---|
| `fixtures/4m-industries.doc` | Multi-line-item parsing from the template reference | 🔴 High |
| `fixtures/quincy-qmb30.doc` | Product spec extraction | 🔴 High |
| `fixtures/slate-trucks.xlsx` | Excel pricing worksheet parsing (reference only, not for AI use) | 🟡 Medium |
| `fixtures/golden-scenario.json` | Input: "100HP oilless compressor for food-grade plant." Expected: retrieve Groeb Farms, 4M, and Powerex SEQ1007 line items in top 5. | 🔴 High |
Verification Rules
- Every parser has a fixture-driven unit test. No exceptions.
- The golden test runs on every CI (once CI exists post-demo) — retrieval regression is the most important signal.
- Line-item count spot check — after ingest, `SELECT COUNT(*) FROM quote_line_items` must be within expected range (~100-200 for demo subset).
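A sketch of how the golden retrieval test could run against pgvector, assuming the E0 column names (`quote_line_items.embedding`, a `quote_id` foreign key, `past_quotes.customer_name`) and reusing the embedder sketched earlier; the match check is deliberately loose:

```ts
// Golden retrieval test sketch. Table and column names are assumptions from the
// E0 schema; the pass condition mirrors the golden-scenario.json fixture.
import { Client } from "pg";
import { embedAll } from "../embed";

const SCENARIO = "100HP oilless compressor for food-grade plant";
const EXPECTED = ["Groeb Farms", "4M", "Powerex SEQ1007"]; // must all appear in the top 5

export async function runGoldenTest(db: Client): Promise<boolean> {
  const [queryVec] = await embedAll([SCENARIO]);
  const { rows } = await db.query(
    `SELECT li.description, pq.customer_name
       FROM quote_line_items li
       JOIN past_quotes pq ON pq.id = li.quote_id
      ORDER BY li.embedding <=> $1::vector   -- pgvector cosine distance
      LIMIT 5`,
    [JSON.stringify(queryVec)]  // pgvector accepts the '[x,y,...]' text form
  );
  const haystack = rows.map((r) => `${r.customer_name} ${r.description}`).join("\n");
  return EXPECTED.every((needle) => haystack.includes(needle));
}
```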
Stories
| Story | Summary | Status | PR |
|---|---|---|---|
| S0 | Drive downloader — pull curated subset to ingestion/cache/ | — | — |
| S1 | Word parser (.doc + .docx) via textutil + mammoth fallback | — | — |
| S2 | Excel parser (.xlsx) via SheetJS | — | — |
| S3 | PDF parser via pdf-parse | — | — |
| S4 | Haiku extractor scaffold — SDK client, tool-use mode, error handling | — | — |
| S5 | Product spec extraction: Zod schema + prompt + eval fixtures | — | — |
| S6 | Quote extraction: Zod schema + prompt + eval fixtures | — | — |
| S7 | Line-item extraction: Zod schema + prompt + eval fixtures — the critical one | — | — |
| S8 | Extraction eval runner (compare Haiku output vs hand-labeled fixtures) | — | — |
| S9 | Embedder (OpenAI text-embedding-3-small, batched) | — | — |
| S10 | Loader (pgvector inserts with UPSERT + hash-based re-embed detection) | — | — |
| S11 | CLI integration (pnpm ingest --subset=demo) + golden retrieval test | — | — |
Known Issues / Tech Debt
| Issue | Severity | Notes |
|---|---|---|
| No incremental ingestion (full re-run each time) | 🟡 Medium | Fine for demo; Full MVP should support incremental by hash |
| No monitoring of embedding costs | 🟢 Low | Demo cost is negligible (<$1); add cost logging in Full MVP |
| Parsers coupled to Brehob format quirks | 🟡 Medium | Expected — parsers generalize when extracted to Billet; in-repo keeps them pragmatic |
| No retry/queue for failed rows | 🟡 Medium | Accept for demo; add a .ingestion-errors.log and move on |
Open Questions
All E1 open questions resolved as of 2026-04-18. See Decisions Log for the locked-in outcomes. Add new questions here as they emerge during implementation.
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Line-item extraction accuracy is <90% on the subset | Medium | 🔴 High (demo value depends on this) | Iterate on the extractor against 5-10 hand-labeled fixtures before running on the full subset. Accept imperfection on outliers. |
| Retrieval quality is "meh" despite clean data | Medium | 🔴 High | Build the golden test first, then iterate. Try hybrid search (BM25 + vector) if pure vector underperforms. |
| `.doc` parsing has edge cases we don't anticipate | Medium | 🟡 Medium | Spot-check 5 random files after ingest; fall back to raw-text-only storage for files the parser chokes on (still embeddable, just no structured fields) |
| OpenAI embedding API changes or is rate-limited | Low | Medium | Batch + retry with backoff. If needed, fall back to local ONNX embeddings (same model Anvil uses) |
| Curated subset is too small to demonstrate John's voice | Medium | 🟡 Medium | 20 quotes covers 10+ years of John's language. If it feels thin post-ingest, expand to 40 before the demo. |
Decisions Log
| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-04-18 | Parsers structured as Billet-shaped modules in-repo | Per main design doc's compounding decision — QuoteAI informs Billet v0 | Depend on Billet (rejected: Billet isn't built), hand-roll without structure (rejected: no compounding) |
| 2026-04-18 | Line-item embeddings as atomic retrieval unit | Per main design doc — "individual LINE ITEM DESCRIPTIONS are the reusable atoms" | Quote-level only (rejected: too coarse), paragraph-level (rejected: arbitrary boundaries) |
| 2026-04-18 | Curated subset — minimal curation, iterate as needed | Grab ~15-20 docs with a few known-good picks (4M reference, John-flagged examples, recent 2025-2026 customer quotes, rest random). Pipeline supports incremental additions natively via hash-based re-embed detection. Iterate when the golden test reveals gaps. | Rigid tier-based criteria (rejected: premature curation), full-corpus ingest (rejected: slow feedback) |
| 2026-04-18 | OpenAI text-embedding-3-small | Per main design doc; Claymore already has key; 1536-dim matches schema | Local ONNX (retained as fallback), Voyage/Cohere (rejected: another vendor) |
| 2026-04-18 | Hash-based re-embed detection | Cheap to add, saves cost on re-runs | Always re-embed (rejected: wasteful) |
| 2026-04-18 | Golden test is the primary quality signal | Retrieval quality IS the product | Rely on eyeballing only (rejected: no regression protection) |
| 2026-04-18 | Haiku-based structured extraction (not regex) | .doc files over 22 years have too much format drift for brittle regex. Haiku with Zod tool-use handles variation gracefully. ~$0.01/doc, Claude Max absorbs dev cost. Pattern compounds: same idea could land in Billet v2. | Regex-only (rejected: brittle, see 22 years of format drift), Sonnet (rejected: overkill for structured extraction), LLM without schema validation (rejected: need Zod gate before DB writes) |
| 2026-04-18 | Embed verbatim description blocks, NOT Haiku-paraphrased text | John's exact phrasing IS the moat. Haiku identifies boundaries and structured fields; it does not rewrite descriptions. | Embed Haiku summaries (rejected: loses the source-language advantage) |
| 2026-04-18 | Continue-on-error strategy for extraction failures | One broken file shouldn't stop the whole ingest. Log to errors.log, move on, fix on next run. | Fail-fast (rejected: painful on 40-doc runs) |
| 2026-04-18 | Versioned prompts with extractor_version in DB | Enables A/B comparison when we iterate on extraction prompts | Unversioned prompts (rejected: can't detect regressions) |
| 2026-04-18 | Golden test scenario: "100HP oilless compressor for food-grade plant" | Expected retrievals (top 5) include Groeb Farms, 4M Industries, Powerex SEQ1007 line items. Gate: E1 cannot ship unless this passes. Same scenario also runs end-to-end through E2 + E3 as the demo walkthrough. | Generic sanity query (rejected: no coverage of John's distinctive language); multi-scenario battery (deferred: one scenario first, add more after demo) |
| 2026-04-18 | Schema-versioning + hash columns folded into E0's initial schema (no mid-epic migration) | Adds extractor_version TEXT, source_hash TEXT, content_hash TEXT to products, past_quotes, quote_line_items in 001_init.sql. Cheaper to land now (no data yet) than to migrate after E1 ingests. Unblocks Open Q #6 (hash strategy) by committing to both hash columns. | Post-E1 migration (rejected: more pain, more risk); separate versioning table (rejected: over-engineered) |
| 2026-04-18 | Drive access: manual download (Option A) | Dan manually pulls the curated subset to ingestion/cache/; CLI processes local copies. Drive MCP can't be called from a Node CLI (only MCP clients). Zero setup cost, ~20 files takes minutes. Full MVP will upgrade to Option B (Drive API with OAuth) when automation matters. | Drive API with OAuth (deferred to Full MVP: credential setup overhead not justified for demo); CC-orchestrated ingest (rejected: awkward architecture split) |
| 2026-04-18 | Extraction eval fixtures: 5-10 hand-labeled pairs | 2 simple quotes + 3 standard + 2 complex + 2-3 product specs. Each = source .doc + expected JSON. Runner diffs Haiku output vs expected, reports deltas. Unit-test-grade coverage for the extractor; catches regressions when prompts evolve. | Fewer (rejected: false confidence); many more (rejected: tedious to maintain); eyeball-only (rejected: no regression signal) |
| 2026-04-18 | Vendor quotes in MVP ingestion | Ingest 3-5 vendor quotes (e.g., US Wire Rope installation quote) referenced in the curated past-quote subset. Small scope add. Improves install-section description quality since the 4M template pulls language from vendor install quotes. | Skip vendor quotes until Full MVP (rejected: weakens demo's install-section output) |
| 2026-04-18 | End-to-end benchmark quotes (distinct from golden retrieval test) | Pick 2-3 finished Brehob quotes (4M Industries, Bowen Engineering, Munson Hospital as candidates) as system-output ground truth. Given the same form inputs, compare generated draft against benchmark for description fidelity, formatting, completeness. Complementary to golden retrieval test: this tests E1+E2+E3 together. Lives in E3 Testing Strategy. | Golden retrieval test only (rejected: doesn't test assembled output quality) |