E1: Ingestion + Vector DB — Epic Design Doc
Status: 🔄 In Refinement (Step 0) · Authors: Dan Hannah & Clay · Created: 2026-04-18 · Parent: QuoteAI Project Design Doc
Overview
Goals & Non-Goals
Goals:
- Pull a curated subset of Brehob source docs from Google Drive
- Convert `.doc` / `.xlsx` / `.pdf` → clean structured data using Billet-shaped parser modules
- Extract structured fields for products, quotes, and line items (the atomic retrieval unit)
- Generate embeddings for products, quotes, and individual line-item descriptions
- Load everything into the pgvector schema created in E0
- Provide a reproducible ingestion CLI (`pnpm ingest`) so we can rebuild the DB from scratch
Non-Goals:
- No full-corpus ingestion — curated ~20 quotes + ~20 product specs for demo
- No real-time sync with Drive — one-shot ingestion triggered manually
- No admin UI — CLI only
- No vendor quote ingestion — manual upload deferred to Full MVP
- No pricing sheet ingestion (except as text reference — not for AI pricing, per design doc)
Problem Statement
QuoteAI's core value is retrieval quality on John's 22-year description library. Without real data in the DB, the MCP tools (E2) have nothing to query and the UI (E3) has nothing to assemble. Worse: bad retrieval = bad demo = no customer. The ingestion pipeline IS the product's moat.
The data is messy (.doc files from 2008-2025, inconsistent spreadsheet layouts, varying quote structures), lives in Google Drive, and needs structured field extraction plus embedding generation before it's useful.
What Is This Epic?
A reproducible ingestion pipeline that takes raw Drive documents and produces a populated pgvector database with:
- Product catalog — structured specs from `.doc` spec sheets
- Past quotes — text + embeddings for semantic search over full quotes
- Line items — individually embedded descriptions, the atomic retrieval unit per the design doc
- Vendor quotes — text + embeddings (reference material)
Parsers are Billet-shaped (per the main design doc's Billet/Anvil decision) so they can be extracted into Billet v0 later.
Context
Dependents
- E2 (MCP Servers) — every MCP tool queries data loaded by E1
- Demo experience — retrieval quality is the demo's single biggest make-or-break
Dependencies
- E0 (Foundation) — schema must exist, DB must be runnable
- Google Drive MCP — already connected and working (confirmed in this session)
- OpenAI API key — for `text-embedding-3-small` (per design doc)
Current State
Source data lives in the "QuoteAI - Brehob Quote History" Google Drive folder (owner: johnhannah0624@gmail.com, shared with Dan). No local copy, no DB rows, no embeddings.
Affected Systems
| System / Layer | How It's Affected |
|---|---|
| `ingestion/` directory | Fully populated with parsers, embedder, loader |
| Postgres | All tables populated with the curated subset |
| OpenAI API | Billed for embedding generation (expect <$1 for ~40 docs × a few hundred line items) |
| Google Drive | Read-only access via MCP |
Design
Module Structure (Billet-Shaped)
ingestion/
├── download.ts # Pull curated file list from Drive
├── parsers/ # Format conversion ONLY (Billet-shaped)
│ ├── word.ts # textutil / mammoth — .doc / .docx → clean text
│ ├── excel.ts # SheetJS — .xlsx → markdown tables
│ ├── pdf.ts # pdf-parse → text
│ └── types.ts # shared ParsedDoc interface
├── extractor/ # Haiku-based structured extraction (QuoteAI-specific)
│ ├── extractor.ts # invoke Haiku with text + schema, validate output
│ ├── schemas.ts # Zod schemas for product / quote / line-item
│ ├── prompts.ts # extraction prompts (versioned, hand-tuned)
│ └── evals.ts # hand-labeled fixture runner
├── embed.ts # OpenAI text-embedding-3-small
├── load.ts # pgvector inserts with UPSERT
├── cli.ts # `pnpm ingest --subset=demo` entry point
└── fixtures/
├── demo-subset.json # curated file IDs for demo
└── eval-labeled/ # hand-labeled extraction targets
Separation of concerns (post-Haiku-extractor decision):
| Layer | Job | Scope |
|---|---|---|
| Parser | Format bytes → clean text | Billet-shaped. Extractable to Billet v0 later. |
| Extractor | Text → structured JSON (Zod-validated) | QuoteAI-specific. Uses Haiku. |
| Embedder | Structured fields + description blocks → vectors | OpenAI text-embedding-3-small. |
| Loader | Rows → Postgres with UPSERT | Straightforward CRUD. |
Why this shape: Parsers stay dumb and reusable — exactly what Billet's adapter pattern wants. The intelligence (knowing what a quote_line_item looks like, what a product row contains) lives in QuoteAI's extractor, where it belongs. When Billet ships, parsers/ extracts cleanly. The extractor stays here since it's schema-specific to QuoteAI.
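To make the parser/extractor boundary concrete, here is a minimal sketch of the shared contract in `parsers/types.ts`. The field names are assumptions for illustration, not the final interface:

```ts
// parsers/types.ts: minimal sketch of the Billet-shaped parser contract.
// Field names are illustrative assumptions, not the final interface.
export interface ParsedDoc {
  sourceFileId: string;        // Drive file ID (or local path for cached files)
  sourceFilename: string;
  format: "doc" | "docx" | "xlsx" | "pdf";
  text: string;                // clean text / markdown tables, no interpretation
  parsedAt: string;            // ISO timestamp
}

// Every parser is a dumb pure function: bytes in, ParsedDoc out.
export type Parser = (bytes: Buffer, filename: string) => Promise<ParsedDoc>;
```

Keeping this interface free of any quote- or product-specific fields is what lets `parsers/` lift cleanly into Billet later.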
Curated Subset Selection
~20 quotes + ~20 product specs, covering:
Product specs (by manufacturer):
- Quincy: QMB30, QSI-100, QGV-100, QSV-430, QT-10, QOF-400
- Zeks: 500HSF, ZFC filters, refrigerated dryer
- Hankison: 505BC, refrigerated dryer
- Powerex: oilless scroll, vacuum
- Dekker: vacuum
- CRP: oil-water separator
Past quotes (by complexity tier):
- Simple (5): Single-product + price — e.g., Henry Ford refurbished dryer
- Standard (10): Multi-line system — e.g., Bowen Engineering 15hp duplex, Corwell Health Royal Oak, Purforms 50hp
- Complex (5): Multi-option, controls, installation — e.g., 4M Industries (our template reference), Slate Trucks 6-Turbo, Munson Hospital
Criteria: Coverage of complexity tiers, representative manufacturers, at least 2 multi-option quotes, the 4M reference for template extraction, John's language across ≥10 years.
Stored in `ingestion/fixtures/demo-subset.json` with Drive file IDs and metadata.
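A plausible shape for the entries in `demo-subset.json`, sketched as a TypeScript interface. The exact fields are an assumption; only a Drive file ID plus enough metadata to route each file to the right parser and extractor is strictly required:

```ts
// Shape of an entry in ingestion/fixtures/demo-subset.json: illustrative assumption only.
interface DemoSubsetEntry {
  driveFileId: string;                                   // Google Drive file ID
  filename: string;                                      // name under ingestion/cache/
  docType: "product_spec" | "past_quote" | "vendor_quote";
  manufacturer?: string;                                 // product specs only
  complexityTier?: "simple" | "standard" | "complex";    // past quotes only
}
```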
Key Algorithms / Logic
Line-Item Description Extraction (Critical)
Regex-based extraction of line-item blocks from 22-year-old .doc files is a losing game — formatting drift, multi-option variations, and inconsistent CHARACTERISTICS blocks all break brittle parsers. We use a Haiku agent with structured output instead.
Pipeline per document:
- Parser converts `.doc` / `.xlsx` / `.pdf` → clean text (Billet-shaped)
- Extractor invokes Haiku with a document-type-specific prompt + Zod input schema (via tool-use mode; see the sketch after this list)
- Haiku returns structured JSON matching the schema:
- Product specs: manufacturer, series/model, cooling, pressure, CFM, HP, voltage, drive, dimensions, weight, noise_dba, raw description block
- Past quotes: customer info, date, quote number, line items (nested), totals, scope of work (if present)
- Line items: category (`ROTARY SCREW AIR COMPRESSOR` etc.), verbatim description block, qty, specs, unit price, delivery text
- Zod validates the response; schema violations fail loud with the raw response logged
- On validation failure: retry once with an error-repair prompt; if still failing, write to `errors.log` and move on (continue-on-error for demo)
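A minimal sketch of the extractor step, assuming the Anthropic TypeScript SDK's tool-use mode and `zod-to-json-schema` for the schema conversion. The model ID, tool name, and field list are placeholders for what `schemas.ts` and `prompts.ts` will actually define:

```ts
// extractor/extractor.ts: sketch only. Real schemas live in schemas.ts and
// prompts in prompts.ts; field names and the model ID below are assumptions.
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const lineItemsSchema = z.object({
  line_items: z.array(
    z.object({
      category: z.string(),        // e.g. "ROTARY SCREW AIR COMPRESSOR"
      description: z.string(),     // verbatim block copied from the source text
      qty: z.number().nullable(),
      unit_price: z.number().nullable(),
      delivery: z.string().nullable(),
    })
  ),
});

const client = new Anthropic(); // reads ANTHROPIC_API_KEY

export async function extractLineItems(docText: string, prompt: string) {
  const response = await client.messages.create({
    model: "claude-3-5-haiku-latest",        // placeholder: pin the exact model ID
    max_tokens: 4096,
    tools: [{
      name: "record_line_items",
      description: "Record every quote line item found in the document.",
      input_schema: zodToJsonSchema(lineItemsSchema) as any,
    }],
    tool_choice: { type: "tool", name: "record_line_items" }, // force structured output
    messages: [{ role: "user", content: `${prompt}\n\n${docText}` }],
  });

  const toolUse = response.content.find((b) => b.type === "tool_use");
  const parsed = lineItemsSchema.safeParse(
    toolUse && "input" in toolUse ? toolUse.input : undefined
  );
  if (!parsed.success) {
    // Caller handles the retry-with-repair prompt and errors.log (continue-on-error).
    throw new Error(`Extraction schema violation: ${parsed.error.message}`);
  }
  return parsed.data;
}
```

Forcing `tool_choice` to the extraction tool guarantees Haiku returns tool input rather than prose, which is what makes the Zod gate before DB writes meaningful.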
Why Haiku specifically:
- Cheap: ~$0.01/doc × 40 docs = ~$0.40 for the full demo subset
- Structured output via tool-use is reliable for this pattern
- Handles format drift gracefully where regex would break
- Dan's Claude Max plan absorbs dev iteration cost
What gets embedded (critical):
- Embed verbatim line-item description blocks from the source — NOT Haiku's paraphrase
- John's exact phrasing is the moat. Haiku's job is to identify block boundaries and structured fields, not to rewrite descriptions
- Each `quote_line_items.embedding` captures the source language that will eventually be reused by Claude Code when assembling future drafts
Prompt versioning:
- Prompts live in `extractor/prompts.ts` with version numbers (see the sketch below)
- Every extraction stores `extractor_version` in the row metadata (on `products`, `past_quotes`, `quote_line_items`; per the Decisions Log these columns were folded into E0's initial schema, so no mid-epic migration is needed)
- Enables re-extraction with a new prompt and A/B comparison against baseline
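A sketch of what a versioned prompt entry might look like; the export name and version format are assumptions:

```ts
// extractor/prompts.ts: versioning sketch. Names and version format are assumptions.
export const LINE_ITEM_EXTRACTION_PROMPT = {
  version: "2026-04-18.1",   // written to extractor_version on every row this run produces
  text: `Extract every line item from this Brehob quote.
Copy each description block VERBATIM. Do not paraphrase, summarize, or reformat it.`,
};
```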
Embedding Strategy
| Entity | Embedded? | Source Text | Purpose |
|---|---|---|---|
| `products.embedding` | Yes | Full spec sheet text | "What products match these specs?" |
| `past_quotes.embedding` | Yes | Full quote text + customer + summary | "Find quotes similar to this overall project" |
| `quote_line_items.embedding` | Yes | Individual line-item description block | The atomic retrieval unit — "Find the best description language for this config" |
| `vendor_quotes.embedding` | Yes | Full vendor quote text | Reference retrieval |
Batching: OpenAI allows ~2048 strings per request. Line items for 20 quotes ≈ 100-200 items total — one batch call.
Re-embed detection: Hash the source text; store hash in DB. On re-ingest, only re-embed changed hashes. (Not strictly needed for demo but cheap to add.)
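A sketch of the embedder with batching plus the source-text hash, assuming the official `openai` Node SDK. Helper names are assumptions; the comparison against stored hashes on re-ingest happens in the loader:

```ts
// embed.ts: batching + hash sketch, assuming the official openai SDK.
// Model and batch limit come from the design doc; helper names are assumptions.
import OpenAI from "openai";
import { createHash } from "node:crypto";

const openai = new OpenAI();              // reads OPENAI_API_KEY
const MODEL = "text-embedding-3-small";   // 1536-dim, matches the E0 schema
const MAX_BATCH = 2048;                   // per-request input limit

// Hash of the exact text that was embedded; the loader stores it for re-embed detection.
export const contentHash = (text: string) =>
  createHash("sha256").update(text).digest("hex");

export async function embedAll(texts: string[]): Promise<number[][]> {
  const out: number[][] = [];
  for (let i = 0; i < texts.length; i += MAX_BATCH) {
    const batch = texts.slice(i, i + MAX_BATCH);   // one call covers the whole demo subset
    const res = await openai.embeddings.create({ model: MODEL, input: batch });
    out.push(...res.data.map((d) => d.embedding));
  }
  return out;
}
```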
Data Model Changes
No schema changes — E0 created everything. E1 populates:
`products`, `past_quotes`, `quote_line_items`, `vendor_quotes`
Leaves untouched:
`quote_log`, `quote_log_items`, `feedback` (Full MVP)
Edge Cases & Gotchas
| Scenario | Expected Behavior | Why It's Tricky |
|---|---|---|
| `.doc` files (Office 97-2003) | Parse via `textutil` on macOS, fallback to `antiword` or cloud convert | mammoth.js handles `.docx` but not `.doc`. Most Brehob files are `.doc`. |
| Mixed `.doc` + `.docx` + `.pdf` in quotes | Each routes to the correct parser by extension | Need robust extension detection; some files lie about type |
| Inconsistent CHARACTERISTICS blocks | Parse what's there; store raw block for embedding anyway | Real data is messy — key:value alignment varies by file |
| Price field variations | Parse `Net $X Each`, `Net $X`, `Net: $X`, `$X Each` | John's formatting drift over 22 years |
| Multi-option quotes (~50% of quotes) | Parse each option as separate quote grouped by quote_number | Complex quotes have "Option 1" / "Option 2" tabs or sections |
| Empty template fields (Pressure: [blank]) | Store empty string; don't fail parse | Product spec templates have blanks filled per-quote |
| Duplicate customer quote files (rev1, rev2) | Keep latest modifiedTime; log others | Drive has -pricing quote 1.xls, -pricing quote 1.xls, etc. |
| Embedding API failure mid-batch | Retry with exponential backoff; resume from last successful row | OpenAI rate limits + transient errors |
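For the extension-routing rows above, a sketch of how the CLI might dispatch files to parsers. The parser export names are assumptions, and sniffing magic bytes for files that lie about their type is left as a follow-up:

```ts
// Extension-based parser routing sketch. Parser export names are assumptions;
// detecting files that lie about their type (magic-byte sniffing) is a TODO.
import path from "node:path";
import { parseWord } from "./parsers/word";
import { parseExcel } from "./parsers/excel";
import { parsePdf } from "./parsers/pdf";
import type { ParsedDoc } from "./parsers/types";

export async function parseFile(bytes: Buffer, filename: string): Promise<ParsedDoc> {
  const ext = path.extname(filename).toLowerCase();
  switch (ext) {
    case ".doc":
    case ".docx":
      return parseWord(bytes, filename);
    case ".xls":       // older pricing sheets in Drive use .xls
    case ".xlsx":
      return parseExcel(bytes, filename);
    case ".pdf":
      return parsePdf(bytes, filename);
    default:
      throw new Error(`No parser for ${ext} (${filename})`);
  }
}
```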
Testing Strategy
Test Layers
| Layer | Applies? | Notes |
|---|---|---|
| Unit tests | Yes | Parsers are pure functions — test each format against fixture files |
| Integration tests | Yes | End-to-end: seed subset → run ingest → query DB → verify row counts + sample rows |
| Retrieval quality tests | Yes | The golden test. Given a known input scenario, do we retrieve the expected line items? |
| Visual tests | No | No UI in E1 |
| E2E / manual | Yes | Eyeball 5 random line-item descriptions post-ingest; do they match the source doc? |
Required Fixtures
| Fixture Name | What It Tests | Priority |
|---|---|---|
| `fixtures/4m-industries.doc` | Multi-line-item parsing from the template reference | 🔴 High |
| `fixtures/quincy-qmb30.doc` | Product spec extraction | 🔴 High |
| `fixtures/slate-trucks.xlsx` | Excel pricing worksheet parsing (reference only, not for AI use) | 🟡 Medium |
| `fixtures/golden-scenario.json` | Input: "100HP oilless compressor for food-grade plant." Expected: retrieve Groeb Farms, 4M, and Powerex SEQ1007 line items in top 5. | 🔴 High |
Verification Rules
- Every parser has a fixture-driven unit test. No exceptions.
- The golden test runs on every CI (once CI exists post-demo) — retrieval regression is the most important signal.
- Line-item count spot check — after ingest, `SELECT COUNT(*) FROM quote_line_items` must be within expected range (~100-200 for demo subset).
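A sketch of how the golden retrieval test could run against pgvector, assuming the E0 column names (`quote_line_items.embedding`, a `quote_id` foreign key, `past_quotes.customer_name`) and reusing the embedder sketched earlier; the match check is deliberately loose:

```ts
// Golden retrieval test sketch. Table and column names are assumptions from the
// E0 schema; the pass condition mirrors the golden-scenario.json fixture.
import { Client } from "pg";
import { embedAll } from "../embed";

const SCENARIO = "100HP oilless compressor for food-grade plant";
const EXPECTED = ["Groeb Farms", "4M", "Powerex SEQ1007"]; // must all appear in the top 5

export async function runGoldenTest(db: Client): Promise<boolean> {
  const [queryVec] = await embedAll([SCENARIO]);
  const { rows } = await db.query(
    `SELECT li.description, pq.customer_name
       FROM quote_line_items li
       JOIN past_quotes pq ON pq.id = li.quote_id
      ORDER BY li.embedding <=> $1::vector   -- pgvector cosine distance
      LIMIT 5`,
    [JSON.stringify(queryVec)]  // pgvector accepts the '[x,y,...]' text form
  );
  const haystack = rows.map((r) => `${r.customer_name} ${r.description}`).join("\n");
  return EXPECTED.every((needle) => haystack.includes(needle));
}
```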
Stories
| Story | Summary | Status | PR |
|---|---|---|---|
| S0 | Drive downloader — pull curated subset to ingestion/cache/ | — | — |
| S1 | Word parser (.doc + .docx) via textutil + mammoth fallback | — | — |
| S2 | Excel parser (.xlsx) via SheetJS | — | — |
| S3 | PDF parser via pdf-parse | — | — |
| S4 | Haiku extractor scaffold — SDK client, tool-use mode, error handling | — | — |
| S5 | Product spec extraction: Zod schema + prompt + eval fixtures | — | — |
| S6 | Quote extraction: Zod schema + prompt + eval fixtures | — | — |
| S7 | Line-item extraction: Zod schema + prompt + eval fixtures — the critical one | — | — |
| S8 | Extraction eval runner (compare Haiku output vs hand-labeled fixtures) | — | — |
| S9 | Embedder (OpenAI text-embedding-3-small, batched) | — | — |
| S10 | Loader (pgvector inserts with UPSERT + hash-based re-embed detection) | — | — |
| S11 | CLI integration (pnpm ingest --subset=demo) + golden retrieval test | — | — |
Known Issues / Tech Debt
| Issue | Severity | Notes |
|---|---|---|
| No incremental ingestion (full re-run each time) | 🟡 Medium | Fine for demo; Full MVP should support incremental by hash |
| No monitoring of embedding costs | 🟢 Low | Demo cost is negligible (<$1); add cost logging in Full MVP |
| Parsers coupled to Brehob format quirks | 🟡 Medium | Expected — parsers generalize when extracted to Billet; in-repo keeps them pragmatic |
| No retry/queue for failed rows | 🟡 Medium | Accept for demo; add a .ingestion-errors.log and move on |
Open Questions
All E1 open questions resolved as of 2026-04-18. See Decisions Log for the locked-in outcomes. Add new questions here as they emerge during implementation.
Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Line-item extraction accuracy is <90% on the subset | Medium | 🔴 High (demo value depends on this) | Iterate on the extractor against 5-10 hand-labeled fixtures before running on the full subset. Accept imperfection on outliers. |
| Retrieval quality is "meh" despite clean data | Medium | 🔴 High | Build the golden test first, then iterate. Try hybrid search (BM25 + vector) if pure vector underperforms. |
| `.doc` parsing has edge cases we don't anticipate | Medium | 🟡 Medium | Spot-check 5 random files after ingest; fall back to raw-text-only storage for files the parser chokes on (still embeddable, just no structured fields) |
| OpenAI embedding API changes or is rate-limited | Low | Medium | Batch + retry with backoff. If needed, fall back to local ONNX embeddings (same model Anvil uses) |
| Curated subset is too small to demonstrate John's voice | Medium | 🟡 Medium | 20 quotes covers 10+ years of John's language. If it feels thin post-ingest, expand to 40 before the demo. |
Decisions Log
| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-04-18 | Parsers structured as Billet-shaped modules in-repo | Per main design doc's compounding decision — QuoteAI informs Billet v0 | Depend on Billet (rejected: Billet isn't built), hand-roll without structure (rejected: no compounding) |
| 2026-04-18 | Line-item embeddings as atomic retrieval unit | Per main design doc — "individual LINE ITEM DESCRIPTIONS are the reusable atoms" | Quote-level only (rejected: too coarse), paragraph-level (rejected: arbitrary boundaries) |
| 2026-04-18 | Curated subset — minimal curation, iterate as needed | Grab ~15-20 docs with a few known-good picks (4M reference, John-flagged examples, recent 2025-2026 customer quotes, rest random). Pipeline supports incremental additions natively via hash-based re-embed detection. Iterate when the golden test reveals gaps. | Rigid tier-based criteria (rejected: premature curation), full-corpus ingest (rejected: slow feedback) |
| 2026-04-18 | OpenAI text-embedding-3-small | Per main design doc; Claymore already has key; 1536-dim matches schema | Local ONNX (retained as fallback), Voyage/Cohere (rejected: another vendor) |
| 2026-04-18 | Hash-based re-embed detection | Cheap to add, saves cost on re-runs | Always re-embed (rejected: wasteful) |
| 2026-04-18 | Golden test is the primary quality signal | Retrieval quality IS the product | Rely on eyeballing only (rejected: no regression protection) |
| 2026-04-18 | Haiku-based structured extraction (not regex) | .doc files over 22 years have too much format drift for brittle regex. Haiku with Zod tool-use handles variation gracefully. ~$0.01/doc, Claude Max absorbs dev cost. Pattern compounds: same idea could land in Billet v2. | Regex-only (rejected: brittle, see 22 years of format drift), Sonnet (rejected: overkill for structured extraction), LLM without schema validation (rejected: need Zod gate before DB writes) |
| 2026-04-18 | Embed verbatim description blocks, NOT Haiku-paraphrased text | John's exact phrasing IS the moat. Haiku identifies boundaries and structured fields; it does not rewrite descriptions. | Embed Haiku summaries (rejected: loses the source-language advantage) |
| 2026-04-18 | Continue-on-error strategy for extraction failures | One broken file shouldn't stop the whole ingest. Log to errors.log, move on, fix on next run. | Fail-fast (rejected: painful on 40-doc runs) |
| 2026-04-18 | Versioned prompts with extractor_version in DB | Enables A/B comparison when we iterate on extraction prompts | Unversioned prompts (rejected: can't detect regressions) |
| 2026-04-18 | Golden test scenario: "100HP oilless compressor for food-grade plant" | Expected retrievals (top 5) include Groeb Farms, 4M Industries, Powerex SEQ1007 line items. Gate: E1 cannot ship unless this passes. Same scenario also runs end-to-end through E2 + E3 as the demo walkthrough. | Generic sanity query (rejected: no coverage of John's distinctive language); multi-scenario battery (deferred: one scenario first, add more after demo) |
| 2026-04-18 | Schema-versioning + hash columns folded into E0's initial schema (no mid-epic migration) | Adds extractor_version TEXT, source_hash TEXT, content_hash TEXT to products, past_quotes, quote_line_items in 001_init.sql. Cheaper to land now (no data yet) than to migrate after E1 ingests. Unblocks Open Q #6 (hash strategy) by committing to both hash columns. | Post-E1 migration (rejected: more pain, more risk); separate versioning table (rejected: over-engineered) |
| 2026-04-18 | Drive access: manual download (Option A) | Dan manually pulls the curated subset to ingestion/cache/; CLI processes local copies. Drive MCP can't be called from a Node CLI (only MCP clients). Zero setup cost, ~20 files takes minutes. Full MVP will upgrade to Option B (Drive API with OAuth) when automation matters. | Drive API with OAuth (deferred to Full MVP: credential setup overhead not justified for demo); CC-orchestrated ingest (rejected: awkward architecture split) |
| 2026-04-18 | Extraction eval fixtures: 5-10 hand-labeled pairs | 2 simple quotes + 3 standard + 2 complex + 2-3 product specs. Each = source .doc + expected JSON. Runner diffs Haiku output vs expected, reports deltas. Unit-test-grade coverage for the extractor; catches regressions when prompts evolve. | Fewer (rejected: false confidence); many more (rejected: tedious to maintain); eyeball-only (rejected: no regression signal) |
| 2026-04-18 | Vendor quotes in MVP ingestion | Ingest 3-5 vendor quotes (e.g., US Wire Rope installation quote) referenced in the curated past-quote subset. Small scope add. Improves install-section description quality since the 4M template pulls language from vendor install quotes. | Skip vendor quotes until Full MVP (rejected: weakens demo's install-section output) |
| 2026-04-18 | End-to-end benchmark quotes (distinct from golden retrieval test) | Pick 2-3 finished Brehob quotes (4M Industries, Bowen Engineering, Munson Hospital as candidates) as system-output ground truth. Given the same form inputs, compare generated draft against benchmark for description fidelity, formatting, completeness. Complementary to golden retrieval test: this tests E1+E2+E3 together. Lives in E3 Testing Strategy. | Golden retrieval test only (rejected: doesn't test assembled output quality) |