Foundry

E1: Ingestion + Vector DB — Epic Design Doc

Status: 🔄 In Refinement (Step 0)
Authors: Dan Hannah & Clay
Created: 2026-04-18
Parent: QuoteAI Project Design Doc


Overview

Goals & Non-Goals

Goals:

  • Pull a curated subset of Brehob source docs from Google Drive
  • Convert .doc / .xlsx / .pdf → clean structured data using Billet-shaped parser modules
  • Extract structured fields for products, quotes, and line items (the atomic retrieval unit)
  • Generate embeddings for products, quotes, and individual line-item descriptions
  • Load everything into the pgvector schema created in E0
  • Provide a reproducible ingestion CLI (pnpm ingest) so we can rebuild the DB from scratch

Non-Goals:

  • No full-corpus ingestion — curated ~20 quotes + ~20 product specs for demo
  • No real-time sync with Drive — one-shot ingestion triggered manually
  • No admin UI — CLI only
  • No vendor quote ingestion — manual upload deferred to Full MVP
  • No pricing sheet ingestion (except as text reference — not for AI pricing, per design doc)

Problem Statement

QuoteAI's core value is retrieval quality on John's 22-year description library. Without real data in the DB, the MCP tools (E2) have nothing to query and the UI (E3) has nothing to assemble. Worse: bad retrieval = bad demo = no customer. The ingestion pipeline IS the product's moat.

The data is messy (.doc files from 2008-2025, inconsistent spreadsheet layouts, varying quote structures), lives in Google Drive, and needs structured field extraction plus embedding generation before it's useful.

What Is This Epic?

A reproducible ingestion pipeline that takes raw Drive documents and produces a populated pgvector database with:

  • Product catalog — structured specs from .doc spec sheets
  • Past quotes — text + embeddings for semantic search over full quotes
  • Line items — individually embedded descriptions, the atomic retrieval unit per the design doc
  • Vendor quotes — text + embeddings (reference material)

Parsers are Billet-shaped (per the main design doc's Billet/Anvil decision) so they can be extracted into Billet v0 later.


Context

Dependents

  • E2 (MCP Servers) — every MCP tool queries data loaded by E1
  • Demo experience — retrieval quality is the demo's single biggest make-or-break

Dependencies

  • E0 (Foundation) — schema must exist, DB must be runnable
  • Google Drive MCP — already connected and working (confirmed in this session)
  • OpenAI API key — for text-embedding-3-small (per design doc)

Current State

Source data lives in the QuoteAI - Brehob Quote History Google Drive folder (owner: johnhannah0624@gmail.com, shared with Dan). No local copy, no DB rows, no embeddings.

Affected Systems

| System / Layer | How It's Affected |
| --- | --- |
| ingestion/ directory | Fully populated with parsers, embedder, loader |
| Postgres | All tables populated with the curated subset |
| OpenAI API | Billed for embedding generation (expect <$1 for ~40 docs × a few hundred line items) |
| Google Drive | Read-only access via MCP |

Design

Module Structure (Billet-Shaped)

ingestion/
├── download.ts              # Pull curated file list from Drive
├── parsers/                 # Format conversion ONLY (Billet-shaped)
│   ├── word.ts              # textutil / mammoth — .doc / .docx → clean text
│   ├── excel.ts             # SheetJS — .xlsx → markdown tables
│   ├── pdf.ts               # pdf-parse → text
│   └── types.ts             # shared ParsedDoc interface
├── extractor/               # Haiku-based structured extraction (QuoteAI-specific)
│   ├── extractor.ts         # invoke Haiku with text + schema, validate output
│   ├── schemas.ts           # Zod schemas for product / quote / line-item
│   ├── prompts.ts           # extraction prompts (versioned, hand-tuned)
│   └── evals.ts             # hand-labeled fixture runner
├── embed.ts                 # OpenAI text-embedding-3-small
├── load.ts                  # pgvector inserts with UPSERT
├── cli.ts                   # `pnpm ingest --subset=demo` entry point
└── fixtures/
    ├── demo-subset.json     # curated file IDs for demo
    └── eval-labeled/        # hand-labeled extraction targets

Separation of concerns (post-Haiku-extractor decision):

| Layer | Job | Scope |
| --- | --- | --- |
| Parser | Format bytes → clean text | Billet-shaped. Extractable to Billet v0 later. |
| Extractor | Text → structured JSON (Zod-validated) | QuoteAI-specific. Uses Haiku. |
| Embedder | Structured fields + description blocks → vectors | OpenAI text-embedding-3-small. |
| Loader | Rows → Postgres with UPSERT | Straightforward CRUD. |

Why this shape: Parsers stay dumb and reusable — exactly what Billet's adapter pattern wants. The intelligence (knowing what a quote_line_item looks like, what a product row contains) lives in QuoteAI's extractor, where it belongs. When Billet ships, parsers/ extracts cleanly. The extractor stays here since it's schema-specific to QuoteAI.
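To illustrate where the parser/extractor boundary sits, here is a minimal sketch of the word parser path. The ParsedDoc field names and the exact textutil/mammoth wiring are assumptions, not the final contract; ParsedDoc is shown inline for brevity but lives in parsers/types.ts.

```typescript
// ingestion/parsers/word.ts (sketch): .doc via macOS textutil, .docx via mammoth.
// ParsedDoc field names here are assumptions, not the final types.ts contract.
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';
import path from 'node:path';
import mammoth from 'mammoth';

export interface ParsedDoc {
  sourcePath: string;                        // local path under ingestion/cache/
  format: 'doc' | 'docx' | 'xlsx' | 'pdf';   // detected input format
  text: string;                              // clean text or markdown, no structure yet
}

const run = promisify(execFile);

export async function parseWord(sourcePath: string): Promise<ParsedDoc> {
  if (path.extname(sourcePath).toLowerCase() === '.doc') {
    // textutil ships with macOS; antiword or a cloud convert is the fallback path
    const { stdout } = await run('textutil', ['-convert', 'txt', '-stdout', sourcePath]);
    return { sourcePath, format: 'doc', text: stdout };
  }
  // mammoth handles .docx only
  const { value } = await mammoth.extractRawText({ path: sourcePath });
  return { sourcePath, format: 'docx', text: value };
}
```

Note how nothing in the parser knows what a quote or a product is; that knowledge stays in the extractor.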

Curated Subset Selection

~20 quotes + ~20 product specs, covering:

Product specs (by manufacturer):

  • Quincy: QMB30, QSI-100, QGV-100, QSV-430, QT-10, QOF-400
  • Zeks: 500HSF, ZFC filters, refrigerated dryer
  • Hankison: 505BC, refrigerated dryer
  • Powerex: oilless scroll, vacuum
  • Dekker: vacuum
  • CRP: oil-water separator

Past quotes (by complexity tier):

  • Simple (5): Single-product + price — e.g., Henry Ford refurbished dryer
  • Standard (10): Multi-line system — e.g., Bowen Engineering 15hp duplex, Corwell Health Royal Oak, Purforms 50hp
  • Complex (5): Multi-option, controls, installation — e.g., 4M Industries (our template reference), Slate Trucks 6-Turbo, Munson Hospital

Criteria: Coverage of complexity tiers, representative manufacturers, at least 2 multi-option quotes, the 4M reference for template extraction, John's language across ≥10 years.

Stored in ingestion/fixtures/demo-subset.json with Drive file IDs and metadata.
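The fixture's exact shape hasn't been pinned down yet; one plausible sketch, expressed as the TypeScript type the downloader and CLI would read it into (field names are hypothetical):

```typescript
// ingestion/fixtures/demo-subset.json, sketched as a TypeScript type (hypothetical field names)
export interface DemoSubsetEntry {
  driveFileId: string;      // Google Drive file ID
  fileName: string;         // file name as it appears in Drive
  docType: 'product_spec' | 'past_quote' | 'vendor_quote';
  complexity?: 'simple' | 'standard' | 'complex';   // quotes only
  manufacturer?: string;                            // product specs only
}

// e.g. { "driveFileId": "…", "fileName": "4M Industries quote.doc",
//        "docType": "past_quote", "complexity": "complex" }
```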

Key Algorithms / Logic

Line-Item Description Extraction (Critical)

Regex-based extraction of line-item blocks from 22-year-old .doc files is a losing game — formatting drift, multi-option variations, and inconsistent CHARACTERISTICS blocks all break brittle parsers. We use a Haiku agent with structured output instead.

Pipeline per document (a code sketch of steps 2-5 follows the list):

  1. Parser converts .doc / .xlsx / .pdf → clean text (Billet-shaped)
  2. Extractor invokes Haiku with a document-type-specific prompt + Zod input schema (via tool-use mode)
  3. Haiku returns structured JSON matching the schema:
    • Product specs: manufacturer, series/model, cooling, pressure, CFM, HP, voltage, drive, dimensions, weight, noise_dba, raw description block
    • Past quotes: customer info, date, quote number, line items (nested), totals, scope of work (if present)
    • Line items: category (ROTARY SCREW AIR COMPRESSOR etc.), verbatim description block, qty, specs, unit price, delivery text
  4. Zod validates the response; schema violations fail loud with the raw response logged
  5. On validation failure: retry once with an error-repair prompt; if still failing, write to errors.log and move on (continue-on-error for demo)
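A sketch of steps 2-5 for the past-quote document type. The Zod fields, tool name, prompt text, and model alias below are placeholders, not the tuned versions that will live in schemas.ts and prompts.ts:

```typescript
// ingestion/extractor/extractor.ts (sketch): Haiku tool-use call gated by Zod, one repair retry
import Anthropic from '@anthropic-ai/sdk';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

// Placeholder schemas; the real ones live in extractor/schemas.ts
const LineItem = z.object({
  category: z.string(),             // e.g. "ROTARY SCREW AIR COMPRESSOR"
  description_block: z.string(),    // verbatim text copied from the source document
  qty: z.number(),
  unit_price: z.number().nullable(),
  delivery: z.string().nullable(),
});
const QuoteExtraction = z.object({
  customer: z.string(),
  quote_number: z.string().nullable(),
  date: z.string().nullable(),
  line_items: z.array(LineItem),
});

const client = new Anthropic();   // reads ANTHROPIC_API_KEY

export async function extractQuote(text: string, attempt = 0): Promise<z.infer<typeof QuoteExtraction>> {
  const response = await client.messages.create({
    model: 'claude-3-5-haiku-latest',
    max_tokens: 4096,
    tools: [{
      name: 'record_quote',
      description: 'Record the structured contents of a past quote',
      input_schema: zodToJsonSchema(QuoteExtraction) as any,
    }],
    tool_choice: { type: 'tool', name: 'record_quote' },
    messages: [{ role: 'user', content: `Extract the quote below. Copy description blocks verbatim.\n\n${text}` }],
  });

  const block = response.content.find((b) => b.type === 'tool_use');
  const parsed = QuoteExtraction.safeParse(block && block.type === 'tool_use' ? block.input : undefined);
  if (parsed.success) return parsed.data;

  // Validation failure: retry once with the error appended, then let the caller log and continue
  if (attempt === 0) {
    return extractQuote(`${text}\n\nPrevious attempt failed schema validation: ${parsed.error.message}`, 1);
  }
  throw new Error(`extraction failed validation: ${parsed.error.message}`);
}
```

The continue-on-error behavior from step 5 sits in the caller: catch the thrown error, append the file to errors.log, and move on to the next document.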

Why Haiku specifically:

  • Cheap: ~$0.01/doc × 40 docs = ~$0.40 for the full demo subset
  • Structured output via tool-use is reliable for this pattern
  • Handles format drift gracefully where regex would break
  • Dan's Claude Max plan absorbs dev iteration cost

What gets embedded (critical):

  • Embed verbatim line-item description blocks from the source — NOT Haiku's paraphrase
  • John's exact phrasing is the moat. Haiku's job is to identify block boundaries and structured fields, not to rewrite descriptions
  • Each quote_line_items.embedding captures the source language that will eventually be reused by Claude Code when assembling future drafts

Prompt versioning (sketched below):

  • Prompts live in extractor/prompts.ts with version numbers
  • Every extraction stores extractor_version in the row metadata (add to products, past_quotes, quote_line_items if not already there — small migration)
  • Enables re-extraction with a new prompt and A/B comparison against baseline
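A minimal sketch of what the versioned prompt module could look like (constant names are assumptions):

```typescript
// ingestion/extractor/prompts.ts (sketch)
export const EXTRACTOR_VERSION = 'quote-v1';   // written into extractor_version on every row this run produces

export const QUOTE_PROMPT = `You are extracting structured data from a Brehob customer quote.
Copy every line-item description block verbatim from the document. Do not paraphrase or reword.
Return one line item per option block.`;   // hand-tuned body evolves; bump EXTRACTOR_VERSION when it does
```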

Embedding Strategy

| Entity | Embedded? | Source Text | Purpose |
| --- | --- | --- | --- |
| products.embedding | Yes | Full spec sheet text | "What products match these specs?" |
| past_quotes.embedding | Yes | Full quote text + customer + summary | "Find quotes similar to this overall project" |
| quote_line_items.embedding | Yes | Individual line-item description block | The atomic retrieval unit — "Find the best description language for this config" |
| vendor_quotes.embedding | Yes | Full vendor quote text | Reference retrieval |

Batching: OpenAI allows ~2048 strings per request. Line items for 20 quotes ≈ 100-200 items total — one batch call.

Re-embed detection: Hash the source text; store hash in DB. On re-ingest, only re-embed changed hashes. (Not strictly needed for demo but cheap to add.)
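A sketch of the embedder with the hash check folded in. The OpenAI call is the documented embeddings endpoint; the helper names and the shape of the already-embedded set are assumptions:

```typescript
// ingestion/embed.ts (sketch): batched text-embedding-3-small with hash-based skip
import OpenAI from 'openai';
import { createHash } from 'node:crypto';

const openai = new OpenAI();   // reads OPENAI_API_KEY

export function contentHash(text: string): string {
  return createHash('sha256').update(text).digest('hex');
}

// texts: verbatim description blocks / full-document texts to embed.
// alreadyEmbedded: content hashes already present in the DB from a prior run.
export async function embedAll(texts: string[], alreadyEmbedded: Set<string>): Promise<Map<string, number[]>> {
  const pending = texts.filter((t) => !alreadyEmbedded.has(contentHash(t)));
  const byHash = new Map<string, number[]>();
  // Demo subset (~100-200 strings) fits in one request; chunking is kept for headroom
  for (let i = 0; i < pending.length; i += 2048) {
    const batch = pending.slice(i, i + 2048);
    const res = await openai.embeddings.create({ model: 'text-embedding-3-small', input: batch });
    res.data.forEach((d, j) => byHash.set(contentHash(batch[j]), d.embedding));
  }
  return byHash;
}
```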

Data Model Changes

No schema changes — E0 created everything. E1 populates the following tables (loader sketch below):

  • products
  • past_quotes
  • quote_line_items
  • vendor_quotes

Leaves untouched:

  • quote_log, quote_log_items, feedback (Full MVP)
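Loading is then a plain UPSERT into those tables. A minimal sketch for quote_line_items, assuming a unique key on (quote_id, content_hash); column names beyond the ones this doc mentions (embedding, extractor_version, content_hash) are guesses about the E0 schema:

```typescript
// ingestion/load.ts (sketch): idempotent insert; column names are assumptions about the E0 schema
import { Client } from 'pg';

export async function upsertLineItem(db: Client, item: {
  quoteId: number;
  description: string;        // verbatim block, also the embedding source
  embedding: number[];
  extractorVersion: string;
  contentHash: string;
}) {
  await db.query(
    `INSERT INTO quote_line_items (quote_id, description, embedding, extractor_version, content_hash)
     VALUES ($1, $2, $3::vector, $4, $5)
     ON CONFLICT (quote_id, content_hash) DO UPDATE
       SET description = EXCLUDED.description,
           embedding = EXCLUDED.embedding,
           extractor_version = EXCLUDED.extractor_version`,
    [item.quoteId, item.description, `[${item.embedding.join(',')}]`, item.extractorVersion, item.contentHash],
  );
}
```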

Edge Cases & Gotchas

| Scenario | Expected Behavior | Why It's Tricky |
| --- | --- | --- |
| .doc files (Office 97-2003) | Parse via textutil on macOS; fall back to antiword or cloud convert | mammoth.js handles .docx but not .doc. Most Brehob files are .doc. |
| Mixed .doc + .docx + .pdf in quotes | Each routes to the correct parser by extension | Need robust extension detection; some files lie about their type |
| Inconsistent CHARACTERISTICS blocks | Parse what's there; store the raw block for embedding anyway | Real data is messy — key:value alignment varies by file |
| Price field variations | Parse "Net $X Each", "Net $X", "Net: $X", "$X Each" | John's formatting drift over 22 years |
| Multi-option quotes (~50% of quotes) | Parse each option as a separate quote grouped by quote_number | Complex quotes have "Option 1" / "Option 2" tabs or sections |
| Empty template fields (Pressure: [blank]) | Store an empty string; don't fail the parse | Product spec templates have blanks filled per-quote |
| Duplicate customer quote files (rev1, rev2) | Keep the latest modifiedTime; log the others | Drive has multiple near-duplicate revisions of files like -pricing quote 1.xls |
| Embedding API failure mid-batch | Retry with exponential backoff (sketched below); resume from the last successful row | OpenAI rate limits + transient errors |
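A sketch of the backoff wrapper assumed around each embedding batch; it is generic TypeScript, not tied to a specific SDK error type:

```typescript
// Retry with exponential backoff; the embedder resumes from the last batch that succeeded
async function withBackoff<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      const delayMs = 1000 * 2 ** attempt;   // 1s, 2s, 4s, 8s
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// usage: await withBackoff(() => openai.embeddings.create({ model: 'text-embedding-3-small', input: batch }));
```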

Testing Strategy

Test Layers

| Layer | Applies? | Notes |
| --- | --- | --- |
| Unit tests | Yes | Parsers are pure functions — test each format against fixture files |
| Integration tests | Yes | End-to-end: seed subset → run ingest → query DB → verify row counts + sample rows |
| Retrieval quality tests | Yes | The golden test. Given a known input scenario, do we retrieve the expected line items? |
| Visual tests | No | No UI in E1 |
| E2E / manual | Yes | Eyeball 5 random line-item descriptions post-ingest; do they match the source doc? |

Required Fixtures

| Fixture Name | What It Tests | Priority |
| --- | --- | --- |
| fixtures/4m-industries.doc | Multi-line-item parsing from the template reference | 🔴 High |
| fixtures/quincy-qmb30.doc | Product spec extraction | 🔴 High |
| fixtures/slate-trucks.xlsx | Excel pricing worksheet parsing (reference only, not for AI use) | 🟡 Medium |
| fixtures/golden-scenario.json | Input: "100HP oilless compressor for food-grade plant." Expected: retrieve Groeb Farms, 4M, and Powerex SEQ1007 line items in the top 5. | 🔴 High |

Verification Rules

  1. Every parser has a fixture-driven unit test. No exceptions.
  2. The golden test (sketched at the end of this list) runs in CI on every change (once CI exists post-demo) — retrieval regression is the most important signal.
  3. Line-item count spot check — after ingest, SELECT COUNT(*) FROM quote_line_items must be within expected range (~100-200 for demo subset).
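A sketch of the golden test query. The scenario text and expected retrievals come from fixtures/golden-scenario.json; the connection wiring and the id-based comparison are assumptions:

```typescript
// Golden retrieval test (sketch): embed the scenario, ask pgvector for the top 5 line items
import OpenAI from 'openai';
import { Client } from 'pg';

const SCENARIO = '100HP oilless compressor for food-grade plant';

export async function goldenTest(db: Client, openai: OpenAI, expectedIds: number[]): Promise<boolean> {
  const { data } = await openai.embeddings.create({ model: 'text-embedding-3-small', input: SCENARIO });
  const queryVector = `[${data[0].embedding.join(',')}]`;

  // <=> is pgvector's cosine-distance operator
  const { rows } = await db.query(
    'SELECT id FROM quote_line_items ORDER BY embedding <=> $1::vector LIMIT 5',
    [queryVector],
  );
  const topIds = rows.map((r) => r.id);
  // Expected: the Groeb Farms, 4M, and Powerex SEQ1007 line items all appear in the top 5
  return expectedIds.every((id) => topIds.includes(id));
}
```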

Stories

| Story | Summary | Status | PR |
| --- | --- | --- | --- |
| S0 | Drive downloader — pull curated subset to ingestion/cache/ | | |
| S1 | Word parser (.doc + .docx) via textutil + mammoth fallback | | |
| S2 | Excel parser (.xlsx) via SheetJS | | |
| S3 | PDF parser via pdf-parse | | |
| S4 | Haiku extractor scaffold — SDK client, tool-use mode, error handling | | |
| S5 | Product spec extraction: Zod schema + prompt + eval fixtures | | |
| S6 | Quote extraction: Zod schema + prompt + eval fixtures | | |
| S7 | Line-item extraction: Zod schema + prompt + eval fixtures — the critical one | | |
| S8 | Extraction eval runner (compare Haiku output vs hand-labeled fixtures) | | |
| S9 | Embedder (OpenAI text-embedding-3-small, batched) | | |
| S10 | Loader (pgvector inserts with UPSERT + hash-based re-embed detection) | | |
| S11 | CLI integration (pnpm ingest --subset=demo) + golden retrieval test | | |

Known Issues / Tech Debt

| Issue | Severity | Notes |
| --- | --- | --- |
| No incremental ingestion (full re-run each time) | 🟡 Medium | Fine for demo; Full MVP should support incremental by hash |
| No monitoring of embedding costs | 🟢 Low | Demo cost is negligible (<$1); add cost logging in Full MVP |
| Parsers coupled to Brehob format quirks | 🟡 Medium | Expected — parsers generalize when extracted to Billet; in-repo keeps them pragmatic |
| No retry/queue for failed rows | 🟡 Medium | Accept for demo; add a .ingestion-errors.log and move on |

Open Questions

All E1 open questions resolved as of 2026-04-18. See Decisions Log for the locked-in outcomes. Add new questions here as they emerge during implementation.

Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Line-item extraction accuracy is <90% on the subset | Medium | 🔴 High (demo value depends on this) | Iterate on the extractor against 5-10 hand-labeled fixtures before running on the full subset. Accept imperfection on outliers. |
| Retrieval quality is "meh" despite clean data | Medium | 🔴 High | Build the golden test first, then iterate. Try hybrid search (BM25 + vector) if pure vector underperforms. |
| .doc parsing has edge cases we don't anticipate | Medium | 🟡 Medium | Spot-check 5 random files after ingest; fall back to raw-text-only storage for files the parser chokes on (still embeddable, just no structured fields) |
| OpenAI embedding API changes or is rate-limited | Low | Medium | Batch + retry with backoff. If needed, fall back to local ONNX embeddings (same model Anvil uses) |
| Curated subset is too small to demonstrate John's voice | Medium | 🟡 Medium | 20 quotes covers 10+ years of John's language. If it feels thin post-ingest, expand to 40 before the demo. |

Decisions Log

| Date | Decision | Rationale | Alternatives Considered |
| --- | --- | --- | --- |
| 2026-04-18 | Parsers structured as Billet-shaped modules in-repo | Per main design doc's compounding decision — QuoteAI informs Billet v0 | Depend on Billet (rejected: Billet isn't built), hand-roll without structure (rejected: no compounding) |
| 2026-04-18 | Line-item embeddings as atomic retrieval unit | Per main design doc — "individual LINE ITEM DESCRIPTIONS are the reusable atoms" | Quote-level only (rejected: too coarse), paragraph-level (rejected: arbitrary boundaries) |
| 2026-04-18 | Curated subset — minimal curation, iterate as needed | Grab ~15-20 docs with a few known-good picks (4M reference, John-flagged examples, recent 2025-2026 customer quotes, rest random). Pipeline supports incremental additions natively via hash-based re-embed detection. Iterate when the golden test reveals gaps. | Rigid tier-based criteria (rejected: premature curation), full-corpus ingest (rejected: slow feedback) |
| 2026-04-18 | OpenAI text-embedding-3-small | Per main design doc; Claymore already has key; 1536-dim matches schema | Local ONNX (retained as fallback), Voyage/Cohere (rejected: another vendor) |
| 2026-04-18 | Hash-based re-embed detection | Cheap to add, saves cost on re-runs | Always re-embed (rejected: wasteful) |
| 2026-04-18 | Golden test is the primary quality signal | Retrieval quality IS the product | Rely on eyeballing only (rejected: no regression protection) |
| 2026-04-18 | Haiku-based structured extraction (not regex) | .doc files over 22 years have too much format drift for brittle regex. Haiku with Zod tool-use handles variation gracefully. ~$0.01/doc, Claude Max absorbs dev cost. Pattern compounds: same idea could land in Billet v2. | Regex-only (rejected: brittle, see 22 years of format drift), Sonnet (rejected: overkill for structured extraction), LLM without schema validation (rejected: need Zod gate before DB writes) |
| 2026-04-18 | Embed verbatim description blocks, NOT Haiku-paraphrased text | John's exact phrasing IS the moat. Haiku identifies boundaries and structured fields; it does not rewrite descriptions. | Embed Haiku summaries (rejected: loses the source-language advantage) |
| 2026-04-18 | Continue-on-error strategy for extraction failures | One broken file shouldn't stop the whole ingest. Log to errors.log, move on, fix on next run. | Fail-fast (rejected: painful on 40-doc runs) |
| 2026-04-18 | Versioned prompts with extractor_version in DB | Enables A/B comparison when we iterate on extraction prompts | Unversioned prompts (rejected: can't detect regressions) |
| 2026-04-18 | Golden test scenario: "100HP oilless compressor for food-grade plant" | Expected retrievals (top 5) include Groeb Farms, 4M Industries, Powerex SEQ1007 line items. Gate: E1 cannot ship unless this passes. Same scenario also runs end-to-end through E2 + E3 as the demo walkthrough. | Generic sanity query (rejected: no coverage of John's distinctive language); multi-scenario battery (deferred: one scenario first, add more after demo) |
| 2026-04-18 | Schema-versioning + hash columns folded into E0's initial schema (no mid-epic migration) | Adds extractor_version TEXT, source_hash TEXT, content_hash TEXT to products, past_quotes, quote_line_items in 001_init.sql. Cheaper to land now (no data yet) than to migrate after E1 ingests. Unblocks Open Q #6 (hash strategy) by committing to both hash columns. | Post-E1 migration (rejected: more pain, more risk); separate versioning table (rejected: over-engineered) |
| 2026-04-18 | Drive access: manual download (Option A) | Dan manually pulls the curated subset to ingestion/cache/; CLI processes local copies. Drive MCP can't be called from a Node CLI (only MCP clients). Zero setup cost, ~20 files takes minutes. Full MVP will upgrade to Option B (Drive API with OAuth) when automation matters. | Drive API with OAuth (deferred to Full MVP: credential setup overhead not justified for demo); CC-orchestrated ingest (rejected: awkward architecture split) |
| 2026-04-18 | Extraction eval fixtures: 5-10 hand-labeled pairs | 2 simple quotes + 3 standard + 2 complex + 2-3 product specs. Each = source .doc + expected JSON. Runner diffs Haiku output vs expected, reports deltas. Unit-test-grade coverage for the extractor; catches regressions when prompts evolve. | Fewer (rejected: false confidence); many more (rejected: tedious to maintain); eyeball-only (rejected: no regression signal) |
| 2026-04-18 | Vendor quotes in MVP ingestion | Ingest 3-5 vendor quotes (e.g., US Wire Rope installation quote) referenced in the curated past-quote subset. Small scope add. Improves install-section description quality since the 4M template pulls language from vendor install quotes. | Skip vendor quotes until Full MVP (rejected: weakens demo's install-section output) |
| 2026-04-18 | End-to-end benchmark quotes (distinct from golden retrieval test) | Pick 2-3 finished Brehob quotes (4M Industries, Bowen Engineering, Munson Hospital as candidates) as system-output ground truth. Given the same form inputs, compare generated draft against benchmark for description fidelity, formatting, completeness. Complementary to golden retrieval test: this tests E1+E2+E3 together. Lives in E3 Testing Strategy. | Golden retrieval test only (rejected: doesn't test assembled output quality) |

Review
