Epic: Ingestion Foundation

Roadmap item 1 of brehob-launch · the shared substrate everything else consumes · absorbs the former Gate-0 spike — the eval harness is the acceptance gate on this build, not a throwaway before it · target: DONE before M3 (Jul 20–31, "Document ingestion & KB prep"), built on the pre-kickoff runway (now → Jul 6) + kickoff week · → North Star B1.

Refined to story-level detail 2026-06-15. Architecture is locked in sub-systems/ingestion-pipeline (OD1–OD12) — this epic is the work breakdown, not the design; do not re-litigate the ODs here. Code references verified against the repo 2026-06-15.

Objective

Make Autri's ingestion pipeline handle the real world: legacy formats arrive, get converted server-side, route down the right path (structured skips normalization; unstructured normalizes to markdown — DB-3), and produce retrievable chunks carrying both lexical anchors and typed attributes — then expose a filter-then-rank operator over those attributes (Autri's one missing retrieval primitive, serving Brehob spec-match and dev-memory recency). Fail loudly and visibly, and do all LLM-touched work at Batch economics.

This epic absorbs the former Gate-0 spike: the eval harness is the acceptance gate on the build. Pass bars are authored before any code (S0). The early go/no-go read that Gate-0a was meant to give (conversion approach, table fidelity, regime mix, $/doc) now comes from the first measured checkpoint after S1–S2. That checkpoint runs on production code: on a "go" the code is kept; on a no-go the same Plan B applies (conversion-approach rework + heavier curation + spend the +3-week extension; the signed deal is not at risk).

Work breakdown

Nine stories, dependency-ordered. Each carries: deliverable / mechanics (code-grounded) / depends-on / measured acceptance / rough size. The architecture (storage shape, operator surface, extraction routing, schema lifecycle) is settled in the sub-system doc — cited as OD#, not re-opened. Old→new ID map from the 6/15 skeleton: old S3 split into S3 (schema substrate) + S4 (extraction); old S4 batch → S7; old S5 failure → S8; old S6 operator → S5; S6 (curation UI) is new.

S0 — Eval gold + pass bars (measure-first gate).

Deliverable: a versioned gold set + written numeric pass bars, committed before any build. Three parts: (a) query gold: quote-scenario queries in per-index forms (NL→vector, keyword→FTS, attribute-filter→the new operator) with labeled relevant chunks; (b) attribute-extraction gold: hand-labeled correct typed values per doc (hp/cfm/psi/lubrication/date); (c) pass bars: extraction accuracy ≥ X%, recall@k ≥ Y per index/question-type, one-time $/doc ≤ Z, recurring $/query ≤ W (the locked values are listed under Pass bars).
Mechanics: authored under the eval-pipeline skill discipline (per-index *.queries.json, baseline-first, significance over point estimates). Dan authors the query gold (~half a day — what a salesperson actually asks); AI authors the per-index forms + labels the attribute gold. Seeded from the Slate Trucks pair (recent + complete: spreadsheet + proposal). Harness is local-only (DATABASE_URL from .env.local), runs the real retrieval code.
Depends on: nothing (Dan-authored). Gates the acceptance of S2/S4/S5.
Acceptance: gold committed; pass bars are numbers, written down, agreed by Dan before S1 code lands.
Size: ~1 day (Dan ~0.5 + AI ~0.5–1).

S1 — Conversion stage, productionized.

Deliverable: server-side conversion of legacy formats → markdown, mounted in the ingestion path. .xlsx/.xls via the ported SheetJS parser; .docx via mammoth; .doc/.rtf via the chosen server-side converter; unknown extension → typed "unsupported" failure, not a crash.
Mechanics: port quoteai/ingestion/parsers/excel.ts (parseExcel(filePath, opts) → ParsedDoc; SheetJS; .xlsx+.xls; sheetToMarkdown; caps MAX_ROWS=500 / MAX_COLS=30) into Autri's server-side worker — pure JS, ports cleanly. The .doc/.rtf path is the real infra decision (former Gate-0a ①): container-image Lambda vs Fargate task running LibreOffice — LibreOffice has known Lambda traps (read-only filesystem / /tmp-only, layer size); the parity + cost comparison decides. (QuoteAI's textutil macOS path cannot run server-side.)
Depends on: nothing. Acceptance scored vs S0's conversion-fidelity rubric.
Acceptance: the Slate Trucks spreadsheet + a real … - Final.doc convert at the S0 conversion-fidelity bar (rows/headings preserved, junk rate under bar). ⚠️ Verify the Slate Trucks sheet against the 500-row cap — a master price list could exceed it and silently truncate line-items.
Size: ~2–4 days (the container/Fargate infra is the meat).
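A minimal sketch of the extension dispatch and the typed "unsupported" failure described above. All names here (planConversion, ConversionPlan, exceedsCaps) are hypothetical; only the extensions, the converter choices, and the 500/30 caps come from this story. The .doc/.rtf entry stands in for whichever server-side converter the infra comparison picks.

```typescript
// Hypothetical sketch: extension → converter dispatch with a typed failure.
type Converter = "sheetjs" | "mammoth" | "libreoffice";

type ConversionPlan =
  | { kind: "convert"; converter: Converter }
  | { kind: "unsupported"; extension: string; reason: string };

const CONVERTER_BY_EXTENSION = new Map<string, Converter>([
  [".xlsx", "sheetjs"],
  [".xls", "sheetjs"],
  [".docx", "mammoth"],
  [".doc", "libreoffice"], // assumption: placeholder for the chosen converter
  [".rtf", "libreoffice"],
]);

// Lower-cased extension including the dot, or "" when there is none.
function extensionOf(fileName: string): string {
  const dot = fileName.lastIndexOf(".");
  return dot <= 0 ? "" : fileName.slice(dot).toLowerCase();
}

// Unknown extensions become a typed result the caller can surface, not a throw.
function planConversion(fileName: string): ConversionPlan {
  const extension = extensionOf(fileName);
  const converter = CONVERTER_BY_EXTENSION.get(extension);
  if (converter === undefined) {
    return { kind: "unsupported", extension, reason: `no converter for "${extension}"` };
  }
  return { kind: "convert", converter };
}

// Caps quoted from the excel.ts description above.
const MAX_ROWS = 500;
const MAX_COLS = 30;

// True when a sheet would be truncated by the caps; callers should flag it
// rather than let line-items disappear silently.
function exceedsCaps(rowCount: number, colCount: number): boolean {
  return rowCount > MAX_ROWS || colCount > MAX_COLS;
}
```

Returning the failure as data keeps it traceable to a document and a reason, which is the shape S8 needs. The exceedsCaps guard is one way to turn the 500-row warning into a visible signal.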

S2 — Two-path routing + table→row-chunk→cell-attribute.

Deliverable: converted input routes correctly (headingless prose-dumps → Haiku grouping rather than poor deterministic chunking; pipe-tables → row-granular chunks carrying cell positions). This adds the table-aware chunking the chunker lacks today.
Mechanics: routing lives in ingestion/extractor/route.ts (STRUCTURED markdown → deterministic-chunk.ts; PROSE + low-coverage STRUCTURED → LLM). Tables today land wholesale, each stored as one chunk_type:'table' chunk (extractor.ts ~595–613), with no cell parsing. Add table-grid handling: parse the pipe-table header → columns; each row → its own chunk (retrievable as "the 45hp Powerex line item + its price"), carrying cell values keyed by column for S4's deterministic cell/grid extraction. Display stays whole-table (retrieval ≠ display granularity). Verify on S1's real converted output (uglier than authored markdown).
Depends on: S1. ✅ De-risked early (6/15): an early wave-1 spike (thin excel.ts port + grid-parse on the Slate Trucks sheet) tests the row-granular grid-parse before S3/S4 commit — see Dependencies & execution waves.
Acceptance: a Slate Trucks pricing table produces row-level chunks; the former Gate-0a table-fidelity question (②) clears its bar on the scorecard.
Size: ~2–3 days. Higher risk than its size implies — S4 deterministic extraction's strongest (free, exact) path leans on this; if it slips, extraction falls back to the LLM (cost + accuracy hit). The early spike is the mitigation.
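The header → columns, row → chunk step can be sketched as below. This is an illustration under assumptions, not the chunker's actual code: RowChunk, tableToRowChunks, and the "table_row" chunk type are hypothetical names, and escaped pipes inside cells are not handled.

```typescript
// Hypothetical sketch: one chunk per pipe-table row, cells keyed by column.
interface RowChunk {
  chunkType: "table_row"; // assumed name; today's chunker only has 'table'
  rowIndex: number;
  cells: Record<string, string>;
  text: string; // flattened "column: value" form for retrieval
}

// Split "| a | b |" into ["a", "b"]. Escaped pipes are not handled here.
function splitPipeRow(line: string): string[] {
  return line
    .trim()
    .replace(/^\|/, "")
    .replace(/\|$/, "")
    .split("|")
    .map((cell) => cell.trim());
}

// The markdown delimiter row: cells made only of hyphens with optional colons.
function isSeparatorRow(cells: string[]): boolean {
  return cells.length > 0 && cells.every((cell) => /^:?-+:?$/.test(cell));
}

function tableToRowChunks(markdownTable: string): RowChunk[] {
  const lines = markdownTable
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("|"));
  const [headerLine, ...bodyLines] = lines;
  if (headerLine === undefined) return [];

  const header = splitPipeRow(headerLine);
  const chunks: RowChunk[] = [];
  for (const line of bodyLines) {
    const cells = splitPipeRow(line);
    if (isSeparatorRow(cells)) continue;
    const keyed: Record<string, string> = {};
    // Missing trailing cells become empty strings instead of shifting columns.
    header.forEach((column, i) => {
      keyed[column] = cells[i] ?? "";
    });
    chunks.push({
      chunkType: "table_row",
      rowIndex: chunks.length,
      cells: keyed,
      text: header.map((column) => `${column}: ${keyed[column] ?? ""}`).join("; "),
    });
  }
  return chunks;
}
```

Keying cells by column name is what lets S4's cell-coordinate route map column → attribute without an LLM call. Real converted output from S1 will likely need more defensive handling (merged headers, blank header cells) than this sketch shows.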

S3 — Per-KB attribute schema + JSONB storage + typed indexes (the OD1 substrate). [new — split from old S3]

Deliverable: the storage/indexing substrate the attribute capability stands on — a per-KB declared attribute schema, JSONB attribute storage on chunks, typed indexes per declared field, write-time validation.
Mechanics (OD1/OD2 locked): add attributes_schema JSONB to knowledge_bases (today 002_kb_primitive.sql has only name/slug/description — the natural home). Add an attributes JSONB column to chunks (today 001_init.sql has no typed or keyword columns). On declaration/promote, create a typed composite expression-index per distinct declared field-name, keyed on (knowledge_base_id, expr) — e.g. CREATE INDEX CONCURRENTLY … ON chunks (knowledge_base_id, ((attributes->>'hp')::numeric)) (red-team 6/15: composite-per-field-name over per-KB-partial, for tenant scaling — see note). Write-time validation against the declared schema.
Depends on: nothing structurally — build parallel to S1/S2. Precondition for S4 (storage target) and S5 (filter target).
Acceptance: a declared schema persists; a chunk written with a typed attribute validates + lands in JSONB; EXPLAIN shows a WHERE plan using the composite index.
Size: ~2–3 days. ✅ Red-team resolved (6/15): per-field index DDL is owned by the runtime promote/declare routine (CREATE INDEX CONCURRENTLY IF NOT EXISTS, idempotent, online) — not the migration system, which owns only the static attributes / attributes_schema columns (forced by OD1's runtime-declared schemas). Index shape = composite (knowledge_base_id, expr) per distinct field-name, not one partial index per KB×field — this bounds index count by distinct field-names, not tenants × fields. Refines OD1's literal partial-index form; same intent (typed, KB-scoped filtering).
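A sketch of what the promote/declare routine's DDL generation and the write-time validation might look like. Names (indexDdl, validateAttributes, the chunks_attr_*_idx naming) are hypothetical. One assumption beyond the text: dates are stored as ISO-8601 strings and indexed as text, because PostgreSQL does not accept a direct text-to-date cast in an index expression (that cast is not immutable).

```typescript
// Hypothetical sketch of per-field index DDL + write-time validation.
type AttrType = "numeric" | "text" | "date";
type AttributesSchema = Record<string, AttrType>;

// Builds the idempotent, online DDL for one declared field.
// Field names are restricted to a safe identifier shape because they are
// interpolated into SQL text.
function indexDdl(field: string, type: AttrType): string {
  if (!/^[a-z][a-z0-9_]*$/.test(field)) {
    throw new Error(`invalid field name: ${field}`);
  }
  // Assumption: dates are ISO-8601 text (sortable as text), so only numeric
  // fields get a cast.
  const expr =
    type === "numeric"
      ? `((attributes->>'${field}')::numeric)`
      : `(attributes->>'${field}')`;
  return (
    `CREATE INDEX CONCURRENTLY IF NOT EXISTS chunks_attr_${field}_idx ` +
    `ON chunks (knowledge_base_id, ${expr})`
  );
}

// Returns a list of problems; an empty list means the write may proceed.
function validateAttributes(
  schema: AttributesSchema,
  attributes: Record<string, unknown>,
): string[] {
  const errors: string[] = [];
  for (const key of Object.keys(attributes)) {
    if (!Object.prototype.hasOwnProperty.call(schema, key)) {
      errors.push(`undeclared attribute: ${key}`);
      continue;
    }
    const type = schema[key];
    const value = attributes[key];
    if (type === "numeric" && (typeof value !== "number" || !Number.isFinite(value))) {
      errors.push(`${key}: expected a finite number`);
    } else if (type === "text" && typeof value !== "string") {
      errors.push(`${key}: expected a string`);
    } else if (type === "date" && (typeof value !== "string" || !/^\d{4}-\d{2}-\d{2}$/.test(value))) {
      errors.push(`${key}: expected an ISO date (YYYY-MM-DD)`);
    }
  }
  return errors;
}
```

Because the index is keyed on field name rather than on KB, calling indexDdl for a field a second tenant declares yields the same statement, and IF NOT EXISTS makes it a no-op.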

S4 — Extraction stage: per-chunk keyword metadata + typed-attribute extraction (deterministic-first, LLM-fallback). [old S3, minus storage]

Deliverable: one ingestion stage emitting, per chunk, (a) keyword/lexical metadata (fixes prose FTS=0) and (b) typed attributes targeting the declared schema. Keyword + attribute extraction are one stage.
Mechanics:
- Keyword side: generate salient lexical anchors per chunk; store in a keyword column folded into the FTS query. Note: FTS today is to_tsvector computed at query time (functional, not a stored column), so this = add the keyword column + a generated tsvector / expression index + widen retrieval/src/fts-search.ts and the lookup path to query it. Biggest retrieval lever: prose chunks otherwise return FTS=0 → hybrid collapses to vector-only on exactly the content that dominates Brehob + dev-memory.
- Attribute side (OD5/OD6, deterministic-first / LLM-fallback): deterministic family on machine-regular sources — cell-coordinate (S2's row chunks: column→attribute), table-grid parse, labeled-pattern regex; LLM-fallback on variable prose, piggybacked on the existing Haiku call in ingestion/extractor/sdk-client.ts (invokeSdkExtraction → anthropic().messages.create ~line 136; Haiku 4.5; versions chunk-grouping-v3 / -prose-v1) — no second pass. Extraction targets the declared schema (synonyms collapse: "sq ft"→square_footage); genuinely-new concepts → candidate flags (OD9), consumed by S6. The temporal attribute (date) is what S5's recency rank-boost keys on (M1).
Depends on: S2 (row-granular tables for cell/grid), S3 (schema to target + typed storage).
Acceptance (measured): re-run the scorecard pre/post on unstructured docs — hybrid-recall lift clears the noise floor (paired, significance not point estimates); extraction accuracy vs the S0 attribute gold clears its bar.
Size: ~3–4 days.
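To make the deterministic labeled-pattern route concrete, here is a small sketch of regex extraction with synonym collapse and candidate flagging. Only the "sq ft" → square_footage mapping comes from the text; the other labels, the function name extractLabeled, and the first-mention-wins rule are assumptions for illustration. Prose-only attributes such as lubrication would still go to the LLM fallback.

```typescript
// Hypothetical sketch: "<number> <unit label>" extraction with synonym collapse.
interface Extraction {
  attributes: Record<string, number>; // values for fields in the declared schema
  candidates: string[]; // recognised fields not in the schema (candidate flags)
}

// Label → canonical field. Only "sq ft" → square_footage is from the epic.
const LABEL_TO_FIELD = new Map<string, string>([
  ["sq ft", "square_footage"],
  ["square feet", "square_footage"],
  ["horsepower", "hp"],
  ["hp", "hp"],
  ["cfm", "cfm"],
  ["psig", "psi"],
  ["psi", "psi"],
]);

function extractLabeled(text: string, declared: ReadonlySet<string>): Extraction {
  // Longest labels first so "psig" is tried before "psi".
  const labels = Array.from(LABEL_TO_FIELD.keys()).sort((a, b) => b.length - a.length);
  const pattern = new RegExp(`(\\d[\\d,]*(?:\\.\\d+)?)\\s*(${labels.join("|")})\\b`, "gi");

  const attributes: Record<string, number> = {};
  const candidates: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const value = Number((match[1] ?? "").replace(/,/g, ""));
    const field = LABEL_TO_FIELD.get((match[2] ?? "").toLowerCase());
    if (field === undefined || Number.isNaN(value)) continue;
    if (declared.has(field)) {
      // Assumption: the first mention wins when a field appears twice.
      if (!(field in attributes)) attributes[field] = value;
    } else if (candidates.indexOf(field) === -1) {
      candidates.push(field);
    }
  }
  return { attributes, candidates };
}
```

For example, with a declared schema of hp, cfm, and psi, a sentence mentioning a square footage yields a square_footage candidate rather than a stored value, which matches the schema-gated behavior described for OD9.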

S5 — Filter-then-rank retrieval operator (the composable pre-filter — OD12). [old S6]

Deliverable: a fourth retrieval capability, implemented as a WHERE pre-stage on typed attributes (range / equality / set / date) that composes with the vector / FTS / recency rankers. It is not a standalone fourth operator (OD12): hard filter, then rank (OD4).
Mechanics: new code in retrieval/src/ alongside vector-search.ts / fts-search.ts / lookup-section.ts — hybrid_search is a type name only in types.ts (unimplemented); this fills it as a pre-filter, not a sibling. The filter prunes on S3's composite indexes; survivors rank by embedding distance or FTS rank; recency an optional boost. Omitting the filter = today's behavior (OD12). ✅ Recency model (resolved 6/15, M1): dev-memory's "newest wins, history visible" = the chunk-level date attribute (S4) + a recency rank-boost here — not supersession. Hard supersession stays document-level (superseded_at is on documents, not chunks — corrects OD7's basis; retrieval already honors it via includeSuperseded?). So S5 needs no new chunk-level temporal model.
Depends on: S3 (typed columns + indexes), S4 (data to filter).
Acceptance (measured): a structured-filter index added to the per-index scorecard; recall vs the S0 attribute-filter gold clears its bar. The concrete test — "10 ≤ hp ≤ 20 AND lubrication = oilless AND cfm ≥ 90, ranked by relevance" — returns correctly.
Size: ~2–3 days (recency is date-attribute + rank — no new temporal model).
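The filter-then-rank semantics can be illustrated in memory, independent of the SQL that will actually run against S3's indexes. This is a sketch under assumptions: the Predicate shape, the score field, and filterThenRank are hypothetical names, and the ranker is reduced to a precomputed relevance score.

```typescript
// Hypothetical in-memory sketch of "hard filter, then rank".
type AttrValue = number | string;

interface ScoredChunk {
  id: string;
  attributes: Record<string, AttrValue>;
  score: number; // stand-in for embedding similarity or FTS rank (higher = better)
}

type Predicate =
  | { field: string; op: "between"; min: number; max: number }
  | { field: string; op: "gte"; value: number }
  | { field: string; op: "eq"; value: AttrValue }
  | { field: string; op: "in"; values: AttrValue[] };

function matchesPredicate(chunk: ScoredChunk, p: Predicate): boolean {
  const v = chunk.attributes[p.field];
  // Hard filter: a chunk without the attribute never passes.
  if (v === undefined) return false;
  switch (p.op) {
    case "between":
      return typeof v === "number" && v >= p.min && v <= p.max;
    case "gte":
      return typeof v === "number" && v >= p.value;
    case "eq":
      return v === p.value;
    case "in":
      return p.values.includes(v);
  }
}

// An empty predicate list leaves ranking untouched (today's behavior).
function filterThenRank(
  chunks: ScoredChunk[],
  predicates: Predicate[],
  limit: number,
): ScoredChunk[] {
  return chunks
    .filter((chunk) => predicates.every((p) => matchesPredicate(chunk, p)))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

The concrete acceptance query maps to three predicates: between on hp (10 to 20), eq on lubrication ("oilless"), and gte on cfm (90). A high-scoring chunk that misses any one of them is excluded before ranking, which is the difference between a hard filter and a rank boost.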

S6 — Schema-curation UI: propose-and-curate bootstrap + promote-then-backfill (OD9/OD10). [new]

Deliverable: the human-in-the-loop schema lifecycle — bootstrap (LLM proposes a schema from the first docs; user accepts/renames/prunes), candidate review (promote a flagged attribute), promote-then-backfill (add field → create index → bounded single-attribute re-extract).
Mechanics (OD9/OD10 locked): propose-and-curate over S4's candidate flags; manual declaration also supported (system KBs like dev-memory declare directly). Promote routine = add to KB attributes_schema → create the S3 composite index → run a targeted single-attribute backfill (bounded; rides Batch S7). The backfill is a single-attribute re-extract and is self-contained — it does NOT need the deferred Incremental Re-Ingestion epic (that's for full doc re-upload / content-diff); stated so it isn't a hidden cross-epic dep.
Depends on: S3, S4 (and S7 for backfill economics).
Acceptance: bootstrap proposes a schema on a fresh KB; promote creates an index + backfills one attribute without full re-ingestion.
Size: ~3–5 days. 🔵 Blue-team trim-candidate: for Brehob go-live the KB schema can be declared manually (OD10 supports it) — the full propose-and-curate UI + promote-then-backfill is product-grade (north-star B-series) and deferrable post-go-live without blocking the Brehob path.

S7 — Batch ingestion. [old S4]

Deliverable: LLM-touched work routed through the Anthropic Message Batches API (~50% cheaper, ~2× faster — validated, never built).
Mechanics: BATCH_MULT = 0.5 exists at ingestion/pricing.ts:40; ZERO Batch calls in the codebase today. Three riders: Haiku vision extraction, Haiku prose grouping, keyword + attribute generation (S4). Carry-forward red-team targets (archived batch epic): poll-based progress UX (Batch has no webhooks), route LLM-routed-units only, structured-stays-sync.
Depends on: soft on S4 (the attribute/keyword rider); build parallel and wire.
Acceptance: the per-doc cost columns (migration 013) show the ~50% cut on a batch-routed ingest.
Size: ~2–3 days. 🔵 Trim-candidate ③ (program fallback list): defer post-go-live if item 1's early (curated-corpus-size × $/doc) read is modest.
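A rough sketch of the cost arithmetic behind the acceptance check, assuming the discount applies only to LLM-routed units while structured units stay synchronous. BATCH_MULT mirrors the constant in ingestion/pricing.ts; IngestUnit and estimateDocCostUsd are hypothetical names, and the dollar figures in the usage note are made up for illustration.

```typescript
// Hypothetical sketch: per-doc cost with and without Batch routing.
const BATCH_MULT = 0.5; // mirrors the existing constant in ingestion/pricing.ts

interface IngestUnit {
  route: "llm" | "structured"; // structured units stay synchronous
  syncCostUsd: number; // cost of the unit at normal (non-batch) pricing
}

// Only LLM-routed units get the batch multiplier.
function estimateDocCostUsd(units: IngestUnit[], useBatch: boolean): number {
  return units.reduce((total, unit) => {
    const discounted = useBatch && unit.route === "llm";
    return total + (discounted ? unit.syncCostUsd * BATCH_MULT : unit.syncCostUsd);
  }, 0);
}
```

With illustrative units of $0.08 (LLM), $0.04 (LLM), and $0.00 (structured), the sync estimate is $0.12 and the batch estimate is $0.06. A doc that is mostly structured will show a smaller cut than 50%, which is worth keeping in mind when reading the per-doc cost columns.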

S8 — Failure surfacing. [old S5]

Deliverable: a failed conversion/ingest is visible to the uploader — per-doc failure states in the UI, DLQ items traceable to a document + reason, continue-on-error preserved.
Mechanics: Brehob's legacy corpus guarantees conversion failures; silent loss is the worst outcome. DLQs were purged to a clean baseline 6/10 — every future item is a real signal. Surface per-doc state + reason; keep batch ingest continue-on-error.
Depends on: S1 (conversion failure modes), the ingest pipeline + UI.
Acceptance: one deliberately-poisoned doc fails visibly in the UI with a traceable reason; the rest of the batch completes.
Size: ~1–2 days.
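Continue-on-error with per-doc state can be sketched as follows. The names (DocState, ingestBatch) are hypothetical, and the real pipeline is presumably asynchronous and queue-driven; this synchronous version only shows the shape of the result the UI would render.

```typescript
// Hypothetical sketch: one failed doc is recorded with a reason; the batch continues.
type DocState =
  | { docId: string; state: "ingested" }
  | { docId: string; state: "failed"; reason: string };

function ingestBatch(
  docIds: string[],
  ingestOne: (docId: string) => void, // stand-in for the real per-doc pipeline
): DocState[] {
  return docIds.map((docId): DocState => {
    try {
      ingestOne(docId);
      return { docId, state: "ingested" };
    } catch (err) {
      // Keep the reason next to the document so a DLQ item stays traceable.
      const reason = err instanceof Error ? err.message : String(err);
      return { docId, state: "failed", reason };
    }
  });
}
```

The acceptance test corresponds to passing three doc IDs where the middle one throws: the result should contain two "ingested" entries and one "failed" entry with its reason, in the original order.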

Dependencies & execution waves

Topo order from the depends-on edges (story numbers are dependency-ordered, so they mostly flow forward):

Wave 1 (no upstream): S0 (gold), S1 (conversion), S3 (schema substrate) — plus an early S2 table-chunking spike (thin excel.ts port + grid-parse on the Slate Trucks sheet) to test the row-granular deterministic-extraction premise before S3/S4 commit (red-team 6/15). The spike feeds the S0–S2 checkpoint with real data. S7/S8 skeletons can also start.
Wave 2: S2 (needs S1) — the full table-aware chunker, building on what the wave-1 spike learned.
Wave 3: S4 (needs S2 + S3).
Wave 4: S5 (needs S3 + S4); S6 (needs S3 + S4 + S7).
Threaded throughout: S7 (parallel; wires into S4), S8 (parallel; wires into S1).
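The waves above can be cross-checked mechanically from the depends-on edges. The sketch below uses only hard edges: S7's soft dependency on S4 is left out, and S8 is given its S1 edge. DEPENDS_ON and computeWaves are hypothetical names.

```typescript
// Hypothetical sketch: derive execution waves from hard depends-on edges.
const DEPENDS_ON: Record<string, string[]> = {
  S0: [],
  S1: [],
  S2: ["S1"],
  S3: [],
  S4: ["S2", "S3"],
  S5: ["S3", "S4"],
  S6: ["S3", "S4", "S7"],
  S7: [], // soft dependency on S4 intentionally omitted
  S8: ["S1"],
};

// Each wave holds the stories whose dependencies were all completed in
// earlier waves (a level-by-level topological sort).
function computeWaves(deps: Record<string, string[]>): string[][] {
  const done = new Set<string>();
  const waves: string[][] = [];
  let remaining = Object.keys(deps);
  while (remaining.length > 0) {
    const ready = remaining.filter((story) =>
      (deps[story] ?? []).every((dep) => done.has(dep)),
    );
    if (ready.length === 0) {
      throw new Error("cycle or missing dependency");
    }
    ready.forEach((story) => done.add(story));
    remaining = remaining.filter((story) => !done.has(story));
    waves.push(ready);
  }
  return waves;
}
```

On these edges the earliest-possible waves come out as S0/S1/S3/S7, then S2/S8, then S4, then S5/S6. That agrees with the listing for S0 through S6; S7 and S8 land in waves 1 and 2 here only because the sketch places each story at its earliest slot, whereas the plan threads them throughout.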

Critical-path spine: S1 → S2 → S4 → S5 (convert → tables → extract → operate) is the longest chain and the go-live-critical capability; S3 runs parallel but is required by S4/S5; S0 gates acceptance up front. The S0–S2 checkpoint (S0 bars + the S1/S2 read on conversion, table fidelity, and $/doc, seeded by the wave-1 spike) is the go/no-go + trim-decision point.

Sizing & trims

Story	Size (d)	Wave	Trim?
S0 gold + bars	~1	1	—
S1 conversion	2–4	1	—
S2 routing + tables	2–3	2 (spike in 1)	—
S3 schema substrate	2–3	1	—
S4 extraction	3–4	3	—
S5 filter-then-rank	2–3	4	—
S6 curation UI	3–5	4	🔵 manual-declare for Brehob → defer UI
S7 batch	2–3	threaded	🔵 fallback ③ → defer if $/doc modest
S8 failure surfacing	1–2	threaded	—

Estimates, to be pressure-tested at the S0–S2 checkpoint — they are not commitments. Full S0–S8 ≈ 18–28 focused days ≈ 4–6 eng-weeks. With the candidate trims (S6 → manual declaration, S7 → fallback ③, promote-then-backfill → post-go-live), the go-live-critical core ≈ 12–18 days ≈ 2.5–3.5 eng-weeks. Window: must finish before M3 (Jul 20–31) on solo bandwidth (+ weekly Brehob meetings + legal track + beta keep-alive) — tight. Trim policy (decided 6/15): reactive, not pre-committed — the trims above are candidates; the cut decision is made at the S0–S2 checkpoint on real sizing data (the wave-1 spike + S0–S2 measurements), not on these estimates.

Acceptance

The eval slice re-ingested through the production path end-to-end: legacy formats convert server-side (S1), regime routing correct + tables row-granular (S2), per-chunk keyword metadata + typed attributes present (S3/S4), the filter-then-rank operator answers attribute queries (S5), and one deliberately-poisoned doc fails visibly in the UI (S8). Batch (if in scope) shows the ~50% cost cut in the per-doc cost columns (S7). Ships behind the harness gate — an Eval run gates the merge.

Pass bars (LOCKED 2026-06-15, Dan)

Framed as trust thresholds (QuoteAI has a human approval step), not correctness guarantees.

Dimension	Bar
Extraction — specs (hp/cfm/psi, cell route)	≥ 98%
Extraction — prose attrs (lubrication, dates, LLM route)	≥ 90%
Vector recall@10 (semantic Qs)	≥ 0.90
FTS recall@5 (model/part-number Qs)	≥ 0.85
Structured-filter recall (attribute Qs)	≥ 0.95
Hybrid vs vector lift	positive + clears the noise floor (significance, not point estimate)
One-time $/doc (ingest)	≤ $0.25
Recurring $/query (retrieval)	≤ $0.05
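One way the harness could encode the numeric bars as a merge gate is sketched below. The thresholds are the locked values from the table; the metric keys, PASS_BARS, and failingBars are hypothetical names. The hybrid-lift bar is left out because it is a significance test, not a threshold.

```typescript
// Hypothetical sketch: locked pass bars as data + a gate check.
interface Bar {
  metric: string; // assumed scorecard key
  kind: "min" | "max"; // min: value must be >= threshold; max: value must be <= threshold
  threshold: number;
}

const PASS_BARS: Bar[] = [
  { metric: "extraction_specs", kind: "min", threshold: 0.98 },
  { metric: "extraction_prose", kind: "min", threshold: 0.9 },
  { metric: "vector_recall_at_10", kind: "min", threshold: 0.9 },
  { metric: "fts_recall_at_5", kind: "min", threshold: 0.85 },
  { metric: "structured_filter_recall", kind: "min", threshold: 0.95 },
  { metric: "usd_per_doc", kind: "max", threshold: 0.25 },
  { metric: "usd_per_query", kind: "max", threshold: 0.05 },
];

// Returns the metrics that miss their bar. A metric absent from the
// scorecard counts as failing, so an unmeasured dimension cannot pass silently.
function failingBars(scorecard: Record<string, number>): string[] {
  return PASS_BARS.filter((bar) => {
    const value = scorecard[bar.metric];
    if (value === undefined || Number.isNaN(value)) return true;
    return bar.kind === "min" ? value < bar.threshold : value > bar.threshold;
  }).map((bar) => bar.metric);
}
```

An empty result from failingBars would be the condition for the harness gate to let a merge through, alongside the separate significance check on hybrid lift.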

Eval phasing (same corpus, escalating questions)

Phase 1 — Slate Trucks alone: proves extraction accuracy + operator mechanics. Recall here is a sanity check only — an n=1 small-corpus artifact, not a real grade (one candidate can't be discriminated).
Phase 2 — + 3–5 distractor quotes (blower-purge dryers, the Hankison fridge-dryer ladder, ideally a 2nd compressor at a different HP/lubrication): proves retrieval discrimination; the recall + filter bars become the real grade. ← this epic's true acceptance.
Phase 3 — end-to-end build + A/B (hold out the real Slate quote, build it from precedent quotes, judge generated-vs-real on specs / line-items / pricing / phrasing): the QuoteAI vertical's (roadmap item 4) acceptance, not S0's. Seeded by the Slate Trucks pair.

Regression (existing corpus, pre/post S4)

Re-run the full existing scorecard: vector + lookup stable (we add columns, don't re-chunk), FTS holds-or-lifts on prose (the keyword metadata is generic). No typed-attribute gold on novels — attributes are schema-gated (OD9); a schemaless KB stores none (the LLM may still propose candidates, but nothing typed is stored/filterable without a declared schema). Optional anti-overfit: declare a minimal date / section_id schema on one reg doc (FIA / Constitution) to validate the typed path on non-quote content.
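For the pre/post comparison, "significance over point estimates" could be read as a paired bootstrap over per-query deltas. The sketch below is one such reading, not the harness's actual method: the 95% percentile interval, the iteration count, and the names (pairedBootstrapCi, seededRandom) are assumptions. A seeded generator keeps runs reproducible.

```typescript
// Hypothetical sketch: paired bootstrap CI on per-query recall deltas (post - pre).

// Small seeded PRNG (mulberry32-style) so results are reproducible.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface DeltaInterval {
  meanDelta: number;
  lower: number; // 2.5th percentile of resampled mean deltas
  upper: number; // 97.5th percentile
}

function pairedBootstrapCi(
  pre: number[],
  post: number[],
  iterations = 2000,
  seed = 1,
): DeltaInterval {
  if (pre.length !== post.length || pre.length === 0) {
    throw new Error("pre and post must be non-empty and the same length");
  }
  // Pairing: each query is compared with itself, before vs after.
  const deltas = post.map((value, i) => value - (pre[i] ?? 0));
  const n = deltas.length;
  const rand = seededRandom(seed);

  const means: number[] = [];
  for (let b = 0; b < iterations; b++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      sum += deltas[Math.floor(rand() * n)] ?? 0; // resample queries with replacement
    }
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const at = (q: number): number =>
    means[Math.min(means.length - 1, Math.floor(q * means.length))] ?? 0;

  const meanDelta = deltas.reduce((s, d) => s + d, 0) / n;
  return { meanDelta, lower: at(0.025), upper: at(0.975) };
}
```

Under this reading, a lift "clears the noise floor" when lower is above zero, and "stable" means the interval straddles zero without a meaningfully negative lower bound. With small query sets the interval will be wide, which is the point of not trusting the point estimate.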

Gold seed (Slate Trucks, extracted 2026-06-15)

Chorus 90: hp 1000, cfm 3840, psi 125, lubrication oilless (prose-only — the built-in LLM-path test), water-cooled, 6 units, $473,498/ea. HPB4300 Hankison dryer: cfm 4300, 6 units. Spreadsheet (cell route) + proposal (datasheet + prose) corroborate — one case exercises both extraction routes and cross-checks them. Slug brehob-proposals__slate-trucks.

References

Architecture (do not re-litigate): sub-systems/ingestion-pipeline — OD1 (JSONB + partial-indexes), OD4 (filter-then-rank), OD9/OD10 (schema lifecycle), OD12 (composable pre-filter).
Code (verified 6/15): retrieval operators retrieval/src/{vector-search,fts-search,lookup-section}.ts (+ types.ts hybrid_search stub); routing ingestion/extractor/route.ts; LLM call ingestion/extractor/sdk-client.ts (invokeSdkExtraction); chunks db/migrations/001_init.sql; KB db/migrations/002_kb_primitive.sql (superseded_at is here, document-level); cost db/migrations/013_ingest_cost.sql + 014_query_cost.sql; ingestion/pricing.ts (BATCH_MULT); Excel parser quoteai/ingestion/parsers/excel.ts.
Decisions: DB-3 (two paths, three regimes), D62/D63 (FTS=0 finding), chunking-search verdict (Haiku grouping kept). Eval discipline: eval-pipeline skill; significance-over-point-estimates.
Build-kickoff hygiene: run a precise deployed-vs-main diff on the ingestion path before S1 (deployed can lag main).
Pull back archive/epics/batch-ingestion (into S7) and archive/sub-systems/{eval-corpus-and-doe, pipeline-eval-harness, unified-chunking-markdown}.
