Epic: Ingestion Foundation
Roadmap item 1 of brehob-launch · the shared substrate everything else consumes · absorbs the former Gate-0 spike — the eval harness is the acceptance gate on this build, not a throwaway before it · target: DONE before M3 (Jul 20–31, "Document ingestion & KB prep"), built on the pre-kickoff runway (now → Jul 6) + kickoff week · → North Star B1.
Refined to story-level detail 2026-06-15. Architecture is locked in sub-systems/ingestion-pipeline (OD1–OD12) — this epic is the work breakdown, not the design; do not re-litigate the ODs here. Code references verified against the repo 2026-06-15.
Objective
Make Autri's ingestion pipeline handle the real world: legacy formats arrive, get converted server-side, route down the right path (structured skips normalization; unstructured normalizes to markdown — DB-3), and produce retrievable chunks carrying both lexical anchors and typed attributes — then expose a filter-then-rank operator over those attributes (Autri's one missing retrieval primitive, serving Brehob spec-match and dev-memory recency). Fail loudly and visibly, and do all LLM-touched work at Batch economics.
This epic absorbs the former Gate-0 spike: the eval harness is the acceptance gate on the build. Pass bars are authored before any code (S0). The early go/no-go read that Gate-0a was meant to give (conversion approach, table fidelity, regime mix, $/doc) becomes the first measured checkpoint after S1–S2. On a "go", the production code is kept; on a no-go, the same Plan B applies (conversion-approach rework + heavier curation + spend the +3-week extension; the signed deal is not at risk).
Work breakdown
Nine stories, dependency-ordered. Each carries: deliverable / mechanics (code-grounded) / depends-on / measured acceptance / rough size. The architecture (storage shape, operator surface, extraction routing, schema lifecycle) is settled in the sub-system doc — cited as OD#, not re-opened. Old→new ID map from the 6/15 skeleton: old S3 split into S3 (schema substrate) + S4 (extraction); old S4 batch → S7; old S5 failure → S8; old S6 operator → S5; S6 (curation UI) is new.
S0 — Eval gold + pass bars (measure-first gate).
- Deliverable: a versioned gold set + written numeric pass bars, committed before any build. Three parts: (a) query gold — quote-scenario queries in per-index forms (NL→vector, keyword→FTS, attribute-filter→the new operator) with labeled relevant chunks; (b) attribute-extraction gold — hand-labeled correct typed values per doc (hp/cfm/psi/lubrication/date); (c) pass bars — extraction accuracy ≥ X%, recall@k ≥ Y per index/question-type, one-time $/doc ≤ Z, recurring $/query ≤ W.
- Mechanics: authored under the eval-pipeline skill discipline (per-index *.queries.json, baseline-first, significance over point estimates). Dan authors the query gold (~half a day: what a salesperson actually asks); AI authors the per-index forms + labels the attribute gold. Seeded from the Slate Trucks pair (recent + complete: spreadsheet + proposal). Harness is local-only (DATABASE_URL from .env.local), runs the real retrieval code.
- Depends on: nothing (Dan-authored). Gates the acceptance of S2/S4/S5.
- Acceptance: gold committed; pass bars are numbers, written down, agreed by Dan before S1 code lands.
- Size: ~1 day (Dan ~0.5 + AI ~0.5–1).
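
The recall@k bars in (c) can be made concrete with a small scoring helper. The sketch below is illustrative only: the `GoldQuery` shape and the function names are assumptions, not the eval-pipeline harness's actual types.

```typescript
// Hypothetical shape: one gold query with its labeled relevant chunk ids.
interface GoldQuery {
  id: string;
  relevantChunkIds: string[];
}

// recall@k for one query: fraction of labeled-relevant chunks in the top k results.
function recallAtK(ranked: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0;
  const topK = ranked.slice(0, k);
  let hits = 0;
  for (const id of relevant) {
    if (topK.indexOf(id) !== -1) hits += 1;
  }
  return hits / relevant.length;
}

// Mean recall@k over a gold set, given each query's ranked result ids.
function meanRecallAtK(
  gold: GoldQuery[],
  rankedByQuery: Record<string, string[]>,
  k: number,
): number {
  if (gold.length === 0) return 0;
  let sum = 0;
  for (const q of gold) {
    sum += recallAtK(rankedByQuery[q.id] ?? [], q.relevantChunkIds, k);
  }
  return sum / gold.length;
}
```

A point estimate like this is only the input to the bar; the discipline above still asks for a baseline run and a significance check before reading any lift.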
S1 — Conversion stage, productionized.
- Deliverable: server-side conversion of legacy formats → markdown, mounted in the ingestion path. .xlsx/.xls via the ported SheetJS parser; .docx via mammoth; .doc/.rtf via the chosen server-side converter; unknown extension → typed "unsupported" failure, not a crash.
- Mechanics: port quoteai/ingestion/parsers/excel.ts (parseExcel(filePath, opts) → ParsedDoc; SheetJS; .xlsx + .xls; sheetToMarkdown; caps MAX_ROWS=500 / MAX_COLS=30) into Autri's server-side worker; it is pure JS and ports cleanly. The .doc/.rtf path is the real infra decision (former Gate-0a ①): container-image Lambda vs Fargate task running LibreOffice. LibreOffice has known Lambda traps (read-only filesystem / /tmp-only, layer size); the parity + cost comparison decides. (QuoteAI's textutil macOS path cannot run server-side.)
- Depends on: nothing. Acceptance scored vs S0's conversion-fidelity rubric.
- Acceptance: the Slate Trucks spreadsheet + a real … - Final.doc convert at the S0 conversion-fidelity bar (rows/headings preserved, junk rate under bar). ⚠️ Verify the Slate Trucks sheet against the 500-row cap: a master price list could exceed it and silently truncate line-items.
- Size: ~2–4 days (the container/Fargate infra is the meat).
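
The "typed unsupported failure, not a crash" rule and the row-cap warning can be sketched as follows. The converter labels, the `RouteResult` type, and both function names are hypothetical; only the extensions and the 500-row cap come from the text above.

```typescript
// Hypothetical converter labels for the three conversion paths described above.
type Converter = "sheetjs" | "mammoth" | "office-converter";

type RouteResult =
  | { ok: true; converter: Converter; ext: string }
  | { ok: false; kind: "unsupported"; ext: string; reason: string };

const CONVERTER_BY_EXT: Record<string, Converter> = {
  ".xlsx": "sheetjs",
  ".xls": "sheetjs",
  ".docx": "mammoth",
  ".doc": "office-converter",
  ".rtf": "office-converter",
};

// Pick a converter by file extension; unknown extensions return a typed failure
// value instead of throwing, so the caller can surface it per document.
function routeConversion(filename: string): RouteResult {
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot).toLowerCase() : "";
  if (!Object.prototype.hasOwnProperty.call(CONVERTER_BY_EXT, ext)) {
    return { ok: false, kind: "unsupported", ext, reason: `no converter for "${ext}"` };
  }
  return { ok: true, converter: CONVERTER_BY_EXT[ext] as Converter, ext };
}

const MAX_ROWS = 500; // row cap quoted from the excel.ts parser

// Number of rows that would be dropped by the cap; 0 means no truncation.
function truncatedRowCount(sheetRowCount: number): number {
  return Math.max(0, sheetRowCount - MAX_ROWS);
}
```

Returning a non-zero `truncatedRowCount` as a visible warning, rather than truncating quietly, is one way to address the ⚠️ above; whether the port raises the cap instead is an open choice.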
S2 — Two-path routing + table→row-chunk→cell-attribute.
- Deliverable: converted input routes correctly (headingless prose-dumps → Haiku grouping, not bad-deterministic; pipe-tables → row-granular chunks carrying cell positions). This adds the table-aware chunking the chunker lacks today.
- Mechanics: routing lives in ingestion/extractor/route.ts (STRUCTURED markdown → deterministic-chunk.ts; PROSE + low-coverage STRUCTURED → LLM). Tables today land wholesale as chunk_type:'table' single rows (extractor.ts ~595–613), with no cell parsing. Add table-grid handling: parse the pipe-table header → columns; each row → its own chunk (retrievable as "the 45hp Powerex line item + its price"), carrying cell values keyed by column for S4's deterministic cell/grid extraction. Display stays whole-table (retrieval ≠ display granularity). Verify on S1's real converted output (uglier than authored markdown).
- Depends on: S1. ✅ De-risked early (6/15): an early wave-1 spike (thin excel.ts port + grid-parse on the Slate Trucks sheet) tests the row-granular grid-parse before S3/S4 commit; see Dependencies & execution waves.
- Acceptance: a Slate Trucks pricing table produces row-level chunks; the former Gate-0a table-fidelity question (②) clears its bar on the scorecard.
- Size: ~2–3 days. Higher risk than its size implies — S4 deterministic extraction's strongest (free, exact) path leans on this; if it slips, extraction falls back to the LLM (cost + accuracy hit). The early spike is the mitigation.
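
A minimal sketch of the table-grid handling described in the mechanics: header → columns, each data row → its own chunk with cells keyed by column. The `RowChunk` shape and function names are assumptions, and the parser is deliberately naive (it does not handle escaped pipes or ragged converter output, which the real S1 output may contain).

```typescript
// Hypothetical row-chunk shape: one retrievable unit per table row.
interface RowChunk {
  rowIndex: number;               // 0-based position among data rows
  cells: Record<string, string>;  // cell text keyed by column header
  text: string;                   // flattened "column: value" text for retrieval
}

// Split one pipe-table line into trimmed cell strings.
function splitRow(line: string): string[] {
  let s = line.trim();
  if (s.startsWith("|")) s = s.slice(1);
  if (s.endsWith("|")) s = s.slice(0, -1);
  return s.split("|").map((c) => c.trim());
}

// True for the markdown separator row, e.g. |---|:---:|
function isSeparatorRow(cells: string[]): boolean {
  return cells.length > 0 && cells.every((c) => /^:?-+:?$/.test(c));
}

// Parse a pipe table into row-granular chunks; the first table line is the header.
function tableToRowChunks(markdown: string): RowChunk[] {
  const lines = markdown.split("\n").filter((l) => l.trim().startsWith("|"));
  if (lines.length < 2) return [];
  const header = splitRow(lines[0] ?? "");
  const chunks: RowChunk[] = [];
  for (let i = 1; i < lines.length; i++) {
    const cells = splitRow(lines[i] ?? "");
    if (isSeparatorRow(cells)) continue;
    const byColumn: Record<string, string> = {};
    header.forEach((col, j) => {
      byColumn[col] = cells[j] ?? ""; // short rows get empty cells
    });
    const text = header.map((col) => `${col}: ${byColumn[col]}`).join("; ");
    chunks.push({ rowIndex: chunks.length, cells: byColumn, text });
  }
  return chunks;
}
```

Keeping `cells` alongside `text` is what lets S4 read a typed value straight from a column without an LLM call, while the whole table can still be shown for display.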
S3 — Per-KB attribute schema + JSONB storage + typed indexes (the OD1 substrate). [new — split from old S3]
- Deliverable: the storage/indexing substrate the attribute capability stands on — a per-KB declared attribute schema, JSONB attribute storage on chunks, typed indexes per declared field, write-time validation.
- Mechanics (OD1/OD2 locked): add attributes_schema JSONB to knowledge_bases (today 002_kb_primitive.sql has only name/slug/description; the natural home). Add an attributes JSONB column to chunks (today 001_init.sql has no typed or keyword columns). On declaration/promote, create a typed composite expression-index per distinct declared field-name, keyed on (knowledge_base_id, expr), e.g. CREATE INDEX CONCURRENTLY … ON chunks (knowledge_base_id, ((attributes->>'hp')::numeric)) (red-team 6/15: composite-per-field-name over per-KB-partial, for tenant scaling; see the note under Size). Write-time validation against the declared schema.
- Depends on: nothing structurally; build parallel to S1/S2. Precondition for S4 (storage target) and S5 (filter target).
- Acceptance: a declared schema persists; a chunk written with a typed attribute validates + lands in JSONB; EXPLAIN shows a WHERE plan using the composite index.
- Size: ~2–3 days. ✅ Red-team resolved (6/15): per-field index DDL is owned by the runtime promote/declare routine (CREATE INDEX CONCURRENTLY IF NOT EXISTS, idempotent, online), not the migration system, which owns only the static attributes / attributes_schema columns (forced by OD1's runtime-declared schemas). Index shape = composite (knowledge_base_id, expr) per distinct field-name, not one partial index per KB×field; this bounds index count by distinct field-names, not tenants × fields. Refines OD1's literal partial-index form; same intent (typed, KB-scoped filtering).
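
A sketch of the two runtime pieces named above: building the per-field composite index DDL and validating attributes at write time. The index naming scheme, the `AttributesSchema` shape, and both function names are assumptions. Date fields are left out of the sketch: a plain text-to-date cast is not immutable in Postgres and so cannot sit in an index expression directly, and how the real routine handles dates is not stated here.

```typescript
// Hypothetical declared-schema shape: field name -> type (date omitted, see lead-in).
type AttrType = "numeric" | "text";
type AttributesSchema = Record<string, AttrType>;

// Field names are interpolated into DDL, so restrict them to a safe identifier form.
const SAFE_FIELD = /^[a-z][a-z0-9_]*$/;

// Build the idempotent, online DDL for one declared field.
// The index name chunks_attr_<field>_idx is an assumed convention.
function buildFieldIndexDdl(field: string, type: AttrType): string {
  if (!SAFE_FIELD.test(field)) throw new Error(`unsafe field name: ${field}`);
  const expr =
    type === "numeric"
      ? `((attributes->>'${field}')::numeric)`
      : `(attributes->>'${field}')`;
  return (
    `CREATE INDEX CONCURRENTLY IF NOT EXISTS chunks_attr_${field}_idx ` +
    `ON chunks (knowledge_base_id, ${expr})`
  );
}

// Write-time validation: every attribute must be declared and match its type.
// Returns a list of human-readable errors; an empty list means valid.
function validateAttributes(
  schema: AttributesSchema,
  attrs: Record<string, unknown>,
): string[] {
  const errors: string[] = [];
  for (const key of Object.keys(attrs)) {
    if (!Object.prototype.hasOwnProperty.call(schema, key)) {
      errors.push(`undeclared attribute: ${key}`);
      continue;
    }
    const value = attrs[key];
    if (schema[key] === "numeric") {
      if (typeof value !== "number" || !Number.isFinite(value)) {
        errors.push(`${key} must be a finite number`);
      }
    } else if (typeof value !== "string") {
      errors.push(`${key} must be a string`);
    }
  }
  return errors;
}
```

Because the index name depends only on the field, two KBs declaring `hp` share one index, which is the "bounded by distinct field-names" property from the red-team note.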
S4 — Extraction stage: per-chunk keyword metadata + typed-attribute extraction (deterministic-first, LLM-fallback). [old S3, minus storage]
- Deliverable: one ingestion stage emitting, per chunk, (a) keyword/lexical metadata (fixes prose FTS=0) and (b) typed attributes targeting the declared schema. Keyword + attribute extraction are one stage.
- Mechanics:
  - Keyword side: generate salient lexical anchors per chunk; store in a keyword column folded into the FTS query. Note: FTS today is to_tsvector computed at query time (functional, not a stored column), so this = add the keyword column + a generated tsvector / expression index + widen retrieval/src/fts-search.ts and the lookup path to query it. Biggest retrieval lever: prose chunks otherwise return FTS=0 → hybrid collapses to vector-only on exactly the content that dominates Brehob + dev-memory.
  - Attribute side (OD5/OD6, deterministic-first / LLM-fallback): deterministic family on machine-regular sources: cell-coordinate (S2's row chunks: column→attribute), table-grid parse, labeled-pattern regex; LLM-fallback on variable prose, piggybacked on the existing Haiku call in ingestion/extractor/sdk-client.ts (invokeSdkExtraction → anthropic().messages.create ~line 136; Haiku 4.5; versions chunk-grouping-v3 / -prose-v1), with no second pass. Extraction targets the declared schema (synonyms collapse: "sq ft" → square_footage); genuinely-new concepts → candidate flags (OD9), consumed by S6. The temporal attribute (date) is what S5's recency rank-boost keys on (M1).
- Depends on: S2 (row-granular tables for cell/grid), S3 (schema to target + typed storage).
- Acceptance (measured): re-run the scorecard pre/post on unstructured docs — hybrid-recall lift clears the noise floor (paired, significance not point estimates); extraction accuracy vs the S0 attribute gold clears its bar.
- Size: ~3–4 days.
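
The deterministic "labeled-pattern regex" route, with synonym collapse and candidate flagging, might look roughly like this. The synonym table, the `Label: value` line format, and all names are illustrative assumptions; only the "sq ft" → square_footage mapping and the declared-vs-candidate split come from the text above.

```typescript
// Hypothetical synonym table: surface label (lowercased) -> declared field name.
const SYNONYMS: Record<string, string> = {
  "sq ft": "square_footage",
  "square feet": "square_footage",
  "horsepower": "hp",
};

interface Extracted {
  attributes: Record<string, number>; // values for fields in the declared schema
  candidates: string[];               // labels seen but not declared (candidate flags)
}

// Scan "Label: <number>" lines; declared fields become attributes,
// undeclared labels become candidate flags for later curation.
function extractLabeled(text: string, declared: string[]): Extracted {
  const attributes: Record<string, number> = {};
  const candidates: string[] = [];
  const pattern = /^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(-?\d[\d,]*(?:\.\d+)?)/;
  for (const line of text.split("\n")) {
    const m = pattern.exec(line);
    if (m === null) continue;
    const label = (m[1] ?? "").toLowerCase();
    const mapped = Object.prototype.hasOwnProperty.call(SYNONYMS, label)
      ? SYNONYMS[label]
      : undefined;
    const canonical = mapped ?? label.replace(/ /g, "_");
    const value = Number((m[2] ?? "").replace(/,/g, "")); // strip thousands separators
    if (declared.indexOf(canonical) !== -1) {
      attributes[canonical] = value;
    } else if (candidates.indexOf(canonical) === -1) {
      candidates.push(canonical);
    }
  }
  return { attributes, candidates };
}
```

Lines the pattern cannot read (free prose, non-numeric values) simply fall through, which is where the LLM-fallback path would take over.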
S5 — Filter-then-rank retrieval operator (the composable pre-filter — OD12). [old S6]
- Deliverable: the fourth retrieval mode: a WHERE pre-stage on typed attributes (range / equality / set / date) that composes with the vector / FTS / recency rankers. Not a standalone 4th operator (OD12); hard filter then rank (OD4).
- Mechanics: new code in retrieval/src/ alongside vector-search.ts / fts-search.ts / lookup-section.ts. hybrid_search is a type name only in types.ts (unimplemented); this fills it as a pre-filter, not a sibling. The filter prunes on S3's composite indexes; survivors rank by embedding distance or FTS rank; recency is an optional boost. Omitting the filter = today's behavior (OD12). ✅ Recency model (resolved 6/15, M1): dev-memory's "newest wins, history visible" = the chunk-level date attribute (S4) + a recency rank-boost here, not supersession. Hard supersession stays document-level (superseded_at is on documents, not chunks, which corrects OD7's basis; retrieval already honors it via includeSuperseded?). So S5 needs no new chunk-level temporal model.
- Depends on: S3 (typed columns + indexes), S4 (data to filter).
- Acceptance (measured): a structured-filter index added to the per-index scorecard; recall vs the S0 attribute-filter gold clears its bar. The concrete test, "10 ≤ hp ≤ 20 AND lubrication = oilless AND cfm ≥ 90, ranked by relevance", returns correctly.
- Size: ~2–3 days (recency is date-attribute + rank; no new temporal model).
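
The filter-then-rank semantics can be illustrated in memory. This is a sketch of the behavior only, not the SQL the operator would issue: the `Chunk` and `Predicate` shapes are assumptions, `score` stands in for whichever ranker (vector or FTS) runs after the filter, and the optional recency boost is omitted.

```typescript
// Hypothetical chunk shape: typed attributes plus a relevance score from a ranker.
interface Chunk {
  id: string;
  attributes: Record<string, number | string>;
  score: number;
}

type Predicate =
  | { field: string; op: "range"; min?: number; max?: number }
  | { field: string; op: "eq"; value: string | number }
  | { field: string; op: "in"; values: Array<string | number> };

// A chunk missing the attribute never matches: the filter is hard, not a soft boost.
function matchesPredicate(chunk: Chunk, p: Predicate): boolean {
  const v = chunk.attributes[p.field];
  if (v === undefined) return false;
  if (p.op === "range") {
    if (typeof v !== "number") return false;
    return (p.min === undefined || v >= p.min) && (p.max === undefined || v <= p.max);
  }
  if (p.op === "eq") return v === p.value;
  return p.values.indexOf(v) !== -1;
}

// Hard filter first, then rank survivors by score (descending).
// An empty predicate list ranks everything, i.e. the no-filter behavior.
function filterThenRank(chunks: Chunk[], predicates: Predicate[], limit: number): Chunk[] {
  const survivors = chunks.filter((c) => predicates.every((p) => matchesPredicate(c, p)));
  return survivors.slice().sort((a, b) => b.score - a.score).slice(0, limit);
}
```

For the acceptance query, the predicates would be a range on hp (10 to 20), an equality on lubrication, and a range on cfm with only a minimum of 90.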
S6 — Schema-curation UI: propose-and-curate bootstrap + promote-then-backfill (OD9/OD10). [new]
- Deliverable: the human-in-the-loop schema lifecycle — bootstrap (LLM proposes a schema from the first docs; user accepts/renames/prunes), candidate review (promote a flagged attribute), promote-then-backfill (add field → create index → bounded single-attribute re-extract).
- Mechanics (OD9/OD10 locked): propose-and-curate over S4's candidate flags; manual declaration also supported (system KBs like dev-memory declare directly). Promote routine = add to KB attributes_schema → create the S3 composite index → run a targeted single-attribute backfill (bounded; rides Batch S7). The backfill is a single-attribute re-extract and is self-contained: it does NOT need the deferred Incremental Re-Ingestion epic (that's for full doc re-upload / content-diff); stated so it isn't a hidden cross-epic dep.
- Depends on: S3, S4 (and S7 for backfill economics).
- Acceptance: bootstrap proposes a schema on a fresh KB; promote creates an index + backfills one attribute without full re-ingestion.
- Size: ~3–5 days. 🔵 Blue-team trim-candidate: for Brehob go-live the KB schema can be declared manually (OD10 supports it) — the full propose-and-curate UI + promote-then-backfill is product-grade (north-star B-series) and deferrable post-go-live without blocking the Brehob path.
S7 — Batch ingestion. [old S4]
- Deliverable: LLM-touched work routed through the Anthropic Message Batches API (~50% cheaper, ~2× faster — validated, never built).
- Mechanics: BATCH_MULT = 0.5 exists at ingestion/pricing.ts:40; ZERO Batch calls in the codebase today. Three riders: Haiku vision extraction, Haiku prose grouping, keyword + attribute generation (S4). Carry-forward red-team targets (archived batch epic): poll-based progress UX (Batch has no webhooks), route LLM-routed-units only, structured-stays-sync.
- Depends on: soft on S4 (the attribute/keyword rider); build parallel and wire.
- Acceptance: the per-doc cost columns (migration 013) show the ~50% cut on a batch-routed ingest.
- Size: ~2–3 days. 🔵 Trim-candidate ③ (program fallback list): defer post-go-live if item 1's early (curated-corpus-size × $/doc) read is modest.
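
The "route LLM-routed units only, structured stays sync" rule and the expected cost effect can be sketched as a planning step. The `WorkUnit` shape, `planBatch`, and the dollar figures in the usage are hypothetical; only the 0.5 multiplier comes from the text above, and no Batch API call is shown.

```typescript
const BATCH_MULT = 0.5; // multiplier value quoted from ingestion/pricing.ts

// Hypothetical unit of ingestion work with its estimated synchronous LLM cost.
interface WorkUnit {
  id: string;
  route: "llm" | "structured";
  syncCostUsd: number;
}

interface BatchPlan {
  batch: WorkUnit[];        // LLM-routed units, submitted via batch
  sync: WorkUnit[];         // structured units, processed synchronously as today
  syncOnlyCostUsd: number;  // estimated cost if nothing were batched
  plannedCostUsd: number;   // estimated cost with LLM units at batch pricing
}

// Partition units by route and estimate the cost with and without batching.
function planBatch(units: WorkUnit[]): BatchPlan {
  const batch = units.filter((u) => u.route === "llm");
  const sync = units.filter((u) => u.route === "structured");
  const sum = (xs: WorkUnit[]): number => xs.reduce((acc, u) => acc + u.syncCostUsd, 0);
  const syncOnlyCostUsd = sum(batch) + sum(sync);
  const plannedCostUsd = sum(batch) * BATCH_MULT + sum(sync);
  return { batch, sync, syncOnlyCostUsd, plannedCostUsd };
}
```

The planned-versus-sync-only ratio is what the per-doc cost columns would be expected to show on a batch-routed ingest; progress reporting would still need polling, since the text notes Batch has no webhooks.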
S8 — Failure surfacing. [old S5]
- Deliverable: a failed conversion/ingest is visible to the uploader — per-doc failure states in the UI, DLQ items traceable to a document + reason, continue-on-error preserved.
- Mechanics: Brehob's legacy corpus guarantees conversion failures; silent loss is the worst outcome. DLQs were purged to a clean baseline 6/10 — every future item is a real signal. Surface per-doc state + reason; keep batch ingest continue-on-error.
- Depends on: S1 (conversion failure modes), the ingest pipeline + UI.
- Acceptance: one deliberately-poisoned doc fails visibly in the UI with a traceable reason; the rest of the batch completes.
- Size: ~1–2 days.
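
Continue-on-error with a per-document failure state can be sketched as below. The `DocState` type and `ingestAll` are hypothetical, and the loop is synchronous for brevity where a real pipeline would presumably be async and queue-driven.

```typescript
// Hypothetical per-document outcome surfaced to the uploader.
type DocState =
  | { docId: string; status: "ingested" }
  | { docId: string; status: "failed"; reason: string };

// Ingest every document; one failure is recorded with its reason
// and does not stop the rest of the batch.
function ingestAll(docIds: string[], ingestOne: (docId: string) => void): DocState[] {
  const states: DocState[] = [];
  for (const docId of docIds) {
    try {
      ingestOne(docId);
      states.push({ docId, status: "ingested" });
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      states.push({ docId, status: "failed", reason });
    }
  }
  return states;
}
```

The acceptance test above maps directly onto this shape: one poisoned document yields a `failed` state with a reason, and every other document still reports `ingested`.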
Dependencies & execution waves
Topo order from the depends-on edges (story numbers are dependency-ordered, so they mostly flow forward):
- Wave 1 (no upstream): S0 (gold), S1 (conversion), S3 (schema substrate), plus an early S2 table-chunking spike (thin excel.ts port + grid-parse on the Slate Trucks sheet) to test the row-granular deterministic-extraction premise before S3/S4 commit (red-team 6/15). The spike feeds the S0–S2 checkpoint with real data. S7/S8 skeletons can also start.
- Wave 2: S2 (needs S1): the full table-aware chunker, building on what the wave-1 spike learned.
- Wave 3: S4 (needs S2 + S3).
- Wave 4: S5 (needs S3 + S4); S6 (needs S3 + S4 + S7).
- Threaded throughout: S7 (parallel; wires into S4), S8 (parallel; wires into S1).
Critical-path spine: S1 → S2 → S4 → S5 (convert → tables → extract → operate) is the longest chain and the go-live-critical capability; S3 runs parallel but is required by S4/S5; S0 gates acceptance up front. The S0–S2 checkpoint (S0 bars + the S1/S2 conversion-table-$/doc read, seeded by the wave-1 spike) is the go/no-go + trim-decision point.
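
The wave grouping above follows mechanically from the depends-on edges. The sketch below derives it with a simple layered topological sort; `computeWaves` is a hypothetical helper, and the edge list in the usage covers only the hard dependencies of S0–S6, leaving the threaded stories (S7, S8) and S6's economics-only link to S7 out.

```typescript
// Group stories into waves: a story is ready once all its dependencies are done.
function computeWaves(deps: Record<string, string[]>): string[][] {
  let pending = Object.keys(deps).sort();
  const done: string[] = [];
  const waves: string[][] = [];
  while (pending.length > 0) {
    const ready = pending.filter((s) =>
      (deps[s] ?? []).every((d) => done.indexOf(d) !== -1),
    );
    if (ready.length === 0) {
      throw new Error("dependency cycle or unknown dependency");
    }
    waves.push(ready);
    for (const s of ready) done.push(s);
    pending = pending.filter((s) => ready.indexOf(s) === -1);
  }
  return waves;
}

// Hard depends-on edges for S0–S6, as listed in the stories above.
const storyDeps: Record<string, string[]> = {
  S0: [],
  S1: [],
  S2: ["S1"],
  S3: [],
  S4: ["S2", "S3"],
  S5: ["S3", "S4"],
  S6: ["S3", "S4"],
};

const storyWaves = computeWaves(storyDeps);
```

With these edges the result is four waves: S0/S1/S3, then S2, then S4, then S5/S6, matching the list above.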
Sizing & trims
| Story | Size (d) | Wave | Trim? |
|---|---|---|---|
| S0 gold + bars | ~1 | 1 | — |
| S1 conversion | 2–4 | 1 | — |
| S2 routing + tables | 2–3 | 2 (spike in 1) | — |
| S3 schema substrate | 2–3 | 1 | — |
| S4 extraction | 3–4 | 3 | — |
| S5 filter-then-rank | 2–3 | 4 | — |
| S6 curation UI | 3–5 | 4 | 🔵 manual-declare for Brehob → defer UI |
| S7 batch | 2–3 | threaded | 🔵 fallback ③ → defer if $/doc modest |
| S8 failure surfacing | 1–2 | threaded | — |
Estimates, to be pressure-tested at the S0–S2 checkpoint — they are not commitments. Full S0–S8 ≈ 18–28 focused days ≈ 4–6 eng-weeks. With the candidate trims (S6 → manual declaration, S7 → fallback ③, promote-then-backfill → post-go-live), the go-live-critical core ≈ 12–18 days ≈ 2.5–3.5 eng-weeks. Window: must finish before M3 (Jul 20–31) on solo bandwidth (+ weekly Brehob meetings + legal track + beta keep-alive) — tight. Trim policy (decided 6/15): reactive, not pre-committed — the trims above are candidates; the cut decision is made at the S0–S2 checkpoint on real sizing data (the wave-1 spike + S0–S2 measurements), not on these estimates.
Acceptance
The eval slice re-ingested through the production path end-to-end: legacy formats convert server-side (S1), regime routing correct + tables row-granular (S2), per-chunk keyword metadata + typed attributes present (S3/S4), the filter-then-rank operator answers attribute queries (S5), and one deliberately-poisoned doc fails visibly in the UI (S8). Batch (if in scope) shows the ~50% cost cut in the per-doc cost columns (S7). Ships behind the harness gate — an Eval run gates the merge.
Pass bars (LOCKED 2026-06-15, Dan)
Framed as trust thresholds (QuoteAI has a human approval step), not correctness guarantees.
| Dimension | Bar |
|---|---|
| Extraction — specs (hp/cfm/psi, cell route) | ≥ 98% |
| Extraction — prose attrs (lubrication, dates, LLM route) | ≥ 90% |
| Vector recall@10 (semantic Qs) | ≥ 0.90 |
| FTS recall@5 (model/part-number Qs) | ≥ 0.85 |
| Structured-filter recall (attribute Qs) | ≥ 0.95 |
| Hybrid vs vector lift | positive + clears the noise floor (significance, not point estimate) |
| One-time $/doc (ingest) | ≤ $0.25 |
| Recurring $/query (retrieval) | ≤ $0.05 |
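
One way to turn the table into a merge gate is a small check that reports which bars a measured scorecard misses. The dimension keys and function name are assumptions; the thresholds are the locked values from the table. The hybrid-vs-vector lift row is left out because it is a significance test, not a threshold.

```typescript
// A bar is either a floor (accuracy, recall) or a ceiling (cost).
interface Bar {
  dimension: string;
  kind: "min" | "max";
  threshold: number;
}

// Thresholds copied from the locked table; the dimension keys are hypothetical.
const PASS_BARS: Bar[] = [
  { dimension: "extraction_specs", kind: "min", threshold: 0.98 },
  { dimension: "extraction_prose", kind: "min", threshold: 0.9 },
  { dimension: "vector_recall_at_10", kind: "min", threshold: 0.9 },
  { dimension: "fts_recall_at_5", kind: "min", threshold: 0.85 },
  { dimension: "structured_filter_recall", kind: "min", threshold: 0.95 },
  { dimension: "ingest_usd_per_doc", kind: "max", threshold: 0.25 },
  { dimension: "query_usd", kind: "max", threshold: 0.05 },
];

// Return the dimensions that miss their bar; a missing measurement counts as a miss.
function failedBars(measured: Record<string, number>): string[] {
  const failed: string[] = [];
  for (const bar of PASS_BARS) {
    const has = Object.prototype.hasOwnProperty.call(measured, bar.dimension);
    const value = measured[bar.dimension];
    if (!has || value === undefined) {
      failed.push(bar.dimension);
    } else if (bar.kind === "min" ? value < bar.threshold : value > bar.threshold) {
      failed.push(bar.dimension);
    }
  }
  return failed;
}
```

Treating a missing measurement as a failure keeps the gate honest: a dimension that was never scored cannot pass by omission.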
Eval phasing (same corpus, escalating questions)
- Phase 1 — Slate Trucks alone: proves extraction accuracy + operator mechanics. Recall here is a sanity check only — an n=1 small-corpus artifact, not a real grade (one candidate can't be discriminated).
- Phase 2 — + 3–5 distractor quotes (blower-purge dryers, the Hankison fridge-dryer ladder, ideally a 2nd compressor at a different HP/lubrication): proves retrieval discrimination; the recall + filter bars become the real grade. ← this epic's true acceptance.
- Phase 3 — end-to-end build + A/B (hold out the real Slate quote, build it from precedent quotes, judge generated-vs-real on specs / line-items / pricing / phrasing): the QuoteAI vertical's (roadmap item 4) acceptance, not S0's. Seeded by the Slate Trucks pair.
Regression (existing corpus, pre/post S4)
Re-run the full existing scorecard: vector + lookup stable (we add columns, don't re-chunk), FTS holds-or-lifts on prose (the keyword metadata is generic). No typed-attribute gold on novels — attributes are schema-gated (OD9); a schemaless KB stores none (the LLM may still propose candidates, but nothing typed is stored/filterable without a declared schema). Optional anti-overfit: declare a minimal date / section_id schema on one reg doc (FIA / Constitution) to validate the typed path on non-quote content.
Gold seed (Slate Trucks, extracted 2026-06-15)
Chorus 90: hp 1000, cfm 3840, psi 125, lubrication oilless (prose-only — the built-in LLM-path test), water-cooled, 6 units, $473,498/ea. HPB4300 Hankison dryer: cfm 4300, 6 units. Spreadsheet (cell route) + proposal (datasheet + prose) corroborate — one case exercises both extraction routes and cross-checks them. Slug brehob-proposals__slate-trucks.
References
- Architecture (do not re-litigate): sub-systems/ingestion-pipeline — OD1 (JSONB + partial-indexes), OD4 (filter-then-rank), OD9/OD10 (schema lifecycle), OD12 (composable pre-filter).
- Code (verified 6/15): retrieval operators retrieval/src/{vector-search,fts-search,lookup-section}.ts (+ types.ts hybrid_search stub); routing ingestion/extractor/route.ts; LLM call ingestion/extractor/sdk-client.ts (invokeSdkExtraction); chunks db/migrations/001_init.sql; KB db/migrations/002_kb_primitive.sql (superseded_at is here, document-level); cost db/migrations/013_ingest_cost.sql + 014_query_cost.sql; ingestion/pricing.ts (BATCH_MULT); Excel parser quoteai/ingestion/parsers/excel.ts.
- Decisions: DB-3 (two paths, three regimes), D62/D63 (FTS=0 finding), chunking-search verdict (Haiku grouping kept). Eval discipline: eval-pipeline skill; significance-over-point-estimates.
- Build-kickoff hygiene: run a precise deployed-vs-main diff on the ingestion path before S1 (deployed can lag main).
- Pull back archive/epics/batch-ingestion (into S7) and archive/sub-systems/{eval-corpus-and-doe, pipeline-eval-harness, unified-chunking-markdown}.