Ingestion & Retrieval Pipeline — Sub-system Design Doc
The pipeline that turns raw documents into retrievable knowledge, plus the retrieval operators that serve it. This doc adds one new capability — structured-attribute retrieval — and the coupled pipeline changes it requires, positioned against the current (deployed + local) state. It is the architecture layer above ingestion-foundation (roadmap item 1 of brehob-launch) and the generic substrate consumed by QuoteAI (spec-match) and dev-memory (recency). → North Star B1 / B3.
Status: DRAFT — authored 2026-06-15 out of the Gate-0 red-team, which converged on "Autri is missing exactly one retrieval capability." OD1/OD2/OD4 locked 2026-06-15; review folded in same day (lookup-vs-attribute, tables, versioning split, graph-RAG deferral, OD9–OD12). The remaining Open Decisions are the triage targets.
Risks & Constraints
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Extraction fidelity is poor on the real legacy corpus (20 years of drifting .doc templates) | Med | High | Deterministic-first, LLM-fallback; the harness measures extraction accuracy per field before we trust it; tune extraction, not rearchitect. |
| The primitive over-fits to quotes (n=1 vertical) | Med | Med | Keep typed attributes generic on chunks; entity rollups stay in the vertical; validate on the dev-memory consumer too, not just Brehob. |
| Attribute extraction adds an LLM pass and blows cost | Low | Med | Piggyback the existing Haiku grouping call; deterministic route pays nothing; cost measured via migrations 013/014. |
| Building before measuring (we skipped the throwaway spike) | Med | Low | The capability is wanted in Autri regardless of Brehob; S0 gold + pass bars are authored before the build; incremental build with a harness checkpoint per step. |
| Deployed pipeline has drifted behind main | Med | Med | Run a precise deployed-vs-main diff on the ingestion path as a build-kickoff step (not a design blocker). |
Overview
Where this sub-system stands today, how it got here, and what it is.
Current Status
| Capability | Status |
|---|---|
| Three retrieval operators — vector (semantic), FTS (keyword), lookup (exact section id) | Shipped (deployed) — retrieval/src/{vector-search,fts-search,lookup-section}.ts |
| Two-path routing — markdown→STRUCTURED deterministic (no LLM); prose→Haiku grouping (chunk-grouping-v3 / prose-v1) | Shipped (deployed) |
| Per-doc cost instrumentation (migrations 013/014) — token + USD per document, vision bucket included | Shipped (deployed) |
| Local eval harness — per-index recall@k + MRR scorecard, query-gold discipline (eval-pipeline skill) | Shipped (local-only) — see pipeline-eval-harness |
| Table / line-item handling — chunk_type:'table' is a label only, no parser | Greenfield |
| Per-chunk keyword metadata (fixes prose FTS=0) | Greenfield — ingestion-foundation S3 |
| Structured-attribute extraction + filter-then-rank operator (this doc) | Greenfield — the centerpiece |
The Story
The Gate-0 red-team set out to measure whether Autri's generic chunk-embed pipeline could ingest the Brehob quote corpus at retrievable fidelity. Grounding the doc's claims against the real repos flipped the premise. Three findings: (1) QuoteAI already extracts quotes into typed records (products with hp_range/cfm_range/psi_range NUMRANGE + lubrication) and retrieves with a hybrid filter-then-rank — the "numeric-filter primitive" the roadmap parked as a maybe-build already exists in the vertical. (2) The spreadsheet→formal-quote transformation is already built (Template-C parser + drafter); its per-line-item retrieval is 3/4 native to Autri (verbatim-description and analog lookups are pure vector/lookup), and the one gap is the numeric spec-match — a validation step, not the spine. (3) Mapping QuoteAI's six retrieval tools onto Autri's three operators, five are already covered; only search_equipment's structured filter is missing.
So the question stopped being "can the generic pipeline do this" and became "Autri is missing exactly one retrieval capability — structured-attribute filtering — and it is generic." A second consumer confirmed it isn't Brehob-specific: dev-memory dogfooding needs the same primitive to weight session decisions by recency (most-recent supersedes; history stays visible). The capability is therefore core substrate, and Gate-0 collapses from "spike before build" into "the eval-acceptance gate on this build."
What Is This Sub-system?
The ingestion + retrieval pipeline owns the path from a raw document to retrievable knowledge: convert → route → chunk → extract attributes → embed → index, and the operators that query the result. It exists as its own layer because every consumer (QuoteAI, dev-memory, every future vertical) plugs into the same retrieval contract; a change here ripples to all of them. This doc's addition — structured-attribute retrieval — is the fourth retrieval mode alongside vector / FTS / lookup, and the ingestion step that feeds it.
The Gap (verified against code)
Autri exposes exactly three retrieval operators and no structured-attribute filter:
- vectorSearch — cosine over embeddings; filters only on knowledge_base_id, documentIds, chunkTypes.
- ftsSearch — Postgres FTS; same filter surface.
- lookupSection — exact section_id match (e.g. C7.6.2); not an attribute query.
Chunks carry text + chunk_type + section_id + embedding + bbox — zero typed numeric/categorical attributes, and ingest extracts none. The concrete test "find equipment where 10 ≤ hp ≤ 20 AND lubrication = oilless AND cfm ≥ 90, ranked by relevance" cannot be answered today — no typed columns, no filter operator. (hybrid_search exists as a type name only; unimplemented.)
Target: Structured-Attribute Retrieval (two halves)
Half 1 — Generate typed attributes during ingest. At chunk time, attach typed attributes to each chunk: numeric (hp, cfm, psi), categorical (lubrication, manufacturer), temporal (date). This is the "LLM does semantics, code does mechanics" principle made concrete: read the value out of messy text, store it typed and queryable.
Half 2 — A filter-then-rank operator. A fourth retrieval mode: WHERE on typed attributes (range / equality / set / date), composed with ORDER BY embedding distance (or FTS rank). Hard constraints prune first; semantic similarity ranks the survivors — exactly QuoteAI's search_equipment shape, generalized.
Both are required: extraction without the operator is inert data; the operator without extraction has nothing to filter.
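To make the two halves concrete, here is a minimal in-memory sketch of filter-then-rank. It assumes chunks already carry typed attributes and an embedding. The names `AttrChunk`, `AttrFilter`, and `filterThenRank` are hypothetical and are not Autri's API; the real operator would run inside Postgres rather than in process.

```typescript
// Hypothetical shapes for illustration only; not the actual Autri chunk schema.
type AttrValue = number | string;

interface AttrChunk {
  id: string;
  attrs: Record<string, AttrValue>;
  embedding: number[];
}

// One hard constraint on a typed attribute.
type AttrFilter =
  | { field: string; op: "range"; min?: number; max?: number }
  | { field: string; op: "eq"; value: AttrValue }
  | { field: string; op: "in"; values: AttrValue[] };

function satisfies(chunk: AttrChunk, f: AttrFilter): boolean {
  const v = chunk.attrs[f.field];
  if (v === undefined) return false; // a missing attribute never passes a hard filter
  if (f.op === "range") {
    return (
      typeof v === "number" &&
      (f.min === undefined || v >= f.min) &&
      (f.max === undefined || v <= f.max)
    );
  }
  if (f.op === "eq") return v === f.value;
  return f.values.some((x) => x === v);
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Hard constraints prune first; cosine similarity ranks the survivors.
function filterThenRank(
  chunks: AttrChunk[],
  filters: AttrFilter[],
  queryEmbedding: number[],
  k: number
): AttrChunk[] {
  return chunks
    .filter((c) => filters.every((f) => satisfies(c, f)))
    .map((c) => ({ c, score: cosine(c.embedding, queryEmbedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.c);
}
```

Under these assumptions, the concrete test from The Gap would be three filters: a `range` on hp (10 to 20), an `eq` on lubrication, and a `range` on cfm with only `min` set.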
Deterministic-First, LLM-Fallback Extraction
The split is not "structured docs → regex, prose → LLM." It's how machine-regular the source is — mirroring the existing chunking philosophy (code is the default; the LLM fires only where the document doesn't declare its own structure):
- Machine-regular sources → deterministic/code extraction (exact, free, no LLM). A small family, by how the value is addressed:
  - Cell-coordinate — the value sits at a known (row, col). Template-C already does this: CFM = col 9, HP = col 11; map columns → declared attributes. Strongest case; needs the template layout.
  - Table-grid parse — a (converted) markdown pipe-table's header row gives column names, cells give values; parse the grid, map columns → attributes. (See Tables below.)
  - Labeled-pattern (regex) — a value following a regular label ("SYSTEM CAPACITY: 31.2 SCFM" → CAPACITY:\s*([\d.]+)). The brittle one; only "deterministic" where the label format is truly regular, else it's the LLM's job.
- Variable prose → LLM extraction. The 8,600+ .doc quotes span 20 years and many templates ("PLEX HORSEPOWER" vs "HP" vs "Horsepower"; drifting units/layout). The LLM's job is "get the HP however it's labeled," piggybacked on the existing Haiku grouping call (no second pass).
Note: "structured doc → no LLM" was about chunking. For attributes, even a structured doc can need the LLM if its values live in free prose — the split is value-regularity, not doc-type. We don't guess it; the harness measures deterministic coverage and the LLM fills the remainder, scored on extraction accuracy.
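A minimal sketch of the deterministic-first, LLM-fallback order for one numeric field. The function name, the `llmValues` parameter (standing in for whatever values the existing Haiku grouping call returned), and the mapping of the CAPACITY label to `cfm` are all assumptions for illustration; only the regex comes from the text above.

```typescript
// Hypothetical result shape: the route is recorded so the harness can
// measure deterministic coverage separately from LLM-filled values.
type ExtractedValue = { value: number; route: "deterministic" | "llm" };

// Labeled-pattern rules, valid only where the label format is truly regular.
// Mapping CAPACITY to cfm is an assumption made for this example.
const LABELED_PATTERNS: Record<string, RegExp> = {
  cfm: /CAPACITY:\s*([\d.]+)/,
};

function extractNumeric(
  field: string,
  text: string,
  llmValues: Record<string, number>
): ExtractedValue | null {
  // 1. Deterministic route: exact, free, no LLM.
  const pattern: RegExp | undefined = LABELED_PATTERNS[field];
  const m = pattern ? pattern.exec(text) : null;
  if (m) {
    const n = parseFloat(m[1]);
    if (!isNaN(n)) return { value: n, route: "deterministic" };
  }
  // 2. Fallback: use the value the LLM call already produced, if any.
  const fallback: number | undefined = llmValues[field];
  return fallback === undefined ? null : { value: fallback, route: "llm" };
}
```

Recording the route per value is one way to let the harness report deterministic coverage and score the LLM remainder separately.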
Tables — the densest attribute source
A table is where conversion, chunking, and attribute-extraction meet — and it's the richest attribute source in the corpus (a pricing table's CFM/PSI/HP/LIST columns are the typed attributes). The path:
- Convert to markdown (the conversion stage, ingestion-foundation S1) → the table in a parseable pipe-table form. Necessary, not sufficient.
- Grid-parse (the deterministic "table-grid" path above) → columns become typed attributes.
- Row-level granularity → each row becomes its own chunk carrying the column-attributes, so "the 45hp Powerex line item and its price" is retrievable — displayed as the whole table (retrieval ≠ display granularity).
Today the chunker leaves converted tables as chunk_type:'text' (no row granularity, no cell-attributes); closing that is the table-handling work in ingestion-foundation S2, and it's the precondition for line-item-level structured retrieval.
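The grid-parse and row-level steps could look roughly like the sketch below. It assumes a simple pipe-table with one header row and one separator row, and it does not handle escaped pipes inside cells. `RowChunk`, `parsePipeTable`, and the `columnToAttr` mapping are hypothetical names, not the S2 design.

```typescript
// Hypothetical row-level chunk: one per table row, carrying column attributes.
interface RowChunk {
  chunkType: "table";
  rowIndex: number; // position in the table, so the whole table can still be displayed
  text: string; // the raw row as converted
  attrs: Record<string, number | string>;
}

function splitRow(line: string): string[] {
  return line
    .trim()
    .replace(/^\|/, "")
    .replace(/\|$/, "")
    .split("|")
    .map((cell) => cell.trim());
}

// columnToAttr maps header names to declared attribute names, e.g. { HP: "hp" }.
function parsePipeTable(markdown: string, columnToAttr: Record<string, string>): RowChunk[] {
  const lines = markdown
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.length > 0);
  if (lines.length < 3) return []; // need header, separator, and at least one row
  const header = splitRow(lines[0]);
  const rows: RowChunk[] = [];
  lines.slice(2).forEach((line, rowIndex) => {
    const cells = splitRow(line);
    const attrs: Record<string, number | string> = {};
    header.forEach((col, i) => {
      const attr = columnToAttr[col];
      const cell = cells[i];
      if (!attr || cell === undefined || cell === "") return; // undeclared column or empty cell
      const n = Number(cell.replace(/[$,]/g, "")); // tolerate "$12,500"
      attrs[attr] = isNaN(n) ? cell : n;
    });
    rows.push({ chunkType: "table", rowIndex, text: line, attrs });
  });
  return rows;
}
```

Keeping `rowIndex` on each row chunk is one way to honor "retrieval ≠ display granularity": a row is retrieved on its own, and the table can be reassembled for display.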
The Autri ↔ QuoteAI Boundary
Attributes live on chunks (generic, filterable) — this is Autri substrate. Entity rollups (QuoteAI's products / quotes typed tables, used for cross-corpus aggregation like "the cheapest oilless 15hp system we've quoted") stay in the vertical. The trade: per-chunk attributes answer "find content matching these specs" cleanly; entity-level aggregation/dedup is weaker per-chunk and belongs to the vertical that needs it. This line keeps the substrate generic and the vertical thin.
Recency & Supersession (the dev-memory consumer)
The dev-memory dogfood needs more than date > X: when a decision flips across sessions, the current session must see both the standing decision and the history, with the most-recent treated as authoritative. Resolved 2026-06-15 (ingestion-foundation S5 red-team): this is served by the chunk-level date attribute (a first-class typed attribute) + recency as a rank-boost — not by supersession. History stays retrievable, just ranked lower.
Hard supersession is a separate, document-level concept: superseded_at lives on documents, not chunks (an earlier draft of this section said "on chunks" — corrected), and retrieval already honors it via includeSuperseded?. It's for a replaced document version, not a down-ranked older decision. So the operator needs no new chunk-level temporal model — recency-rank for "newest wins, history visible", document-level supersession for replaced versions. Designing it with Brehob's spec-match and dev-memory's recency in view is what keeps it generic rather than quote-shaped.
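A small sketch of recency as a rank-boost rather than a filter. The exponential half-life formula and the `halfLifeDays` and `weight` parameters are assumptions chosen for illustration; the doc only fixes the behavior (newest ranks higher, history stays retrievable).

```typescript
// Hypothetical hit shape: similarity from the ranker plus the chunk-level date attribute.
interface DatedHit {
  id: string;
  similarity: number;
  date: string; // ISO date, e.g. "2026-06-10"
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Boost decays by half every halfLifeDays; nothing is dropped, only re-ordered.
function boostByRecency(
  hits: DatedHit[],
  now: Date,
  halfLifeDays: number,
  weight: number
): DatedHit[] {
  const scored = hits.map((h) => {
    const ageDays = Math.max(0, (now.getTime() - new Date(h.date).getTime()) / MS_PER_DAY);
    const boost = Math.pow(0.5, ageDays / halfLifeDays);
    return { h, score: h.similarity * (1 + weight * boost) };
  });
  scored.sort((a, b) => b.score - a.score);
  return scored.map((s) => s.h);
}
```

Because the boost multiplies similarity instead of filtering on date, an older decision that matches the query much better can still outrank a newer weak match.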
Schema Lifecycle
Defining the schema and evolving it are the same flow — the system proposes, the user curates — at different times. JSONB storage (OD1) is what makes this cheap.
- Bootstrap (KB creation). The user doesn't hand-author a typed schema cold. On the first batch of docs, the extractor proposes a candidate schema (a townhouse-purchase KB → purchase_price, closing_date, address, loan_amount, interest_rate); the user curates — accept / rename / prune — in the UI. System-controlled KBs (dev-memory) may declare directly. The user curates; they don't author from scratch.
- Steady state. Docs with values for already-declared attributes are just extracted. Because extraction targets the declared schema, synonyms collapse ("sq ft" → declared square_footage); only genuinely new concepts escape.
- Expansion. A doc introducing a new attribute is flagged as a candidate (rides the existing call — "also saw property_tax_rate = 1.2%"); the schema does not silently expand. The system surfaces "seen property_tax_rate in N docs — promote to a filterable field?" Noise stays out until the user opts in.
- Promote-then-backfill. On promote: (a) the field is added to the KB schema; (b) a typed partial-index is created (online, no table change); (c) a targeted backfill re-extracts just that attribute across existing docs (bounded, single-attribute, rides Batch + incremental re-ingestion) and fills the JSONB. No full re-ingestion — and the docs were fully retrievable by vector/FTS throughout; promotion only adds filterability.
- Versioning (two kinds, don't conflate). This work needs only a light KB-schema version stamp — so "which docs predate this attribute" is answerable for promote-then-backfill. Full document-content versioning (a doc re-uploaded/edited) is a separate, larger capability — the Incremental Re-Ingestion epic — that this work does not block on (it comes after, and composes: re-ingesting a changed doc only re-extracts changed units' attributes).
The JSONB payoff: schema evolution is a metadata + index + bounded-backfill operation, not a table migration + full re-extract.
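Step (b) of promote-then-backfill might generate DDL along these lines. This is a sketch under stated assumptions: a `chunks` table with a JSONB column named `attrs`, and UUID-shaped `knowledge_base_id` values; none of those names are confirmed by this doc. The helper `partialIndexDdl` is hypothetical.

```typescript
// Only numeric and text are offered here. A text-to-date cast is not immutable
// in Postgres, so it cannot appear in an index expression; ISO-8601 date strings
// can be indexed as text because they sort lexicographically.
type IndexedType = "numeric" | "text";

const IDENT = /^[a-z_][a-z0-9_]*$/;
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Builds a KB-scoped partial expression index for one promoted field.
// Index predicates cannot take bind parameters, so both inputs are
// validated before being inlined into the statement.
function partialIndexDdl(kbId: string, field: string, type: IndexedType): string {
  if (!IDENT.test(field)) throw new Error(`invalid attribute name: ${field}`);
  if (!UUID.test(kbId)) throw new Error(`invalid knowledge base id: ${kbId}`);
  const expr =
    type === "text" ? `(attrs ->> '${field}')` : `((attrs ->> '${field}')::numeric)`;
  const name = `chunks_attr_${field}_${kbId.replace(/-/g, "").slice(0, 8)}`;
  // CONCURRENTLY builds the index without blocking writes to the table.
  return (
    `CREATE INDEX CONCURRENTLY IF NOT EXISTS ${name} ON chunks (${expr}) ` +
    `WHERE knowledge_base_id = '${kbId}'`
  );
}
```

The statement touches only index metadata, which matches the point above: promotion is an index plus a bounded backfill, not a table migration.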
Architecture
The internal shape of the pipeline and where it interfaces outward.
Architecture Diagram
raw doc ─▶ convert ─▶ route(docClass probe) ─┬─ STRUCTURED ─▶ deterministic chunk ─┐
                                             │                                     │
                                             └─ PROSE ──────▶ Haiku grouping ──────┤
                                                                                   ▼
                                        ┌────────── attribute extraction ──────────┐
                                        │ deterministic (cell / grid / pattern) ─┐ │
                                        │ LLM fallback (rides Haiku call) ───────┴─▶ typed attrs
                                        └──────────────────────────────────────────┘
                                                                                   ▼
                                        chunk { text, type, section_id, attrs, embedding }
─ retrieval ─────────────────────────────────────────────────────────────────────────
  CONTENT SEARCH      query ─▶ rank by ┬─ vector (semantic)
                                       └─ fts (keyword)
  STRUCTURED ACCESS
    • attribute filter (NEW): WHERE attrs (range/eq/in/date) ─prunes─▶ rank by [ vector | fts | recency ]
      (the filter is an optional pre-stage; omit it = today's behavior — see OD12)
    • lookup: by section_id — hierarchy traversal + document order, no ranking
System Boundary
Inside: conversion, routing, chunking, attribute extraction, embedding, the chunk schema (incl. typed attributes), and the four retrieval operators. Outside: the eval harness (grades this layer, doesn't own it — pipeline-eval-harness); the vertical's entity rollups + drafter (QuoteAI); deploy/tenancy (enterprise-deploy). The one requirement this layer imposes outward: consumers declare the typed attributes they care about (the per-KB attribute schema).
Key Interfaces
| Interface | Type | Consumers |
|---|---|---|
| filterRankSearch(kb, {attrFilters, query, recency?, k}) | Function (NEW retrieval op) | QuoteAI spec-match; dev-memory recency; future verticals |
| Per-chunk typed attributes — JSONB + per-field typed partial-indexes (OD1) | Schema (chunk column) | The operator; the harness scorer |
| Per-KB attribute schema (declared typed fields + lifecycle) | Config | Ingest extraction; validation; index creation; each consuming library |
| Attribute-extraction stage (deterministic-first, LLM-fallback) | Pipeline stage | Rides the existing extractor; gated by the harness |
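One way the attrFilters argument could compile to SQL, sketched as a pure query builder. It assumes a `chunks` table with `attrs` (JSONB), `embedding` (pgvector), and `knowledge_base_id` columns, and uses pgvector's `<=>` cosine-distance operator. `SqlFilter`, `BuiltQuery`, and `buildFilterRankQuery` are hypothetical names; the recency option is omitted.

```typescript
// Hypothetical filter shapes for the attrFilters argument.
type SqlFilter =
  | { field: string; op: "range"; min?: number; max?: number }
  | { field: string; op: "eq"; value: string }
  | { field: string; op: "in"; values: string[] };

interface BuiltQuery {
  text: string; // SQL with $n placeholders
  params: unknown[]; // values bound in order
}

function buildFilterRankQuery(
  kbId: string,
  filters: SqlFilter[],
  queryEmbedding: number[],
  k: number
): BuiltQuery {
  const params: unknown[] = [kbId];
  const where: string[] = ["knowledge_base_id = $1"];
  const bind = (v: unknown): string => {
    params.push(v);
    return `$${params.length}`;
  };
  for (const f of filters) {
    // Field names are inlined, so they must be validated; values are always bound.
    if (!/^[a-z_][a-z0-9_]*$/.test(f.field)) throw new Error(`invalid attribute name: ${f.field}`);
    const col = `(attrs ->> '${f.field}')`;
    if (f.op === "range") {
      if (f.min !== undefined) where.push(`${col}::numeric >= ${bind(f.min)}`);
      if (f.max !== undefined) where.push(`${col}::numeric <= ${bind(f.max)}`);
    } else if (f.op === "eq") {
      where.push(`${col} = ${bind(f.value)}`);
    } else {
      where.push(`${col} = ANY(${bind(f.values)}::text[])`);
    }
  }
  // pgvector accepts a '[x,y,...]' text literal cast to vector.
  const vec = bind(`[${queryEmbedding.join(",")}]`);
  const limit = bind(k);
  const text =
    `SELECT id, text FROM chunks WHERE ${where.join(" AND ")} ` +
    `ORDER BY embedding <=> ${vec}::vector LIMIT ${limit}`;
  return { text, params };
}
```

With an empty `filters` array the WHERE clause reduces to the KB scope and the query is a plain vector search, which lines up with the pre-filter being optional.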
Retrieval Operators — Lookup vs Structured-Attribute
The four operators answer four different questions:
| Operator | The question | Keys on | Returns |
|---|---|---|---|
| Vector | "find content that means this" | semantic similarity | ranked, fuzzy |
| FTS | "find content that says these words" | exact lexical tokens | ranked, by keyword |
| Lookup | "fetch the content located at this address" | the document's declared structure (section_id) | exact, document order, hierarchy-aware |
| Structured-attribute | "find content whose extracted properties satisfy these constraints" | derived facts (hp, cfm, date…) | filter-then-rank |
The boundary that matters: lookup keys on the document's intrinsic identity (the address the author assigned; exact, ordered, hierarchy-aware — it walks the sections tree), while structured-attribute keys on facts we derived (ranges, sets, comparisons + ranking). Given structure vs derived facts; retrieve by address vs retrieve by property. They even shine on different doc types — lookup on highly-structured docs with real addresses (regs, contracts), attribute-filter on fact-laden docs you query by value (quotes, pricing sheets).
Could the attribute operator absorb lookup? Partly — section_id is itself a categorical attribute, so exact-match lookup is expressible as WHERE section_id = X. But lookup uniquely adds (1) hierarchy traversal (ask for C7, get its whole subtree — needs the section tree, not flat equality) and (2) document-order contiguous return (the section as written, no relevance ranking). So lookup isn't redundant — it's the structural-address pattern.
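The two things lookup adds over flat equality can be shown in a few lines. This sketch assumes dotted section ids encode containment (C7.6.2 sits under C7.6 and C7) and that each chunk carries a document-order position; the real lookupSection walks the section tree, so treat `lookupSubtree` and `SectionChunk` as illustrative stand-ins.

```typescript
// Hypothetical chunk shape with a section address and a document-order position.
interface SectionChunk {
  sectionId: string;
  order: number;
  text: string;
}

// Returns the section and its whole subtree, in document order, with no relevance ranking.
function lookupSubtree(chunks: SectionChunk[], root: string): SectionChunk[] {
  const prefix = root + ".";
  return chunks
    .filter((c) => c.sectionId === root || c.sectionId.slice(0, prefix.length) === prefix)
    .sort((a, b) => a.order - b.order);
}
```

Matching on `root + "."` rather than on the bare prefix keeps C70 out of a lookup for C7, which a naive `WHERE section_id LIKE 'C7%'` would not.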
The section tree is also the graph-shaped thing: a containment tree (one-relationship graph), and QuoteAI's entity FKs are a second small graph. A full knowledge graph / graph RAG is deferred (see Decisions Log) — it earns its keep only on multi-hop relationship queries we don't have yet; when one appears, it belongs in the entity-rollup layer, not the chunk substrate.
→ Both questions are now locked (see Open Decisions): OD11 keeps lookup a distinct operator that shares the typed-attribute storage, and OD12 makes structured-attribute a composable pre-filter prepended to vector/fts/recency.
Eval Integration (the harness is the gate)
This capability is a first-class harness citizen, which is why we can build it before fully measuring it. Two new axes on the existing per-index scorecard:
- Extraction accuracy — precision/recall of extracted typed values vs a hand-labeled attribute gold (did we get hp=20, cfm=31.2, lubrication=oilless, date=2019-07-19?). Cheap to score; no retrieval needed.
- Filter-then-rank recall — add a structured-filter index to the per-index gold (alongside vector/FTS/lookup), with attribute-filter queries.
Discipline carries over from the harness: baseline-first (delta vs current), per-type floors, significance over point estimates (small-n noise floor stated). An Eval run gates the merge.
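The extraction-accuracy axis is simple enough to sketch. This assumes extracted and gold attribute maps are aligned by chunk position and compared by exact value equality; `scoreExtraction` is a hypothetical name, and the harness may well score per field or with tolerances instead.

```typescript
type AttrMap = Record<string, number | string>;

// Precision: share of extracted values that are correct.
// Recall: share of gold values that were extracted correctly.
function scoreExtraction(
  extracted: AttrMap[],
  gold: AttrMap[]
): { precision: number; recall: number } {
  let correct = 0;
  let predicted = 0;
  let expected = 0;
  gold.forEach((g, i) => {
    const e: AttrMap = extracted[i] || {};
    predicted += Object.keys(e).length;
    expected += Object.keys(g).length;
    for (const field of Object.keys(g)) {
      if (e[field] === g[field]) correct++;
    }
  });
  return {
    precision: predicted === 0 ? 0 : correct / predicted,
    recall: expected === 0 ? 0 : correct / expected,
  };
}
```

For the gold record in the first bullet, an extractor that returns hp and cfm correctly, gets lubrication wrong, and misses the date scores 2 correct out of 3 predicted and 2 out of 4 expected.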
Open Decisions (red-team targets)
Status 2026-06-15: all decisions closed. OD1/OD2/OD4 locked; OD3, OD5–OD8 settled; OD9–OD12 locked on review. Ready for the ingestion-foundation epic refinement + red/blue-team.
| # | Decision | Resolution |
|---|---|---|
| OD1 | Storage shape | LOCKED — per-KB declared schema drives JSONB storage + typed partial expression-indexes (one per declared filterable field, scoped by knowledge_base_id), with write-time validation. Physical typed-columns-per-KB rejected: runtime DDL + table-per-KB sprawl fights the one-table row-level tenancy model, for a marginal query-speed gain. |
| OD2 | Extraction contract | LOCKED — the per-KB declared attribute schema is the control plane: it targets extraction, validates on write, and defines the filterable surface. |
| OD3 | Granularity / boundary | Settled — attributes on chunks → Autri; entity rollups → the vertical. |
| OD4 | Operator surface | LOCKED — hard filter (WHERE) then rank (embed/FTS); recency an optional boost on the survivors. |
| OD5 | Extraction cost/route | Settled — piggyback the Haiku grouping call; deterministic route is free; measure via 013/014. |
| OD6 | Deterministic-vs-LLM routing | Settled — deterministic-first, LLM-fallback; harness measures the split. |
| OD7 | Supersession model | Settled — build on superseded_at; recency is a rank signal, history stays visible. |
| OD8 | Eval shape | Settled — extraction-accuracy gold + a structured-filter index in the scorecard. |
| OD9 | Candidate-flagging vs observe-everything | LOCKED → candidate-flagging. Extraction targets the declared schema and cheaply flags candidate new attributes (rides the existing call) → promote-then-backfill. Observe-and-store-everything rejected (speculative extraction + JSONB bloat). |
| OD10 | Onboarding UX | LOCKED → propose-and-curate. LLM proposes a schema from the first docs; the user accepts/renames/prunes; manual declaration also supported (system KBs like dev-memory). Manual-only = friction; fully-auto = drift. |
| OD11 | section_id as a built-in attribute? | LOCKED → keep distinct, share storage. Lookup stays its own operator (its hierarchy + document-order semantics differ); it shares the typed-attribute plumbing but is not collapsed into the filter. |
| OD12 | Operator vs composable pre-filter | LOCKED → composable pre-filter. Structured-attribute is a WHERE pre-stage that prepends to the vector/fts/recency rankers (one filter, reused across rankers), not a standalone 4th operator. |
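OD1, OD2, and OD9 together imply a write-time check of roughly this shape: declared fields are validated and stored, undeclared fields are flagged as candidates and not stored. The type names, the three field types, and the ISO-date rule are assumptions for this sketch, not the locked schema format.

```typescript
// Hypothetical per-KB declared schema: attribute name -> declared type.
type FieldType = "number" | "string" | "date";
type KbSchema = Record<string, FieldType>;

interface ValidationResult {
  accepted: Record<string, number | string>; // written to the chunk's JSONB
  candidates: string[]; // undeclared fields: flagged for promotion, not stored
  errors: string[]; // declared fields whose value failed the type check
}

const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

function validateAttrs(schema: KbSchema, attrs: Record<string, unknown>): ValidationResult {
  const result: ValidationResult = { accepted: {}, candidates: [], errors: [] };
  for (const field of Object.keys(attrs)) {
    const value = attrs[field];
    const declared = Object.prototype.hasOwnProperty.call(schema, field)
      ? schema[field]
      : undefined;
    if (declared === undefined) {
      result.candidates.push(field);
    } else if (declared === "number" && typeof value === "number" && isFinite(value)) {
      result.accepted[field] = value;
    } else if (declared === "string" && typeof value === "string") {
      result.accepted[field] = value;
    } else if (declared === "date" && typeof value === "string" && ISO_DATE.test(value)) {
      result.accepted[field] = value;
    } else {
      result.errors.push(`${field}: expected ${declared}`);
    }
  }
  return result;
}
```

Keeping candidates out of `accepted` is what prevents the silent schema expansion and JSONB bloat that OD9 rejects.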
Related Epics
| Epic | Doc | Status | Summary |
|---|---|---|---|
| Ingestion Foundation | ingestion-foundation | Refined (6/15) | The work breakdown — refined to executable story-level detail 2026-06-15 (nine dependency-ordered stories, S0–S8): the eval gold + pass bars, the schema / JSONB / partial-index substrate, the keyword + typed-attribute extraction stage, the filter-then-rank operator, and the schema-curation UI. |
| Gate-0 Corpus Spike | gate-0-corpus-spike | Re-scoped | No longer a spike-before-build; becomes the eval acceptance of this capability on the real corpus (Slate Trucks + slice), folded into ingestion-foundation. |
| QuoteAI Vertical | quoteai-vertical | Planned | Consumes the operator for spec-match; keeps its entity rollups + drafter. Output ⑥ (numeric primitive) resolved here. |
| Dev-Memory | dev-memory | Planned | Second consumer — recency/supersession over session transcripts. ⚠️ Confirm the supersession grain: superseded_at is document-level today, which may be the right grain for episode-documents. |
| Incremental Re-Ingestion | (to be written) | Planned (after) | Document-content versioning: detect a re-uploaded doc, chunk-diff by content_hash (content-based, not positional), re-process only changed units. Composes with attribute extraction (re-extract only changed units). The S6 single-attribute backfill does not depend on it. |
Cross-Cutting Concerns
| Concern | How This Sub-system Is Affected |
|---|---|
| Cost (D16/D18) | Deterministic extraction is free; LLM extraction rides the existing Haiku call; Batch economics (ingestion-foundation S4) apply. Measured via 013/014. |
| Multi-tenancy (D13) | Attributes are per-KB; the per-KB attribute schema is the tenant-scoped declaration; partial-indexes are KB-scoped. |
| LLM-semantics / code-mechanics | The governing principle: LLM extracts the value (and proposes the schema), code stores/filters/indexes it typed. |
| Local CI/CD for agentic coding | Extraction-accuracy scoring is deterministic → fits the local gate; filter-then-rank recall is Eval-mode (needs Postgres + embeddings). |
Decisions Log
| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-06-15 | Autri's one retrieval gap is structured-attribute filtering; build it as a generic 4th mode | 5 of QuoteAI's 6 tools already map to Autri's 3 operators; the 6th (search_equipment) is the only gap, and it's generic | Lift QuoteAI's whole typed schema into Autri (premature, n=1); ship Brehob on QuoteAI as-is (re-opens two-codebases) |
| 2026-06-15 | Gate-0 merges into this build; the eval harness is the gate | The architectural de-risking is done by inspection; remaining unknowns need a built thing to measure; the capability is wanted regardless of Brehob | Throwaway spike first (wasteful — the build is wanted anyway) |
| 2026-06-15 | Attribute extraction is deterministic-first, LLM-fallback | Spreadsheets/forms are machine-regular (free, exact); 20-year prose is too variable for regex | Regex-everything (brittle on prose); LLM-everything (cost, and pointless on cell-regular sources) |
| 2026-06-15 | Typed attributes live on chunks (Autri); entity rollups stay in the vertical | Keeps the substrate generic and the vertical thin; avoids designing the abstraction from one example | Entity tables in Autri (domain-coupled substrate) |
| 2026-06-15 | Design the operator for both spec-match and recency/supersession | Two real consumers (Brehob, dev-memory) keep it generic, not quote-shaped | Quote-only filter (would under-serve dev-memory and bake in quote assumptions) |
| 2026-06-15 | Typed attributes = per-KB-declared schema over JSONB + typed partial-indexes (not physical columns/tables per KB) | Genericness + multi-tenant fit + cheap schema evolution; physical columns mean runtime DDL + table sprawl for marginal speed | Physical typed columns per KB; shared wide typed table; EAV |
| 2026-06-15 | Schema lifecycle = propose-and-curate (bootstrap + candidate-flagging) + promote-then-backfill | Low-friction onboarding, no silent schema drift, evolution is bounded not a migration | Manual-only (friction); auto-expand silently (drift); observe-and-store-all (bloat) |
| 2026-06-15 | Knowledge graph / graph RAG deferred | No multi-hop relationship query in the current use cases; hybrid covers single-hop; graphs are heavy + cut against inspectability; section tree + entity FKs already give graph-shaped access | Build a graph store now (premature, unjustified by use cases) |
| 2026-06-15 | Document-content versioning split into its own Incremental Re-Ingestion epic; only a light KB-schema-version stamp lives here | Chunk-diff reprocessing is a sizable separate capability; this work needs only schema-version for backfill | Bundle full doc-versioning into this work (scope creep) |
Known Issues / Tech Debt
| Issue | Severity | Notes |
|---|---|---|
| Table / line-item chunking is label-only | High | Converted pipe-tables land as text, not table; line-items aren't retrievable units. Resolution path in Tables above + ingestion-foundation S2. |
| Prose FTS = 0 (no lexical anchors) | High | Hybrid collapses to vector-only on exactly the prose that dominates Brehob + dev-memory. Per-chunk keyword metadata (S3) is the adjacent fix. |
| No deployed-vs-main confidence on the ingestion path | Med | Run the diff before building (deploy-hygiene). |
| Corpus curation cruft | Med | Deep nested-duplicate dirs, .msg/.pst/.dwg, "– copy" trees; an inclusion filter precedes ingest (curation rules, Gate-0 S7). |
Sub-system docs define architectural boundaries. The test: remove this layer and multiple unrelated features break. Structured-attribute retrieval is consumed by QuoteAI, dev-memory, and every future vertical — remove it and all three lose hard-constraint + recency retrieval. Update this doc when the retrieval contract or the chunk schema changes.