Ingestion & Retrieval Pipeline — Sub-system Design Doc

The pipeline that turns raw documents into retrievable knowledge, plus the retrieval operators that serve it. This doc adds one new capability — structured-attribute retrieval — and the coupled pipeline changes it requires, positioned against the current (deployed + local) state. It is the architecture layer above ingestion-foundation (roadmap item 1 of brehob-launch) and the generic substrate consumed by QuoteAI (spec-match) and dev-memory (recency). → North Star B1 / B3.

Status: DRAFT. Authored 2026-06-15 out of the Gate-0 red-team, which converged on "Autri is missing exactly one retrieval capability." OD1/OD2/OD4 locked 2026-06-15; review folded in the same day (lookup-vs-attribute, tables, versioning split, graph-RAG deferral, OD9–OD12). All Open Decisions are now closed; see the Open Decisions table for resolutions.

Risks & Constraints

Risk	Likelihood	Impact	Mitigation
Extraction fidelity is poor on the real legacy corpus (20 years of drifting `.doc` templates)	Med	High	Deterministic-first, LLM-fallback; the harness measures extraction accuracy per field before we trust it; tune extraction, not rearchitect.
The primitive over-fits to quotes (n=1 vertical)	Med	Med	Keep typed attributes generic on chunks; entity rollups stay in the vertical; validate on the dev-memory consumer too, not just Brehob.
Attribute extraction adds an LLM pass and blows up cost	Low	Med	Piggyback the existing Haiku grouping call; deterministic route pays nothing; cost measured via migrations 013/014.
Building before measuring (we skipped the throwaway spike)	Med	Low	The capability is wanted in Autri regardless of Brehob; S0 gold + pass bars are authored before the build; incremental build with a harness checkpoint per step.
Deployed pipeline has drifted behind `main`	Med	Med	Run a precise deployed-vs-`main` diff on the ingestion path as a build-kickoff step (not a design blocker).

Overview

Where this sub-system stands today, how it got here, and what it is.

Current Status

Capability	Status
Three retrieval operators — vector (semantic), FTS (keyword), lookup (exact section id)	Shipped (deployed) — `retrieval/src/{vector-search,fts-search,lookup-section}.ts`
Two-path routing — markdown→STRUCTURED deterministic (no LLM); prose→Haiku grouping (`chunk-grouping-v3` / `prose-v1`)	Shipped (deployed)
Per-doc cost instrumentation (migrations 013/014) — token + USD per document, vision bucket included	Shipped (deployed)
Local eval harness — per-index recall@k + MRR scorecard, query-gold discipline (`eval-pipeline` skill)	Shipped (local-only) — see pipeline-eval-harness
Table / line-item handling — `chunk_type:'table'` is a label only, no parser	Greenfield
Per-chunk keyword metadata (fixes prose FTS=0)	Greenfield — ingestion-foundation S3
Structured-attribute extraction + filter-then-rank operator (this doc)	Greenfield — the centerpiece

The Story

The Gate-0 red-team set out to measure whether Autri's generic chunk-embed pipeline could ingest the Brehob quote corpus at retrievable fidelity. Grounding the doc's claims against the real repos flipped the premise. Three findings: (1) QuoteAI already extracts quotes into typed records (products with hp_range/cfm_range/psi_range NUMRANGE + lubrication) and retrieves with a hybrid filter-then-rank — the "numeric-filter primitive" the roadmap parked as a maybe-build already exists in the vertical. (2) The spreadsheet→formal-quote transformation is already built (Template-C parser + drafter); its per-line-item retrieval is 3/4 native to Autri (verbatim-description and analog lookups are pure vector/lookup), and the one gap is the numeric spec-match — a validation step, not the spine. (3) Mapping QuoteAI's six retrieval tools onto Autri's three operators, five are already covered; only search_equipment's structured filter is missing.

So the question "can the generic pipeline do this?" gave way to a finding: Autri is missing exactly one retrieval capability, structured-attribute filtering, and that capability is generic. A second consumer confirmed it isn't Brehob-specific: dev-memory dogfooding needs the same primitive to weight session decisions by recency (the most recent decision wins; history stays visible). The capability is therefore core substrate, and Gate-0 collapses from "spike before build" into "the eval-acceptance gate on this build."

What Is This Sub-system?

The ingestion + retrieval pipeline owns the path from a raw document to retrievable knowledge: convert → route → chunk → extract attributes → embed → index, and the operators that query the result. It exists as its own layer because every consumer (QuoteAI, dev-memory, every future vertical) plugs into the same retrieval contract; a change here ripples to all of them. This doc's addition — structured-attribute retrieval — is the fourth retrieval mode alongside vector / FTS / lookup, and the ingestion step that feeds it.

The Gap (verified against code)

Autri exposes exactly three retrieval operators and no structured-attribute filter:

- vectorSearch: cosine over embeddings; filters only on knowledge_base_id, documentIds, chunkTypes.
- ftsSearch: Postgres FTS; same filter surface.
- lookupSection: exact section_id match (e.g. C7.6.2); not an attribute query.

Chunks carry text + chunk_type + section_id + embedding + bbox — zero typed numeric/categorical attributes, and ingest extracts none. The concrete test "find equipment where 10 ≤ hp ≤ 20 AND lubrication = oilless AND cfm ≥ 90, ranked by relevance" cannot be answered today — no typed columns, no filter operator. (hybrid_search exists as a type name only; unimplemented.)

Target: Structured-Attribute Retrieval (two halves)

Half 1 — Generate typed attributes during ingest. At chunk time, attach typed attributes to each chunk: numeric (hp, cfm, psi), categorical (lubrication, manufacturer), temporal (date). This is the "LLM does semantics, code does mechanics" principle made concrete: read the value out of messy text, store it typed and queryable.

Half 2 — A filter-then-rank operator. A fourth retrieval mode: WHERE on typed attributes (range / equality / set / date), composed with ORDER BY embedding distance (or FTS rank). Hard constraints prune first; semantic similarity ranks the survivors — exactly QuoteAI's search_equipment shape, generalized.

Both are required: extraction without the operator is inert data; the operator without extraction has nothing to filter.
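
The filter-then-rank shape can be sketched in memory, without a database. This is a minimal illustration under assumptions: the `Chunk` and `AttrFilter` types and the `filterThenRank` function are hypothetical names, not the repo's actual API, and cosine similarity stands in for whichever ranker is configured.

```typescript
// Hypothetical shapes for illustration; not the repo's actual types.
type AttrValue = number | string;

interface Chunk {
  id: string;
  attrs: Record<string, AttrValue>;
  embedding: number[];
}

type AttrFilter =
  | { field: string; op: "range"; min?: number; max?: number }
  | { field: string; op: "eq"; value: AttrValue }
  | { field: string; op: "in"; values: AttrValue[] };

// A chunk that lacks the attribute fails a hard constraint on it.
function matches(chunk: Chunk, f: AttrFilter): boolean {
  const v: AttrValue | undefined = chunk.attrs[f.field];
  if (v === undefined) return false;
  switch (f.op) {
    case "range":
      return (
        typeof v === "number" &&
        (f.min === undefined || v >= f.min) &&
        (f.max === undefined || v <= f.max)
      );
    case "eq":
      return v === f.value;
    case "in":
      return f.values.some((x) => x === v);
  }
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Hard constraints prune first; similarity ranks only the survivors.
function filterThenRank(
  chunks: Chunk[],
  filters: AttrFilter[],
  queryEmbedding: number[],
  k: number,
): Chunk[] {
  return chunks
    .filter((c) => filters.every((f) => matches(c, f)))
    .map((c) => ({ c, score: cosine(c.embedding, queryEmbedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.c);
}
```

The concrete test from The Gap maps onto three filters: a `range` on hp (10 to 20), an `eq` on lubrication, and a `range` on cfm with only `min` set.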

Deterministic-First, LLM-Fallback Extraction

The split is not "structured docs → regex, prose → LLM." It's how machine-regular the source is — mirroring the existing chunking philosophy (code is the default; the LLM fires only where the document doesn't declare its own structure):

Machine-regular sources → deterministic/code extraction (exact, free, no LLM). A small family, by how the value is addressed:
- Cell-coordinate — the value sits at a known (row, col). Template-C already does this: CFM = col 9, HP = col 11; map columns → declared attributes. Strongest case; needs the template layout.
- Table-grid parse — a (converted) markdown pipe-table's header row gives column names, cells give values; parse the grid, map columns → attributes. (See Tables below.)
- Labeled-pattern (regex) — a value following a regular label ("SYSTEM CAPACITY: 31.2 SCFM" → CAPACITY:\s*([\d.]+)). The brittle one; only "deterministic" where the label format is truly regular, else it's the LLM's job.
Variable prose → LLM extraction. The 8,600+ .doc quotes span 20 years and many templates ("PLEX HORSEPOWER" vs "HP" vs "Horsepower"; drifting units/layout). The LLM's job is "get the HP however it's labeled," piggybacked on the existing Haiku grouping call (no second pass).

Note: "structured doc → no LLM" was about chunking. For attributes, even a structured doc can need the LLM if its values live in free prose — the split is value-regularity, not doc-type. We don't guess it; the harness measures deterministic coverage and the LLM fills the remainder, scored on extraction accuracy.
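
A minimal sketch of the deterministic-first, LLM-fallback order for one numeric attribute. Everything named here is an assumption for illustration: `LABELED_PATTERNS`, `extractNumeric`, and the `llmFallback` callback (which stands in for the value the Haiku grouping call would return) are hypothetical, as is mapping the "SYSTEM CAPACITY" label onto a `cfm` attribute. The fallback is synchronous only to keep the sketch short.

```typescript
// Hypothetical names throughout; the callback stands in for the LLM pass.
type Extracted = { value: number; route: "deterministic" | "llm" };

// Labeled patterns per declared attribute. Only truly regular labels belong here.
const LABELED_PATTERNS: Record<string, RegExp> = {
  cfm: /CAPACITY:\s*([\d.]+)/,
};

function extractNumeric(
  field: string,
  text: string,
  llmFallback: (field: string, text: string) => number | null,
): Extracted | null {
  const pattern: RegExp | undefined = LABELED_PATTERNS[field];
  const m = pattern ? pattern.exec(text) : null;
  if (m) {
    const n = Number(m[1]);
    // A match that does not parse as a number (e.g. "..") falls through to the LLM.
    if (Number.isFinite(n)) return { value: n, route: "deterministic" };
  }
  const v = llmFallback(field, text);
  return v === null ? null : { value: v, route: "llm" };
}
```

Recording the `route` alongside the value is what lets a harness report deterministic coverage separately from LLM-filled values.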

Tables — the densest attribute source

A table is where conversion, chunking, and attribute-extraction meet — and it's the richest attribute source in the corpus (a pricing table's CFM/PSI/HP/LIST columns are the typed attributes). The path:

Convert to markdown (the conversion stage, ingestion-foundation S1) → the table in a parseable pipe-table form. Necessary, not sufficient.
Grid-parse (the deterministic "table-grid" path above) → columns become typed attributes.
Row-level granularity → each row becomes its own chunk carrying the column-attributes, so "the 45hp Powerex line item and its price" is retrievable — displayed as the whole table (retrieval ≠ display granularity).

Today the chunker leaves converted tables as chunk_type:'text' (no row granularity, no cell-attributes); closing that is the table-handling work in ingestion-foundation S2, and it's the precondition for line-item-level structured retrieval.
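
The grid-parse and row-level steps can be sketched as follows. This is an illustrative sketch, not the planned parser: `parsePipeTable` and `splitRow` are hypothetical names, the set of numeric columns is assumed to come from the declared per-KB schema, and escaped pipes inside cells are not handled.

```typescript
type Row = Record<string, string | number>;

// Split one pipe-table line into trimmed cells, dropping the outer pipes.
function splitRow(line: string): string[] {
  return line
    .trim()
    .replace(/^\|/, "")
    .replace(/\|$/, "")
    .split("|")
    .map((c) => c.trim());
}

// numericColumns: lowercased header names that the (hypothetical) KB schema
// declares as numeric. Each returned Row is what one row-level chunk would carry.
function parsePipeTable(markdown: string, numericColumns: Set<string>): Row[] {
  const lines = markdown
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.startsWith("|"));
  if (lines.length < 2) return [];
  const headers = splitRow(lines[0]).map((h) => h.toLowerCase());
  // lines[1] is the |---|---| delimiter row, so data starts at index 2.
  return lines.slice(2).map((line) => {
    const cells = splitRow(line);
    const row: Row = {};
    headers.forEach((h, i) => {
      const raw = cells[i] ?? "";
      const n = Number(raw.replace(/[$,]/g, ""));
      row[h] = numericColumns.has(h) && raw !== "" && Number.isFinite(n) ? n : raw;
    });
    return row;
  });
}
```

Cells that fail to parse as numbers are kept as raw strings rather than dropped, so a later validation step can flag them instead of silently losing the value.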

The Autri ↔ QuoteAI Boundary

Attributes live on chunks (generic, filterable) — this is Autri substrate. Entity rollups (QuoteAI's products / quotes typed tables, used for cross-corpus aggregation like "the cheapest oilless 15hp system we've quoted") stay in the vertical. The trade: per-chunk attributes answer "find content matching these specs" cleanly; entity-level aggregation/dedup is weaker per-chunk and belongs to the vertical that needs it. This line keeps the substrate generic and the vertical thin.

Recency & Supersession (the dev-memory consumer)

The dev-memory dogfood needs more than date > X: when a decision flips across sessions, the current session must see both the standing decision and the history, with the most-recent treated as authoritative. Resolved 2026-06-15 (ingestion-foundation S5 red-team): this is served by the chunk-level date attribute (a first-class typed attribute) + recency as a rank-boost — not by supersession. History stays retrievable, just ranked lower.

Hard supersession is a separate, document-level concept: superseded_at lives on documents, not chunks (an earlier draft of this section said "on chunks" — corrected), and retrieval already honors it via includeSuperseded?. It's for a replaced document version, not a down-ranked older decision. So the operator needs no new chunk-level temporal model — recency-rank for "newest wins, history visible", document-level supersession for replaced versions. Designing it with Brehob's spec-match and dev-memory's recency in view is what keeps it generic rather than quote-shaped.
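
Recency as a rank-boost, as opposed to a filter, might look like the sketch below. The scoring formula is an assumption made for illustration (additive boost with exponential half-life decay); the doc does not specify one, and `rankWithRecency`, `weight`, and `halfLifeDays` are hypothetical names.

```typescript
interface ScoredHit {
  id: string;
  similarity: number; // score from the vector or FTS ranker
  date: string; // the chunk-level date attribute, ISO format
}

// 1.0 for a chunk dated today, 0.5 at one half-life, approaching 0 with age.
function recencyBoost(ageDays: number, halfLifeDays: number): number {
  return Math.pow(0.5, Math.max(0, ageDays) / halfLifeDays);
}

// Nothing is filtered out: older hits stay in the result, just ranked lower.
function rankWithRecency(
  hits: ScoredHit[],
  now: Date,
  weight = 0.2,
  halfLifeDays = 30,
): ScoredHit[] {
  const MS_PER_DAY = 86400000;
  const score = (h: ScoredHit): number => {
    const ageDays = (now.getTime() - Date.parse(h.date)) / MS_PER_DAY;
    return h.similarity + weight * recencyBoost(ageDays, halfLifeDays);
  };
  return [...hits].sort((a, b) => score(b) - score(a));
}
```

Because the boost is bounded by `weight`, a much more relevant older chunk can still outrank a weakly relevant recent one, which keeps history visible in practice and not only in principle.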

Schema Lifecycle

Defining the schema and evolving it are the same flow — the system proposes, the user curates — at different times. JSONB storage (OD1) is what makes this cheap.

Bootstrap (KB creation). The user doesn't hand-author a typed schema cold. On the first batch of docs, the extractor proposes a candidate schema (a townhouse-purchase KB → purchase_price, closing_date, address, loan_amount, interest_rate); the user curates — accept / rename / prune — in the UI. System-controlled KBs (dev-memory) may declare directly. The user curates; they don't author from scratch.
Steady state. Docs with values for already-declared attributes are just extracted. Because extraction targets the declared schema, synonyms collapse ("sq ft" → declared square_footage); only genuinely new concepts escape.
Expansion. A doc introducing a new attribute is flagged as a candidate (rides the existing call — "also saw property_tax_rate = 1.2%"); the schema does not silently expand. The system surfaces "seen property_tax_rate in N docs — promote to a filterable field?" Noise stays out until the user opts in.
Promote-then-backfill. On promote: (a) the field is added to the KB schema; (b) a typed partial-index is created (online, no table change); (c) a targeted backfill re-extracts just that attribute across existing docs (bounded, single-attribute, rides Batch + incremental re-ingestion) and fills the JSONB. No full re-ingestion — and the docs were fully retrievable by vector/FTS throughout; promotion only adds filterability.
Versioning (two kinds, don't conflate). This work needs only a light KB-schema version stamp — so "which docs predate this attribute" is answerable for promote-then-backfill. Full document-content versioning (a doc re-uploaded/edited) is a separate, larger capability — the Incremental Re-Ingestion epic — that this work does not block on (it comes after, and composes: re-ingesting a changed doc only re-extracts changed units' attributes).

The JSONB payoff: schema evolution is a metadata + index + bounded-backfill operation, not a table migration + full re-extract.
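
Step (b) of promote-then-backfill, creating the typed partial-index, could be generated from the promoted field roughly as below. This is a sketch under stated assumptions: the table name `chunks`, the JSONB column name `attrs`, UUID-shaped KB ids, and the `partialIndexDdl` helper are all hypothetical; only `knowledge_base_id` comes from this doc. Date fields are left out because their cast needs more care inside an index expression.

```typescript
type IndexableType = "numeric" | "text";

// Builds a Postgres partial expression-index statement for one promoted field.
// Assumed names: table `chunks`, JSONB column `attrs`.
function partialIndexDdl(kbId: string, field: string, type: IndexableType): string {
  // Index predicates cannot take bind parameters, so both inputs are
  // validated strictly before being interpolated into the statement.
  if (!/^[a-z][a-z0-9_]*$/.test(field)) throw new Error(`invalid field name: ${field}`);
  if (!/^[0-9a-f-]{36}$/i.test(kbId)) throw new Error(`invalid KB id: ${kbId}`);

  const expr =
    type === "numeric" ? `((attrs->>'${field}')::numeric)` : `(attrs->>'${field}')`;
  const name = `idx_chunks_${kbId.replace(/-/g, "").slice(0, 8)}_${field}`;

  // CONCURRENTLY builds the index without blocking writes to the table.
  return (
    `CREATE INDEX CONCURRENTLY IF NOT EXISTS ${name} ON chunks (${expr}) ` +
    `WHERE knowledge_base_id = '${kbId}' AND attrs ? '${field}'`
  );
}
```

The `attrs ? 'field'` predicate keeps the index limited to chunks that actually carry the attribute, so it stays small while the backfill is still filling values in.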

Architecture

The internal shape of the pipeline and where it interfaces outward.

Architecture Diagram

 raw doc ─▶ convert ─▶ route(docClass probe) ─┬─ STRUCTURED ─▶ deterministic chunk ─┐
                                              │                                      │
                                              └─ PROSE ──────▶ Haiku grouping ───────┤
                                                                                     ▼
                                              ┌──────── attribute extraction ────────┐
                                              │  deterministic (cell / grid / pattern) ─┐
                                              │  LLM fallback (rides Haiku call) ───────┴─▶ typed attrs
                                              └──────────────────────────────────────────┘
                                                                                     ▼
                                                  chunk { text, type, section_id, attrs, embedding }

 ─ retrieval ─────────────────────────────────────────────────────────────────────────
   CONTENT SEARCH      query ─▶ rank by ┬─ vector (semantic)
                                        └─ fts (keyword)

   STRUCTURED ACCESS
     • attribute filter (NEW):  WHERE attrs (range/eq/in/date) ─prunes─▶ rank by [ vector | fts | recency ]
                                (the filter is an optional pre-stage; omit it = today's behavior — see OD12)
     • lookup:  by section_id — hierarchy traversal + document order, no ranking

System Boundary

Inside: conversion, routing, chunking, attribute extraction, embedding, the chunk schema (incl. typed attributes), and the four retrieval operators. Outside: the eval harness (grades this layer, doesn't own it — pipeline-eval-harness); the vertical's entity rollups + drafter (QuoteAI); deploy/tenancy (enterprise-deploy). The one requirement this layer imposes outward: consumers declare the typed attributes they care about (the per-KB attribute schema).

Key Interfaces

Interface	Type	Consumers
`filterRankSearch(kb, {attrFilters, query, recency?, k})`	Function (NEW retrieval op)	QuoteAI spec-match; dev-memory recency; future verticals
Per-chunk typed attributes — JSONB + per-field typed partial-indexes (OD1)	Schema (chunk column)	The operator; the harness scorer
Per-KB attribute schema (declared typed fields + lifecycle)	Config	Ingest extraction; validation; index creation; each consuming library
Attribute-extraction stage (deterministic-first, LLM-fallback)	Pipeline stage	Rides the existing extractor; gated by the harness
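
One way the `attrFilters` argument of `filterRankSearch` might translate into SQL is a parameterized `WHERE` fragment over the JSONB column. This is a hedged sketch: the `AttrFilter` shape, the `buildWhere` helper, and the column name `attrs` are assumptions, and the `$n` placeholders follow the Postgres positional-parameter convention.

```typescript
// Hypothetical filter shape; the doc only names range / equality / set / date.
type AttrFilter =
  | { field: string; op: "range"; min?: number; max?: number }
  | { field: string; op: "eq"; value: string }
  | { field: string; op: "in"; values: string[] };

function buildWhere(
  kbId: string,
  filters: AttrFilter[],
): { sql: string; params: unknown[] } {
  const params: unknown[] = [kbId];
  const clauses: string[] = ["knowledge_base_id = $1"];
  // Values always go through bind parameters; only field names are inlined.
  const bind = (v: unknown): string => {
    params.push(v);
    return `$${params.length}`;
  };

  for (const f of filters) {
    if (!/^[a-z][a-z0-9_]*$/.test(f.field)) throw new Error(`invalid field: ${f.field}`);
    const num = `(attrs->>'${f.field}')::numeric`;
    const txt = `(attrs->>'${f.field}')`;
    if (f.op === "range") {
      if (f.min !== undefined) clauses.push(`${num} >= ${bind(f.min)}`);
      if (f.max !== undefined) clauses.push(`${num} <= ${bind(f.max)}`);
    } else if (f.op === "eq") {
      clauses.push(`${txt} = ${bind(f.value)}`);
    } else {
      clauses.push(`${txt} = ANY(${bind(f.values)})`);
    }
  }
  return { sql: clauses.join(" AND "), params };
}
```

The caller would append the fragment to the ranker's existing query, followed by its own `ORDER BY` and `LIMIT`. For an expression index to be usable, the filter expression generally has to match the indexed expression, which is one reason to generate both from the same declared schema.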

Retrieval Operators — Lookup vs Structured-Attribute

The four operators answer four different questions:

Operator	The question	Keys on	Returns
Vector	"find content that means this"	semantic similarity	ranked, fuzzy
FTS	"find content that says these words"	exact lexical tokens	ranked, by keyword
Lookup	"fetch the content located at this address"	the document's declared structure (`section_id`)	exact, document order, hierarchy-aware
Structured-attribute	"find content whose extracted properties satisfy these constraints"	derived facts (hp, cfm, date…)	filter-then-rank

The boundary that matters: lookup keys on the document's intrinsic identity (the address the author assigned; exact, ordered, and hierarchy-aware, since it walks the sections tree), while structured-attribute keys on facts we derived (ranges, sets, comparisons + ranking). Put differently, lookup retrieves by address over given structure, and structured-attribute retrieves by property over derived facts. They also suit different doc types: lookup fits highly-structured docs with real addresses (regs, contracts), and attribute-filter fits fact-laden docs you query by value (quotes, pricing sheets).

Could the attribute operator absorb lookup? Partly — section_id is itself a categorical attribute, so exact-match lookup is expressible as WHERE section_id = X. But lookup uniquely adds (1) hierarchy traversal (ask for C7, get its whole subtree — needs the section tree, not flat equality) and (2) document-order contiguous return (the section as written, no relevance ranking). So lookup isn't redundant — it's the structural-address pattern.
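
The difference between flat equality and hierarchy traversal is small in code but real in behavior. The sketch below assumes dotted section ids encode containment (as in the `C7.6.2` example) and that each chunk carries a document-order position; `SectionChunk` and `lookupSubtree` are hypothetical names, and the deployed operator may walk an explicit section tree instead.

```typescript
interface SectionChunk {
  sectionId: string;
  order: number; // position in the source document
  text: string;
}

// Returns the section and its whole subtree, in document order, unranked.
// "C7" matches "C7" and "C7.6.2" but not "C70": descendants extend the id with a dot.
function lookupSubtree(chunks: SectionChunk[], rootId: string): SectionChunk[] {
  return chunks
    .filter((c) => c.sectionId === rootId || c.sectionId.startsWith(rootId + "."))
    .sort((a, b) => a.order - b.order);
}
```

A plain `WHERE section_id = 'C7'` would return only the first of those chunks, which is the gap items (1) and (2) describe.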

The section tree is also the graph-shaped thing: a containment tree (one-relationship graph), and QuoteAI's entity FKs are a second small graph. A full knowledge graph / graph RAG is deferred (see Decisions Log) — it earns its keep only on multi-hop relationship queries we don't have yet; when one appears, it belongs in the entity-rollup layer, not the chunk substrate.

→ Resolved (see Open Decisions): OD11 keeps lookup a distinct operator that shares the typed-attribute plumbing, so section_id is not folded into the filter; OD12 makes structured-attribute a composable pre-filter prepended to the vector/fts/recency rankers, not a standalone operator.

Eval Integration (the harness is the gate)

This capability is a first-class harness citizen, which is why we can build it before fully measuring it. Two new axes on the existing per-index scorecard:

Extraction accuracy — precision/recall of extracted typed values vs a hand-labeled attribute gold (did we get hp=20, cfm=31.2, lubrication=oilless, date=2019-07-19?). Cheap to score; no retrieval needed.
Filter-then-rank recall — add a structured-filter index to the per-index gold (alongside vector/FTS/lookup), with attribute-filter queries.

Discipline carries over from the harness: baseline-first (delta vs current), per-type floors, significance over point estimates (small-n noise floor stated). An Eval run gates the merge.
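
The extraction-accuracy axis reduces to comparing extracted attribute maps against the hand-labeled gold. A minimal sketch, assuming exact-match scoring and micro-averaging across all fields; `scoreExtraction` is a hypothetical name and the harness may score per field or with numeric tolerances instead.

```typescript
type Attrs = Record<string, string | number>;

// gold[i] and predicted[i] describe the same chunk.
// A predicted value counts as correct only if the field is in gold and the value matches exactly.
function scoreExtraction(
  gold: Attrs[],
  predicted: Attrs[],
): { precision: number; recall: number } {
  let truePositives = 0;
  let predictedCount = 0;
  let goldCount = 0;
  gold.forEach((g, i) => {
    const p: Attrs = predicted[i] ?? {};
    goldCount += Object.keys(g).length;
    predictedCount += Object.keys(p).length;
    for (const key of Object.keys(p)) {
      if (key in g && g[key] === p[key]) truePositives++;
    }
  });
  return {
    precision: predictedCount === 0 ? 0 : truePositives / predictedCount,
    recall: goldCount === 0 ? 0 : truePositives / goldCount,
  };
}
```

With the gold from the example above (hp=20, cfm=31.2, lubrication=oilless, date=2019-07-19) and an extraction that gets hp and cfm right, lubrication wrong, and misses the date, precision is 2/3 and recall is 2/4.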

Open Decisions (red-team targets)

Status 2026-06-15: all decisions closed. OD1/OD2/OD4 locked; OD3, OD5–OD8 settled; OD9–OD12 locked on review. Ready for the ingestion-foundation epic refinement + red/blue-team.

#	Decision	Resolution
OD1	Storage shape	LOCKED: per-KB declared schema drives JSONB storage + typed partial expression-indexes (one per declared filterable field, scoped by `knowledge_base_id`), with write-time validation. Physical typed-columns-per-KB rejected: runtime DDL + table-per-KB sprawl fights the one-table row-level tenancy model, for a marginal query-speed gain.
OD2	Extraction contract	LOCKED — the per-KB declared attribute schema is the control plane: it targets extraction, validates on write, and defines the filterable surface.
OD3	Granularity / boundary	Settled — attributes on chunks → Autri; entity rollups → the vertical.
OD4	Operator surface	LOCKED — hard filter (`WHERE`) then rank (embed/FTS); recency an optional boost on the survivors.
OD5	Extraction cost/route	Settled — piggyback the Haiku grouping call; deterministic route is free; measure via 013/014.
OD6	Deterministic-vs-LLM routing	Settled — deterministic-first, LLM-fallback; harness measures the split.
OD7	Supersession model	Settled — build on `superseded_at`; recency is a rank signal, history stays visible.
OD8	Eval shape	Settled — extraction-accuracy gold + a structured-filter index in the scorecard.
OD9	Candidate-flagging vs observe-everything	LOCKED → candidate-flagging. Extraction targets the declared schema and cheaply flags candidate new attributes (rides the existing call) → promote-then-backfill. Observe-and-store-everything rejected (speculative extraction + JSONB bloat).
OD10	Onboarding UX	LOCKED → propose-and-curate. LLM proposes a schema from the first docs; the user accepts/renames/prunes; manual declaration also supported (system KBs like dev-memory). Manual-only = friction; fully-auto = drift.
OD11	`section_id` as a built-in attribute?	LOCKED → keep distinct, share storage. Lookup stays its own operator (its hierarchy + document-order semantics differ); it shares the typed-attribute plumbing but is not collapsed into the filter.
OD12	Operator vs composable pre-filter	LOCKED → composable pre-filter. Structured-attribute is a `WHERE` pre-stage that prepends to the vector/fts/recency rankers (one filter, reused across rankers), not a standalone 4th operator.
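
The write-time validation named in OD1/OD2 can be sketched as a check of each extracted attribute map against the declared per-KB schema. The `KbSchema` shape, the three field types, and `validateAttrs` are hypothetical; the doc fixes only that the declared schema validates on write.

```typescript
type FieldType = "number" | "string" | "date";
type KbSchema = Record<string, FieldType>;

// Returns a list of problems; an empty list means the attrs may be written.
function validateAttrs(schema: KbSchema, attrs: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const field of Object.keys(attrs)) {
    const declared: FieldType | undefined = schema[field];
    const value = attrs[field];
    if (declared === undefined) {
      // Undeclared fields are rejected here; candidate flagging is a separate path.
      errors.push(`${field}: not declared in the KB schema`);
      continue;
    }
    let ok: boolean;
    if (declared === "number") {
      ok = typeof value === "number" && Number.isFinite(value);
    } else if (declared === "date") {
      // Assumes ISO YYYY-MM-DD storage for dates.
      ok = typeof value === "string" && /^\d{4}-\d{2}-\d{2}$/.test(value);
    } else {
      ok = typeof value === "string";
    }
    if (!ok) errors.push(`${field}: expected ${declared}`);
  }
  return errors;
}
```

Keeping JSONB values typed at write time is what makes the casts in filters and expression indexes safe to rely on later.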

Related Epics

Epic	Doc	Status	Summary
Ingestion Foundation	ingestion-foundation	Refined (6/15)	The work breakdown — refined to executable story-level detail 2026-06-15 (nine dependency-ordered stories, S0–S8): the eval gold + pass bars, the schema / JSONB / partial-index substrate, the keyword + typed-attribute extraction stage, the filter-then-rank operator, and the schema-curation UI.
Gate-0 Corpus Spike	gate-0-corpus-spike	Re-scoped	No longer a spike-before-build; becomes the eval acceptance of this capability on the real corpus (Slate Trucks + slice), folded into ingestion-foundation.
QuoteAI Vertical	quoteai-vertical	Planned	Consumes the operator for spec-match; keeps its entity rollups + drafter. Output ⑥ (numeric primitive) resolved here.
Dev-Memory	dev-memory	Planned	Second consumer — recency/supersession over session transcripts. ⚠️ Confirm the supersession grain: `superseded_at` is document-level today, which may be the right grain for episode-documents.
Incremental Re-Ingestion	(to be written)	Planned (after)	Document-content versioning: detect a re-uploaded doc, chunk-diff by `content_hash` (content-based, not positional), re-process only changed units. Composes with attribute extraction (re-extract only changed units). The S6 single-attribute backfill does not depend on it.
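
The content-based (not positional) chunk-diff planned for Incremental Re-Ingestion can be illustrated with a set comparison on `content_hash`. This is a sketch of the idea only, since that epic's doc is not yet written; `HashedChunk` and `diffByContentHash` are hypothetical names, and the hashes are assumed to be precomputed.

```typescript
interface HashedChunk {
  content_hash: string;
  text: string;
}

// Compares two versions of a document by chunk hash, ignoring position.
function diffByContentHash(
  previous: HashedChunk[],
  next: HashedChunk[],
): { unchanged: HashedChunk[]; toProcess: HashedChunk[]; toRemove: HashedChunk[] } {
  const prevHashes = new Set(previous.map((c) => c.content_hash));
  const nextHashes = new Set(next.map((c) => c.content_hash));
  return {
    // Already embedded and attribute-extracted; reuse as-is.
    unchanged: next.filter((c) => prevHashes.has(c.content_hash)),
    // New or edited units: the only ones needing embedding + attribute extraction.
    toProcess: next.filter((c) => !prevHashes.has(c.content_hash)),
    // Present before, gone now.
    toRemove: previous.filter((c) => !nextHashes.has(c.content_hash)),
  };
}
```

Inserting a paragraph at the top of a document shifts every later chunk's position, but with hash-based matching only the inserted chunk lands in `toProcess`.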

Cross-Cutting Concerns

Concern	How This Sub-system Is Affected
Cost (D16/D18)	Deterministic extraction is free; LLM extraction rides the existing Haiku call; Batch economics (ingestion-foundation S4) apply. Measured via 013/014.
Multi-tenancy (D13)	Attributes are per-KB; the per-KB attribute schema is the tenant-scoped declaration; partial-indexes are KB-scoped.
LLM-semantics / code-mechanics	The governing principle: LLM extracts the value (and proposes the schema), code stores/filters/indexes it typed.
Local CI/CD for agentic coding	Extraction-accuracy scoring is deterministic → fits the local gate; filter-then-rank recall is Eval-mode (needs Postgres + embeddings).

Decisions Log

Date	Decision	Rationale	Alternatives Considered
2026-06-15	Autri's one retrieval gap is structured-attribute filtering; build it as a generic 4th mode	5 of QuoteAI's 6 tools already map to Autri's 3 operators; the 6th (`search_equipment`) is the only gap, and it's generic	Lift QuoteAI's whole typed schema into Autri (premature, n=1); ship Brehob on QuoteAI as-is (re-opens two-codebases)
2026-06-15	Gate-0 merges into this build; the eval harness is the gate	The architectural de-risking is done by inspection; remaining unknowns need a built thing to measure; the capability is wanted regardless of Brehob	Throwaway spike first (wasteful — the build is wanted anyway)
2026-06-15	Attribute extraction is deterministic-first, LLM-fallback	Spreadsheets/forms are machine-regular (free, exact); 20-year prose is too variable for regex	Regex-everything (brittle on prose); LLM-everything (cost, and pointless on cell-regular sources)
2026-06-15	Typed attributes live on chunks (Autri); entity rollups stay in the vertical	Keeps the substrate generic and the vertical thin; avoids designing the abstraction from one example	Entity tables in Autri (domain-coupled substrate)
2026-06-15	Design the operator for both spec-match and recency/supersession	Two real consumers (Brehob, dev-memory) keep it generic, not quote-shaped	Quote-only filter (would under-serve dev-memory and bake in quote assumptions)
2026-06-15	Typed attributes = per-KB-declared schema over JSONB + typed partial-indexes (not physical columns/tables per KB)	Genericness + multi-tenant fit + cheap schema evolution; physical columns mean runtime DDL + table sprawl for marginal speed	Physical typed columns per KB; shared wide typed table; EAV
2026-06-15	Schema lifecycle = propose-and-curate (bootstrap + candidate-flagging) + promote-then-backfill	Low-friction onboarding, no silent schema drift, evolution is bounded not a migration	Manual-only (friction); auto-expand silently (drift); observe-and-store-all (bloat)
2026-06-15	Knowledge graph / graph RAG deferred	No multi-hop relationship query in the current use cases; hybrid covers single-hop; graphs are heavy + cut against inspectability; section tree + entity FKs already give graph-shaped access	Build a graph store now (premature, unjustified by use cases)
2026-06-15	Document-content versioning split into its own Incremental Re-Ingestion epic; only a light KB-schema-version stamp lives here	Chunk-diff reprocessing is a sizable separate capability; this work needs only schema-version for backfill	Bundle full doc-versioning into this work (scope creep)

Known Issues / Tech Debt

Issue	Severity	Notes
Table / line-item chunking is label-only	High	Converted pipe-tables land as `text`, not `table`; line-items aren't retrievable units. Resolution path in Tables above + ingestion-foundation S2.
Prose FTS = 0 (no lexical anchors)	High	Hybrid collapses to vector-only on exactly the prose that dominates Brehob + dev-memory. Per-chunk keyword metadata (S3) is the adjacent fix.
No deployed-vs-`main` confidence on the ingestion path	Med	Run the diff before building (deploy-hygiene).
Corpus curation cruft	Med	Deep nested-duplicate dirs, `.msg`/`.pst`/`.dwg`, "– copy" trees; an inclusion filter precedes ingest (curation rules, Gate-0 S7).

Sub-system docs define architectural boundaries. The test: remove this layer and multiple unrelated features break. Structured-attribute retrieval is consumed by QuoteAI, dev-memory, and every future vertical — remove it and all three lose hard-constraint + recency retrieval. Update this doc when the retrieval contract or the chunk schema changes.
