Foundry

Ingestion & Retrieval Pipeline — Sub-system Design Doc

The pipeline that turns raw documents into retrievable knowledge, plus the retrieval operators that serve it. This doc adds one new capability — structured-attribute retrieval — and the coupled pipeline changes it requires, positioned against the current (deployed + local) state. It is the architecture layer above ingestion-foundation (roadmap item 1 of brehob-launch) and the generic substrate consumed by QuoteAI (spec-match) and dev-memory (recency). → North Star B1 / B3.

Status: DRAFT. Authored 2026-06-15 out of the Gate-0 red-team, which converged on "Autri is missing exactly one retrieval capability." OD1/OD2/OD4 locked 2026-06-15; review folded in same day (lookup-vs-attribute, tables, versioning split, graph-RAG deferral, OD9–OD12). All Open Decisions are now closed (see Open Decisions below).


Risks & Constraints

Risk | Likelihood | Impact | Mitigation
Extraction fidelity is poor on the real legacy corpus (20 years of drifting .doc templates) | Med | High | Deterministic-first, LLM-fallback; the harness measures extraction accuracy per field before we trust it; tune extraction, not rearchitect.
The primitive over-fits to quotes (n=1 vertical) | Med | Med | Keep typed attributes generic on chunks; entity rollups stay in the vertical; validate on the dev-memory consumer too, not just Brehob.
Attribute extraction adds an LLM pass and blows cost | Low | Med | Piggyback the existing Haiku grouping call; deterministic route pays nothing; cost measured via migrations 013/014.
Building before measuring (we skipped the throwaway spike) | Med | Low | The capability is wanted in Autri regardless of Brehob; S0 gold + pass bars are authored before the build; incremental build with a harness checkpoint per step.
Deployed pipeline has drifted behind main | Med | Med | Run a precise deployed-vs-main diff on the ingestion path as a build-kickoff step (not a design blocker).

Overview

Where this sub-system stands today, how it got here, and what it is.

Current Status

Capability | Status
Three retrieval operators: vector (semantic), FTS (keyword), lookup (exact section id) | Shipped (deployed): retrieval/src/{vector-search,fts-search,lookup-section}.ts
Two-path routing: markdown→STRUCTURED deterministic (no LLM); prose→Haiku grouping (chunk-grouping-v3 / prose-v1) | Shipped (deployed)
Per-doc cost instrumentation (migrations 013/014): token + USD per document, vision bucket included | Shipped (deployed)
Local eval harness: per-index recall@k + MRR scorecard, query-gold discipline (eval-pipeline skill) | Shipped (local-only); see pipeline-eval-harness
Table / line-item handling: chunk_type:'table' is a label only, no parser | Greenfield
Per-chunk keyword metadata (fixes prose FTS=0) | Greenfield (ingestion-foundation S3)
Structured-attribute extraction + filter-then-rank operator (this doc) | Greenfield (the centerpiece)

The Story

The Gate-0 red-team set out to measure whether Autri's generic chunk-embed pipeline could ingest the Brehob quote corpus at retrievable fidelity. Grounding the doc's claims against the real repos flipped the premise. Three findings: (1) QuoteAI already extracts quotes into typed records (products with hp_range/cfm_range/psi_range NUMRANGE + lubrication) and retrieves with a hybrid filter-then-rank — the "numeric-filter primitive" the roadmap parked as a maybe-build already exists in the vertical. (2) The spreadsheet→formal-quote transformation is already built (Template-C parser + drafter); its per-line-item retrieval is 3/4 native to Autri (verbatim-description and analog lookups are pure vector/lookup), and the one gap is the numeric spec-match — a validation step, not the spine. (3) Mapping QuoteAI's six retrieval tools onto Autri's three operators, five are already covered; only search_equipment's structured filter is missing.

So the question stopped being "can the generic pipeline do this" and became "Autri is missing exactly one retrieval capability — structured-attribute filtering — and it is generic." A second consumer confirmed it isn't Brehob-specific: dev-memory dogfooding needs the same primitive to weight session decisions by recency (most-recent supersedes; history stays visible). The capability is therefore core substrate, and Gate-0 collapses from "spike before build" into "the eval-acceptance gate on this build."

What Is This Sub-system?

The ingestion + retrieval pipeline owns the path from a raw document to retrievable knowledge: convert → route → chunk → extract attributes → embed → index, and the operators that query the result. It exists as its own layer because every consumer (QuoteAI, dev-memory, every future vertical) plugs into the same retrieval contract; a change here ripples to all of them. This doc's addition — structured-attribute retrieval — is the fourth retrieval mode alongside vector / FTS / lookup, and the ingestion step that feeds it.


The Gap (verified against code)

Autri exposes exactly three retrieval operators and no structured-attribute filter:

  • vectorSearch — cosine over embeddings; filters only on knowledge_base_id, documentIds, chunkTypes.
  • ftsSearch — Postgres FTS; same filter surface.
  • lookupSection — exact section_id match (e.g. C7.6.2); not an attribute query.

Chunks carry text + chunk_type + section_id + embedding + bbox, but zero typed numeric/categorical attributes, and ingest extracts none. The concrete test "find equipment where 10 ≤ hp ≤ 20 AND lubrication = oilless AND cfm ≥ 90, ranked by relevance" cannot be answered today: there are no typed columns and no filter operator. (hybrid_search exists as a type name only; unimplemented.)


Target: Structured-Attribute Retrieval (two halves)

Half 1 — Generate typed attributes during ingest. At chunk time, attach typed attributes to each chunk: numeric (hp, cfm, psi), categorical (lubrication, manufacturer), temporal (date). This is the "LLM does semantics, code does mechanics" principle made concrete: read the value out of messy text, store it typed and queryable.

Half 2 — A filter-then-rank operator. A fourth retrieval mode: WHERE on typed attributes (range / equality / set / date), composed with ORDER BY embedding distance (or FTS rank). Hard constraints prune first; semantic similarity ranks the survivors — exactly QuoteAI's search_equipment shape, generalized.

Both are required: extraction without the operator is inert data; the operator without extraction has nothing to filter.
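
A minimal in-memory sketch of the two halves working together, using the concrete test from The Gap. It is not the Autri implementation: the Chunk and AttrFilter shapes, the function names, and the demo data are all hypothetical, and a real operator would push the filter into Postgres rather than scan an array.

```typescript
// Hypothetical shapes for illustration; not the Autri chunk schema.
type AttrValue = number | string;

interface Chunk {
  id: string;
  attrs: Record<string, AttrValue>;
  embedding: number[];
}

type AttrFilter =
  | { field: string; op: "range"; min?: number; max?: number }
  | { field: string; op: "eq"; value: AttrValue }
  | { field: string; op: "in"; values: AttrValue[] };

// A chunk that lacks the attribute fails the hard constraint.
function matches(chunk: Chunk, f: AttrFilter): boolean {
  const v = chunk.attrs[f.field];
  if (v === undefined) return false;
  switch (f.op) {
    case "range":
      return (
        typeof v === "number" &&
        (f.min === undefined || v >= f.min) &&
        (f.max === undefined || v <= f.max)
      );
    case "eq":
      return v === f.value;
    case "in":
      return f.values.indexOf(v) !== -1;
  }
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Prune on typed attributes first, then rank the survivors by similarity.
function filterThenRank(
  chunks: Chunk[],
  filters: AttrFilter[],
  queryEmbedding: number[],
  k: number
): Chunk[] {
  return chunks
    .filter((c) => filters.every((f) => matches(c, f)))
    .map((c) => ({ chunk: c, score: cosine(c.embedding, queryEmbedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.chunk);
}

// Made-up demo data: 10 <= hp <= 20 AND lubrication = oilless AND cfm >= 90.
const demoChunks: Chunk[] = [
  { id: "a", attrs: { hp: 15, lubrication: "oilless", cfm: 95 }, embedding: [1, 0] },
  { id: "b", attrs: { hp: 15, lubrication: "oilless", cfm: 120 }, embedding: [0.6, 0.8] },
  { id: "c", attrs: { hp: 25, lubrication: "oilless", cfm: 100 }, embedding: [1, 0] },
  { id: "d", attrs: { hp: 12, lubrication: "lubricated", cfm: 100 }, embedding: [1, 0] },
];
const demoFilters: AttrFilter[] = [
  { field: "hp", op: "range", min: 10, max: 20 },
  { field: "lubrication", op: "eq", value: "oilless" },
  { field: "cfm", op: "range", min: 90 },
];
const demoHits = filterThenRank(demoChunks, demoFilters, [1, 0], 5);
```

Chunks c and d are similar to the query but never reach the ranker, which is the point of pruning before ranking.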

Deterministic-First, LLM-Fallback Extraction

The split is not "structured docs → regex, prose → LLM." It's how machine-regular the source is — mirroring the existing chunking philosophy (code is the default; the LLM fires only where the document doesn't declare its own structure):

  • Machine-regular sources → deterministic/code extraction (exact, free, no LLM). A small family, by how the value is addressed:
    • Cell-coordinate — the value sits at a known (row, col). Template-C already does this: CFM = col 9, HP = col 11; map columns → declared attributes. Strongest case; needs the template layout.
    • Table-grid parse — a (converted) markdown pipe-table's header row gives column names, cells give values; parse the grid, map columns → attributes. (See Tables below.)
    • Labeled-pattern (regex) — a value following a regular label ("SYSTEM CAPACITY: 31.2 SCFM" → CAPACITY:\s*([\d.]+)). The brittle one; only "deterministic" where the label format is truly regular, else it's the LLM's job.
  • Variable prose → LLM extraction. The 8,600+ .doc quotes span 20 years and many templates ("PLEX HORSEPOWER" vs "HP" vs "Horsepower"; drifting units/layout). The LLM's job is "get the HP however it's labeled," piggybacked on the existing Haiku grouping call (no second pass).

Note: "structured doc → no LLM" was about chunking. For attributes, even a structured doc can need the LLM if its values live in free prose — the split is value-regularity, not doc-type. We don't guess it; the harness measures deterministic coverage and the LLM fills the remainder, scored on extraction accuracy.
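
The routing can be sketched as one function that tries a labeled pattern first and only then hands off to the LLM. This is an illustration under assumptions: the pattern table, the extractAttribute name, and the synchronous LlmExtractor stub are hypothetical (the real fallback rides the Haiku grouping call and would be asynchronous). Only the CAPACITY pattern comes from the text above.

```typescript
interface Extracted {
  value: number;
  route: "deterministic" | "llm";
}

// Hypothetical label patterns; only the CAPACITY one appears in this doc.
const LABELED_PATTERNS: Record<string, RegExp> = {
  cfm: /CAPACITY:\s*([\d.]+)/i,
  hp: /\b(?:HP|HORSEPOWER):\s*([\d.]+)/i,
};

// Stand-in for the LLM pass; returns null when it finds no value.
type LlmExtractor = (field: string, text: string) => number | null;

function extractAttribute(
  field: string,
  text: string,
  llm: LlmExtractor
): Extracted | null {
  const pattern = LABELED_PATTERNS[field];
  const m = pattern ? pattern.exec(text) : null;
  if (m) {
    const n = Number(m[1]);
    // A capture like "." parses to NaN, so it falls through to the LLM.
    if (isFinite(n)) return { value: n, route: "deterministic" };
  }
  const v = llm(field, text);
  return v === null ? null : { value: v, route: "llm" };
}

// Stub used only for the demo: pretends the LLM can read HP from any label.
const stubLlm: LlmExtractor = (field) => (field === "hp" ? 20 : null);

const fromPattern = extractAttribute("cfm", "SYSTEM CAPACITY: 31.2 SCFM", stubLlm);
const fromLlm = extractAttribute("hp", "PLEX HORSEPOWER 20", stubLlm);
const notFound = extractAttribute("psi", "no pressure stated", stubLlm);
```

Recording the route alongside the value is what would let the harness report deterministic coverage separately from LLM-filled fields.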

Tables — the densest attribute source

A table is where conversion, chunking, and attribute-extraction meet — and it's the richest attribute source in the corpus (a pricing table's CFM/PSI/HP/LIST columns are the typed attributes). The path:

  1. Convert to markdown (the conversion stage, ingestion-foundation S1) → the table in a parseable pipe-table form. Necessary, not sufficient.
  2. Grid-parse (the deterministic "table-grid" path above) → columns become typed attributes.
  3. Row-level granularity → each row becomes its own chunk carrying the column-attributes, so "the 45hp Powerex line item and its price" is retrievable — displayed as the whole table (retrieval ≠ display granularity).

Today the chunker leaves converted tables as chunk_type:'text' (no row granularity, no cell-attributes); closing that is the table-handling work in ingestion-foundation S2, and it's the precondition for line-item-level structured retrieval.
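
Steps 2 and 3 can be sketched as a small grid parser that turns each table row into its own chunk with typed attributes. Everything here is hypothetical (the RowChunk shape, the COLUMN_MAP, the sample rows), and it handles only simple pipe-tables: escaped pipes inside cells are not supported.

```typescript
interface RowChunk {
  chunkType: "table";
  text: string; // the row as written
  attrs: Record<string, number | string>;
}

// Hypothetical column -> declared attribute mapping for one KB.
const COLUMN_MAP: Record<string, { attr: string; numeric: boolean }> = {
  MODEL: { attr: "model", numeric: false },
  HP: { attr: "hp", numeric: true },
  CFM: { attr: "cfm", numeric: true },
  PSI: { attr: "psi", numeric: true },
  LIST: { attr: "list_price", numeric: true },
};

function splitRow(line: string): string[] {
  return line
    .trim()
    .replace(/^\|/, "")
    .replace(/\|$/, "")
    .split("|")
    .map((cell) => cell.trim());
}

function parsePipeTable(markdown: string): RowChunk[] {
  const lines = markdown
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.charAt(0) === "|");
  if (lines.length < 2) return [];
  const header = splitRow(lines[0]);
  // Drop the "| --- | --- |" separator row.
  const dataLines = lines.slice(1).filter((l) => !/^[\s:|-]+$/.test(l));
  return dataLines.map((line): RowChunk => {
    const cells = splitRow(line);
    const attrs: Record<string, number | string> = {};
    header.forEach((colName, i) => {
      const spec = COLUMN_MAP[colName.toUpperCase()];
      const raw = cells[i];
      if (!spec || raw === undefined || raw === "") return;
      if (spec.numeric) {
        const n = Number(raw.replace(/[$,]/g, ""));
        if (!isNaN(n)) attrs[spec.attr] = n; // unparseable cells are skipped
      } else {
        attrs[spec.attr] = raw;
      }
    });
    return { chunkType: "table", text: line, attrs };
  });
}

// Made-up sample rows, not taken from the corpus.
const demoTable = [
  "| Model | HP | CFM | PSI | List |",
  "| --- | --- | --- | --- | --- |",
  "| Example A | 45 | 150 | 125 | $12,500 |",
  "| Example B | 15 | 52.5 | 100 | $4,200 |",
].join("\n");
const rowChunks = parsePipeTable(demoTable);
```

Each row chunk carries its own attributes, so a filter on hp or list_price can select a single line item while display can still render the whole table.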

The Autri ↔ QuoteAI Boundary

Attributes live on chunks (generic, filterable) — this is Autri substrate. Entity rollups (QuoteAI's products / quotes typed tables, used for cross-corpus aggregation like "the cheapest oilless 15hp system we've quoted") stay in the vertical. The trade: per-chunk attributes answer "find content matching these specs" cleanly; entity-level aggregation/dedup is weaker per-chunk and belongs to the vertical that needs it. This line keeps the substrate generic and the vertical thin.

Recency & Supersession (the dev-memory consumer)

The dev-memory dogfood needs more than date > X: when a decision flips across sessions, the current session must see both the standing decision and the history, with the most-recent treated as authoritative. Resolved 2026-06-15 (ingestion-foundation S5 red-team): this is served by the chunk-level date attribute (a first-class typed attribute) + recency as a rank-boost, not by supersession. History stays retrievable, just ranked lower.

Hard supersession is a separate, document-level concept: superseded_at lives on documents, not chunks (an earlier draft of this section said "on chunks" — corrected), and retrieval already honors it via includeSuperseded?. It's for a replaced document version, not a down-ranked older decision. So the operator needs no new chunk-level temporal model — recency-rank for "newest wins, history visible", document-level supersession for replaced versions. Designing it with Brehob's spec-match and dev-memory's recency in view is what keeps it generic rather than quote-shaped.
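
One way recency-as-rank-boost could look, as a sketch only. The doc does not specify a boost formula; the exponential half-life decay, the default halfLifeDays and weight values, and the Scored shape are all assumptions chosen for illustration.

```typescript
interface Scored {
  id: string;
  relevance: number; // score from the vector or FTS ranker
  date: string; // ISO date from the chunk-level date attribute
}

// Assumed formula: score = relevance * (1 + weight * 0.5^(ageDays / halfLifeDays)).
// Nothing is filtered out; older hits are only ranked lower.
function recencyBoost(
  hits: Scored[],
  now: Date,
  halfLifeDays: number = 30,
  weight: number = 0.5
): Scored[] {
  const dayMs = 86400000;
  return hits
    .map((h) => {
      const ageDays = Math.max(0, (now.getTime() - Date.parse(h.date)) / dayMs);
      const boost = 1 + weight * Math.pow(0.5, ageDays / halfLifeDays);
      return { hit: h, score: h.relevance * boost };
    })
    .sort((a, b) => b.score - a.score)
    .map((x) => x.hit);
}

// Made-up decisions: the older one is slightly more relevant on text alone.
const demoDecisions: Scored[] = [
  { id: "old-decision", relevance: 0.8, date: "2026-03-01" },
  { id: "new-decision", relevance: 0.78, date: "2026-06-14" },
];
const boosted = recencyBoost(demoDecisions, new Date("2026-06-15T00:00:00Z"));
```

With these numbers the newer decision moves to the top while the older one stays in the result list, which matches "newest wins, history visible".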


Schema Lifecycle

Defining the schema and evolving it are the same flow — the system proposes, the user curates — at different times. JSONB storage (OD1) is what makes this cheap.

  1. Bootstrap (KB creation). The user doesn't hand-author a typed schema cold. On the first batch of docs, the extractor proposes a candidate schema (a townhouse-purchase KB → purchase_price, closing_date, address, loan_amount, interest_rate); the user curates — accept / rename / prune — in the UI. System-controlled KBs (dev-memory) may declare directly. The user curates; they don't author from scratch.
  2. Steady state. Docs with values for already-declared attributes are just extracted. Because extraction targets the declared schema, synonyms collapse ("sq ft" → declared square_footage); only genuinely new concepts escape.
  3. Expansion. A doc introducing a new attribute is flagged as a candidate (rides the existing call — "also saw property_tax_rate = 1.2%"); the schema does not silently expand. The system surfaces "seen property_tax_rate in N docs — promote to a filterable field?" Noise stays out until the user opts in.
  4. Promote-then-backfill. On promote: (a) the field is added to the KB schema; (b) a typed partial-index is created (online, no table change); (c) a targeted backfill re-extracts just that attribute across existing docs (bounded, single-attribute, rides Batch + incremental re-ingestion) and fills the JSONB. No full re-ingestion — and the docs were fully retrievable by vector/FTS throughout; promotion only adds filterability.
  5. Versioning (two kinds, don't conflate). This work needs only a light KB-schema version stamp — so "which docs predate this attribute" is answerable for promote-then-backfill. Full document-content versioning (a doc re-uploaded/edited) is a separate, larger capability — the Incremental Re-Ingestion epic — that this work does not block on (it comes after, and composes: re-ingesting a changed doc only re-extracts changed units' attributes).
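
A sketch of steps 2 through 5 as plain data handling: declared attributes are stored, undeclared ones are only tallied as candidates, and promotion bumps the KB-schema version. All names (KbSchema, CandidateTally, recordExtraction, promotionCandidates, promote) and the sample values are hypothetical.

```typescript
interface KbSchema {
  version: number; // light KB-schema version stamp
  fields: string[]; // declared, filterable attributes
}

// candidate field -> ids of the docs that mentioned it
type CandidateTally = Record<string, string[]>;

// Store declared attributes; tally everything else instead of storing it.
function recordExtraction(
  schema: KbSchema,
  tally: CandidateTally,
  docId: string,
  observed: Record<string, number | string>
): Record<string, number | string> {
  const stored: Record<string, number | string> = {};
  for (const field of Object.keys(observed)) {
    if (schema.fields.indexOf(field) !== -1) {
      stored[field] = observed[field];
      continue;
    }
    let docs = tally[field];
    if (!docs) {
      docs = [];
      tally[field] = docs;
    }
    if (docs.indexOf(docId) === -1) docs.push(docId);
  }
  return stored;
}

// Fields seen in at least minDocs docs are surfaced to the user.
function promotionCandidates(tally: CandidateTally, minDocs: number): string[] {
  return Object.keys(tally).filter((f) => tally[f].length >= minDocs);
}

// Promotion adds the field and bumps the version; docs stamped with an
// older version are the ones a single-attribute backfill would target.
function promote(schema: KbSchema, tally: CandidateTally, field: string): KbSchema {
  delete tally[field];
  return { version: schema.version + 1, fields: schema.fields.concat(field) };
}

// Made-up townhouse-purchase KB.
const schemaV1: KbSchema = { version: 1, fields: ["purchase_price", "closing_date"] };
const tally: CandidateTally = {};
const storedDoc1 = recordExtraction(schemaV1, tally, "doc-1", {
  purchase_price: 400000,
  property_tax_rate: 1.2,
});
recordExtraction(schemaV1, tally, "doc-2", {
  closing_date: "2024-05-01",
  property_tax_rate: 1.1,
});
const candidates = promotionCandidates(tally, 2);
const schemaV2 = promote(schemaV1, tally, "property_tax_rate");
```

The schema only changes inside promote, which is called after the user opts in, so nothing expands silently.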

The JSONB payoff: schema evolution is a metadata + index + bounded-backfill operation, not a table migration + full re-extract.
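
For the index half of that operation, here is a sketch of how the per-field DDL might be generated. The table name chunks, the JSONB column name attrs, and the index naming scheme are assumptions; only knowledge_base_id appears in this doc. Date fields are left out because text-to-date casts are not immutable in Postgres and so cannot be used directly in an index expression.

```typescript
// Builds a typed partial expression-index over one JSONB attribute,
// scoped to one KB. Table and column names are assumed, not confirmed.
function partialIndexDdl(
  kbId: string,
  field: string,
  cast: "numeric" | "text"
): string {
  // Interpolating into DDL, so reject anything that is not a plain identifier.
  if (!/^[a-z_][a-z0-9_]*$/.test(field)) {
    throw new Error("unsafe field name: " + field);
  }
  if (!/^[A-Za-z0-9-]+$/.test(kbId)) {
    throw new Error("unsafe KB id: " + kbId);
  }
  const expr =
    cast === "numeric"
      ? `((attrs->>'${field}')::numeric)`
      : `(attrs->>'${field}')`;
  const indexName = `chunks_attr_${field}_${kbId.replace(/-/g, "_")}`;
  return (
    `CREATE INDEX CONCURRENTLY IF NOT EXISTS ${indexName} ` +
    `ON chunks (${expr}) ` +
    `WHERE knowledge_base_id = '${kbId}' AND attrs ? '${field}';`
  );
}

const hpIndexDdl = partialIndexDdl("kb1", "hp", "numeric");
const lubricationIndexDdl = partialIndexDdl("kb1", "lubrication", "text");
```

CONCURRENTLY is what keeps index creation online. Long KB ids can push the index name past Postgres's identifier length limit, so a real naming scheme would need to shorten them.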


Architecture

The internal shape of the pipeline and where it interfaces outward.

Architecture Diagram

 raw doc ─▶ convert ─▶ route(docClass probe) ─┬─ STRUCTURED ─▶ deterministic chunk ─┐
                                              │                                      │
                                              └─ PROSE ──────▶ Haiku grouping ───────┤
                                                                                     ▼
                                              ┌──────── attribute extraction ────────┐
                                              │  deterministic (cell / grid / pattern) ─┐
                                              │  LLM fallback (rides Haiku call) ───────┴─▶ typed attrs
                                              └──────────────────────────────────────────┘
                                                                                     ▼
                                                  chunk { text, type, section_id, attrs, embedding }

 ─ retrieval ─────────────────────────────────────────────────────────────────────────
   CONTENT SEARCH      query ─▶ rank by ┬─ vector (semantic)
                                        └─ fts (keyword)

   STRUCTURED ACCESS
     • attribute filter (NEW):  WHERE attrs (range/eq/in/date) ─prunes─▶ rank by [ vector | fts | recency ]
                                (the filter is an optional pre-stage; omit it = today's behavior — see OD12)
     • lookup:  by section_id — hierarchy traversal + document order, no ranking

System Boundary

Inside: conversion, routing, chunking, attribute extraction, embedding, the chunk schema (incl. typed attributes), and the four retrieval operators. Outside: the eval harness (grades this layer, doesn't own it — pipeline-eval-harness); the vertical's entity rollups + drafter (QuoteAI); deploy/tenancy (enterprise-deploy). The one requirement this layer imposes outward: consumers declare the typed attributes they care about (the per-KB attribute schema).

Key Interfaces

Interface | Type | Consumers
filterRankSearch(kb, {attrFilters, query, recency?, k}) | Function (NEW retrieval op) | QuoteAI spec-match; dev-memory recency; future verticals
Per-chunk typed attributes: JSONB + per-field typed partial-indexes (OD1) | Schema (chunk column) | The operator; the harness scorer
Per-KB attribute schema (declared typed fields + lifecycle) | Config | Ingest extraction; validation; index creation; each consuming library
Attribute-extraction stage (deterministic-first, LLM-fallback) | Pipeline stage | Rides the existing extractor; gated by the harness
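
To make the per-KB attribute schema's validation role concrete, here is a sketch of a write-time check. The FieldSpec shape, the three field types, and the validateAttrs name are hypothetical; the doc only states that the declared schema validates on write.

```typescript
type FieldType = "number" | "category" | "date";

interface FieldSpec {
  type: FieldType;
  allowed?: string[]; // optional closed set for categorical fields
}

type AttrSchema = Record<string, FieldSpec>;

// Returns a list of problems; an empty list means the attrs may be written.
function validateAttrs(
  schema: AttrSchema,
  attrs: Record<string, unknown>
): string[] {
  const errors: string[] = [];
  for (const key of Object.keys(attrs)) {
    const spec = schema[key];
    const v = attrs[key];
    if (!spec) {
      errors.push(key + ": not declared in the KB schema");
    } else if (spec.type === "number") {
      if (typeof v !== "number" || !isFinite(v)) {
        errors.push(key + ": expected a finite number");
      }
    } else if (spec.type === "date") {
      if (
        typeof v !== "string" ||
        !/^\d{4}-\d{2}-\d{2}$/.test(v) ||
        isNaN(Date.parse(v))
      ) {
        errors.push(key + ": expected an ISO date (YYYY-MM-DD)");
      }
    } else if (typeof v !== "string") {
      errors.push(key + ": expected a string category");
    } else if (spec.allowed && spec.allowed.indexOf(v) === -1) {
      errors.push(key + ": value not in the allowed set");
    }
  }
  return errors;
}

// Made-up schema for an equipment KB.
const equipmentSchema: AttrSchema = {
  hp: { type: "number" },
  lubrication: { type: "category", allowed: ["oilless", "lubricated"] },
  date: { type: "date" },
};
const okErrors = validateAttrs(equipmentSchema, {
  hp: 20,
  lubrication: "oilless",
  date: "2019-07-19",
});
const badErrors = validateAttrs(equipmentSchema, { hp: "20", voltage: 460 });
```

Rejecting the string "20" for hp at write time is what keeps the typed indexes and range filters trustworthy later.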

Retrieval Operators — Lookup vs Structured-Attribute

The four operators answer four different questions:

Operator | The question | Keys on | Returns
Vector | "find content that means this" | semantic similarity | ranked, fuzzy
FTS | "find content that says these words" | exact lexical tokens | ranked, by keyword
Lookup | "fetch the content located at this address" | the document's declared structure (section_id) | exact, document order, hierarchy-aware
Structured-attribute | "find content whose extracted properties satisfy these constraints" | derived facts (hp, cfm, date…) | filter-then-rank

The boundary that matters: lookup keys on the document's intrinsic identity (the address the author assigned; exact, ordered, hierarchy-aware — it walks the sections tree), while structured-attribute keys on facts we derived (ranges, sets, comparisons + ranking). Given structure vs derived facts; retrieve by address vs retrieve by property. They even shine on different doc types — lookup on highly-structured docs with real addresses (regs, contracts), attribute-filter on fact-laden docs you query by value (quotes, pricing sheets).

Could the attribute operator absorb lookup? Partly — section_id is itself a categorical attribute, so exact-match lookup is expressible as WHERE section_id = X. But lookup uniquely adds (1) hierarchy traversal (ask for C7, get its whole subtree — needs the section tree, not flat equality) and (2) document-order contiguous return (the section as written, no relevance ranking). So lookup isn't redundant — it's the structural-address pattern.
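
The difference from flat equality shows up in a small sketch of subtree lookup. It assumes dotted section ids encode the hierarchy (C7.6.2 sits under C7.6, which sits under C7) and that each chunk carries a document-order position; the SectionChunk shape and function names are hypothetical, and the real lookup walks the section tree rather than matching prefixes.

```typescript
interface SectionChunk {
  sectionId: string;
  order: number; // position in the source document
  text: string;
}

// "C7" contains "C7" and "C7.6.2" but not "C70": the dot guards the prefix.
function isInSubtree(sectionId: string, root: string): boolean {
  return sectionId === root || sectionId.indexOf(root + ".") === 0;
}

// Returns the whole subtree in document order, with no relevance ranking.
function lookupSubtree(chunks: SectionChunk[], root: string): SectionChunk[] {
  return chunks
    .filter((c) => isInSubtree(c.sectionId, root))
    .sort((a, b) => a.order - b.order);
}

// Made-up sections, deliberately out of order.
const demoSections: SectionChunk[] = [
  { sectionId: "C7.6.2", order: 4, text: "..." },
  { sectionId: "C70", order: 9, text: "..." },
  { sectionId: "C7", order: 1, text: "..." },
  { sectionId: "C8", order: 7, text: "..." },
  { sectionId: "C7.6", order: 3, text: "..." },
  { sectionId: "C7.1", order: 2, text: "..." },
];
const subtree = lookupSubtree(demoSections, "C7");
```

A plain WHERE section_id = 'C7' would return only the first of these four chunks.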

The section tree is also the graph-shaped thing: a containment tree (one-relationship graph), and QuoteAI's entity FKs are a second small graph. A full knowledge graph / graph RAG is deferred (see Decisions Log) — it earns its keep only on multi-hop relationship queries we don't have yet; when one appears, it belongs in the entity-rollup layer, not the chunk substrate.

→ Open: OD11 (make section_id a built-in hierarchical attribute, unifying lookup under the filter?) and OD12 (is structured-attribute a distinct operator or a composable pre-filter prepended to vector/fts/recency?).


Eval Integration (the harness is the gate)

This capability is a first-class harness citizen, which is why we can build it before fully measuring it. Two new axes on the existing per-index scorecard:

  1. Extraction accuracy — precision/recall of extracted typed values vs a hand-labeled attribute gold (did we get hp=20, cfm=31.2, lubrication=oilless, date=2019-07-19?). Cheap to score; no retrieval needed.
  2. Filter-then-rank recall — add a structured-filter index to the per-index gold (alongside vector/FTS/lookup), with attribute-filter queries.

Discipline carries over from the harness: baseline-first (delta vs current), per-type floors, significance over point estimates (small-n noise floor stated). An Eval run gates the merge.
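
The first axis can be sketched as a per-field precision/recall scorer. This is an illustration, not the harness's scorer: the AttrMap shape, the exact-equality match, and the convention of reporting 0 when a denominator is empty are assumptions, and gold and predicted are assumed to be aligned by index.

```typescript
type AttrMap = Record<string, number | string>;

interface FieldScore {
  precision: number;
  recall: number;
}

// A wrong value counts as both a false positive and a false negative.
function scoreExtraction(
  gold: AttrMap[],
  predicted: AttrMap[]
): Record<string, FieldScore> {
  const counts: Record<string, { tp: number; fp: number; fn: number }> = {};
  const bump = (field: string, key: "tp" | "fp" | "fn"): void => {
    let c = counts[field];
    if (!c) {
      c = { tp: 0, fp: 0, fn: 0 };
      counts[field] = c;
    }
    c[key] += 1;
  };

  gold.forEach((g, i) => {
    const p: AttrMap = predicted[i] || {};
    const fields = Object.keys(g).concat(
      Object.keys(p).filter((f) => !(f in g))
    );
    for (const f of fields) {
      const inGold = f in g;
      const inPred = f in p;
      if (inGold && inPred && g[f] === p[f]) {
        bump(f, "tp");
      } else {
        if (inPred) bump(f, "fp");
        if (inGold) bump(f, "fn");
      }
    }
  });

  const scores: Record<string, FieldScore> = {};
  for (const f of Object.keys(counts)) {
    const { tp, fp, fn } = counts[f];
    scores[f] = {
      precision: tp + fp === 0 ? 0 : tp / (tp + fp),
      recall: tp + fn === 0 ? 0 : tp / (tp + fn),
    };
  }
  return scores;
}

// Made-up gold and predictions for two documents.
const goldAttrs: AttrMap[] = [
  { hp: 20, cfm: 31.2, lubrication: "oilless" },
  { hp: 15, cfm: 52 },
];
const predictedAttrs: AttrMap[] = [
  { hp: 20, cfm: 31.2 },
  { hp: 10, cfm: 52, lubrication: "oilless" },
];
const fieldScores = scoreExtraction(goldAttrs, predictedAttrs);
```

Scoring per field rather than per document is what would support the per-type floors mentioned above: here cfm is perfect while hp and lubrication are not.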


Open Decisions (red-team targets)

Status 2026-06-15: all decisions closed. OD1/OD2/OD4 locked; OD3, OD5–OD8 settled; OD9–OD12 locked on review. Ready for the ingestion-foundation epic refinement + red/blue-team.

# | Decision | Resolution
OD1 | Storage shape | LOCKED: per-KB declared schema drives JSONB storage + typed partial expression-indexes (one per declared filterable field, scoped by knowledge_base_id), with write-time validation. Physical typed-columns-per-KB rejected: runtime DDL + table-per-KB sprawl fights the one-table row-level tenancy model, for a marginal query-speed gain.
OD2 | Extraction contract | LOCKED: the per-KB declared attribute schema is the control plane: it targets extraction, validates on write, and defines the filterable surface.
OD3 | Granularity / boundary | Settled: attributes on chunks → Autri; entity rollups → the vertical.
OD4 | Operator surface | LOCKED: hard filter (WHERE) then rank (embed/FTS); recency an optional boost on the survivors.
OD5 | Extraction cost/route | Settled: piggyback the Haiku grouping call; deterministic route is free; measure via 013/014.
OD6 | Deterministic-vs-LLM routing | Settled: deterministic-first, LLM-fallback; harness measures the split.
OD7 | Supersession model | Settled: build on superseded_at; recency is a rank signal, history stays visible.
OD8 | Eval shape | Settled: extraction-accuracy gold + a structured-filter index in the scorecard.
OD9 | Candidate-flagging vs observe-everything | LOCKED → candidate-flagging. Extraction targets the declared schema and cheaply flags candidate new attributes (rides the existing call) → promote-then-backfill. Observe-and-store-everything rejected (speculative extraction + JSONB bloat).
OD10 | Onboarding UX | LOCKED → propose-and-curate. LLM proposes a schema from the first docs; the user accepts/renames/prunes; manual declaration also supported (system KBs like dev-memory). Manual-only = friction; fully-auto = drift.
OD11 | section_id as a built-in attribute? | LOCKED → keep distinct, share storage. Lookup stays its own operator (its hierarchy + document-order semantics differ); it shares the typed-attribute plumbing but is not collapsed into the filter.
OD12 | Operator vs composable pre-filter | LOCKED → composable pre-filter. Structured-attribute is a WHERE pre-stage that prepends to the vector/fts/recency rankers (one filter, reused across rankers), not a standalone 4th operator.

Epic | Doc | Status | Summary
Ingestion Foundation | ingestion-foundation | Refined (6/15) | The work breakdown, refined to executable story-level detail 2026-06-15 (nine dependency-ordered stories, S0–S8): the eval gold + pass bars, the schema / JSONB / partial-index substrate, the keyword + typed-attribute extraction stage, the filter-then-rank operator, and the schema-curation UI.
Gate-0 Corpus Spike | gate-0-corpus-spike | Re-scoped | No longer a spike-before-build; becomes the eval acceptance of this capability on the real corpus (Slate Trucks + slice), folded into ingestion-foundation.
QuoteAI Vertical | quoteai-vertical | Planned | Consumes the operator for spec-match; keeps its entity rollups + drafter. Output ⑥ (numeric primitive) resolved here.
Dev-Memory | dev-memory | Planned | Second consumer: recency/supersession over session transcripts. ⚠️ Confirm the supersession grain: superseded_at is document-level today, which may be the right grain for episode-documents.
Incremental Re-Ingestion | (to be written) | Planned (after) | Document-content versioning: detect a re-uploaded doc, chunk-diff by content_hash (content-based, not positional), re-process only changed units. Composes with attribute extraction (re-extract only changed units). The S6 single-attribute backfill does not depend on it.

Cross-Cutting Concerns

Concern | How This Sub-system Is Affected
Cost (D16/D18) | Deterministic extraction is free; LLM extraction rides the existing Haiku call; Batch economics (ingestion-foundation S4) apply. Measured via 013/014.
Multi-tenancy (D13) | Attributes are per-KB; the per-KB attribute schema is the tenant-scoped declaration; partial-indexes are KB-scoped.
LLM-semantics / code-mechanics | The governing principle: LLM extracts the value (and proposes the schema), code stores/filters/indexes it typed.
Local CI/CD for agentic coding | Extraction-accuracy scoring is deterministic → fits the local gate; filter-then-rank recall is Eval-mode (needs Postgres + embeddings).

Decisions Log

Date | Decision | Rationale | Alternatives Considered
2026-06-15 | Autri's one retrieval gap is structured-attribute filtering; build it as a generic 4th mode | 5 of QuoteAI's 6 tools already map to Autri's 3 operators; the 6th (search_equipment) is the only gap, and it's generic | Lift QuoteAI's whole typed schema into Autri (premature, n=1); ship Brehob on QuoteAI as-is (re-opens two-codebases)
2026-06-15 | Gate-0 merges into this build; the eval harness is the gate | The architectural de-risking is done by inspection; remaining unknowns need a built thing to measure; the capability is wanted regardless of Brehob | Throwaway spike first (wasteful, since the build is wanted anyway)
2026-06-15 | Attribute extraction is deterministic-first, LLM-fallback | Spreadsheets/forms are machine-regular (free, exact); 20-year prose is too variable for regex | Regex-everything (brittle on prose); LLM-everything (cost, and pointless on cell-regular sources)
2026-06-15 | Typed attributes live on chunks (Autri); entity rollups stay in the vertical | Keeps the substrate generic and the vertical thin; avoids designing the abstraction from one example | Entity tables in Autri (domain-coupled substrate)
2026-06-15 | Design the operator for both spec-match and recency/supersession | Two real consumers (Brehob, dev-memory) keep it generic, not quote-shaped | Quote-only filter (would under-serve dev-memory and bake in quote assumptions)
2026-06-15 | Typed attributes = per-KB-declared schema over JSONB + typed partial-indexes (not physical columns/tables per KB) | Genericness + multi-tenant fit + cheap schema evolution; physical columns mean runtime DDL + table sprawl for marginal speed | Physical typed columns per KB; shared wide typed table; EAV
2026-06-15 | Schema lifecycle = propose-and-curate (bootstrap + candidate-flagging) + promote-then-backfill | Low-friction onboarding, no silent schema drift, evolution is bounded not a migration | Manual-only (friction); auto-expand silently (drift); observe-and-store-all (bloat)
2026-06-15 | Knowledge graph / graph RAG deferred | No multi-hop relationship query in the current use cases; hybrid covers single-hop; graphs are heavy + cut against inspectability; section tree + entity FKs already give graph-shaped access | Build a graph store now (premature, unjustified by use cases)
2026-06-15 | Document-content versioning split into its own Incremental Re-Ingestion epic; only a light KB-schema-version stamp lives here | Chunk-diff reprocessing is a sizable separate capability; this work needs only schema-version for backfill | Bundle full doc-versioning into this work (scope creep)

Known Issues / Tech Debt

Issue | Severity | Notes
Table / line-item chunking is label-only | High | Converted pipe-tables land as text, not table; line-items aren't retrievable units. Resolution path in Tables above + ingestion-foundation S2.
Prose FTS = 0 (no lexical anchors) | High | Hybrid collapses to vector-only on exactly the prose that dominates Brehob + dev-memory. Per-chunk keyword metadata (S3) is the adjacent fix.
No deployed-vs-main confidence on the ingestion path | Med | Run the diff before building (deploy-hygiene).
Corpus curation cruft | Med | Deep nested-duplicate dirs, .msg/.pst/.dwg, "– copy" trees; an inclusion filter precedes ingest (curation rules, Gate-0 S7).

Sub-system docs define architectural boundaries. The test: remove this layer and multiple unrelated features break. Structured-attribute retrieval is consumed by QuoteAI, dev-memory, and every future vertical — remove it and all three lose hard-constraint + recency retrieval. Update this doc when the retrieval contract or the chunk schema changes.
