Foundry

Pipeline Evaluation & Optimization Harness — Sub-system Design Doc

An LLM-free-in-the-gate evaluation + optimization platform over the ingestion pipeline (parse → structure → segment → chunk → embed → retrieve). It scores any pipeline configuration against curated, span-anchored gold-standard output on two axes, quality and cost, and surfaces the Pareto frontier so we can operate at the knee: lowest cost at acceptable quality. Its corpus grows along two dimensions over time (documents and queries), turning unforeseen edge cases into permanent regression guards.

Status: DRAFT v2 — for review + red-team/blue-team. Authored 2026-06-02.


Risks & Constraints

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Gold-standard authoring is a slog and stalls the effort | Med | High | Curate-by-correction (emit current output → hand-fix → commit). Seed query gold from section headings. Start boundary-only, 3 docs. |
| Heuristics over-fit to the corpus (great in-sample, fail in the wild) | Med | Med | Grow corpus with adversarial edge cases; keep a hold-out set never used for tuning; score per doc-class and per question-type, not just aggregate. |
| Optimizer cheats into cheap-but-bad configs | Med | High | The objective is min-cost subject to a hard quality floor (per question-type). No floor → no optimization. |
| Reprocessing cost when an upgrade fixes quality for existing customers | High | Med | Versioned + targeted + incremental reprocessing (reuse cached upstream stages); prioritize docs below the floor + paying tiers. |
| Retrieval-eval mistaken as nondeterministic |  |  | Resolved: measure span-level recall@k (deterministic), not LLM-judged answer quality. Pin the embedding-model version. |
| Copyright / repo bloat as the corpus grows | Med | Low | Goldens carry ids/spans/counts only, no body text; source PDFs stay in gitignored cache/; compact golden JSON. |
| In-app feedback corpus uses tenant data without rights | Med | High | Explicit opt-in consent; privacy-preserving fallback (aggregate signal + synthesized equivalents) when rights are absent. |

Current Status

| Capability | Status |
| --- | --- |
| L0: segment.ts probe + section segmenter (classification + atoms) | Shipped (2026-06-02) |
| L0: segment.test.ts golden harness: snapshot + assertions, 12 docs / 4 classes, <1s, in pre-commit gate | Shipped |
| L1: Boundary-set gold (span-anchored) + boundary P/R/F1 scorer | Planned |
| L2: Region labels (tables/figures) + detection scorer | Planned |
| L3: Retrieval gold (query → answer span) + recall@k scorer (embeddings) | Planned |
| Query-type taxonomy + per-type scorecards/floors | Planned |
| Config-parameterized pipeline + A/B frontier sweep (Eval mode) | Planned |
| Composite pipeline-version fingerprint + targeted/incremental reprocessing | Planned |
| In-app quality feedback → corpus (consented) | Planned (later) |

The Story

This session began as extraction-cost work (#49 — gate page images to figure/sparse pages). Diagnosing it pulled us into chunking: on FIA Section C the chunks came out page-sized. Root cause was the structure stage under-segmenting — it keys headings on font size, but FIA/STEM section headers render at the same height as body text, so every line classifies as body and coalesces into one page-sized atom. The real boundary signal is the document's own indentation + numbering, which the pipeline ignored.

A deterministic spike proved we can recover the full section hierarchy (C1–C18, 910+ sections) from geometry alone, and that documents fall into a few structural classes (STRUCTURED / FLAT-VERSE / PROSE) separable by cheap, doc-relative signals (sequentiality, header coverage, indent gap). That produced segment.ts and a golden-corpus harness.

The pivotal realization: the integration design questions (granularity, table/figure handling, routing, code-vs-LLM) can't be answered by argument, only by measurement. As we talked it through, the idea grew from a regression harness into a pipeline optimization platform: parameterize the pipeline, evaluate configs on quality + cost, and operate at the frontier knee. Chunking quality is the Autri wedge (the inspector's whole value is legible, correct extraction); this is the instrument that drives it, keeps it from regressing, and self-improves as documents and real user queries flow in.

What Is This Sub-system?

A development-time evaluation + optimization platform over the ingestion pipeline. It owns a growing corpus of real + synthetic documents, curated span-anchored gold-standard outputs at escalating fidelity (boundaries → regions → retrieval), a query gold set organized by question-type, and the scorers that grade any pipeline configuration on quality and cost. It runs in two modes: a fast deterministic Gate (pre-commit) and an on-demand Eval that sweeps configs and emits the (quality, cost) frontier. It exists as its own layer because every chunking, segmentation, routing, and reprocessing decision downstream depends on it as the source of truth.


Chunking Philosophy: the LLM is a Fallback, not the Engine

The governing principle, settled by reasoning + the spike:

  • Deterministic code is the default, and for structured technical docs it is not a compromise — it's more correct. A reg author already encoded the semantic units in the numbering (C4.1 Minimum mass is one rule, deliberately). We read that structure; we don't pay an LLM to re-infer it.
  • The unit is the authored section/subsection, not the line. "Every line = a chunk" overshoots into fragments: embeddings become underspecified and a single answer smears across chunks (the heading line and the number line separate). The semantic unit — the subsection (regs) or paragraph (prose) — is the target.
  • The hierarchy is the asset. The segmenter recovers the whole tree (C3 → C3.5 → C3.5.1), which enables small-to-big retrieval: embed fine for precise matching, return the parent section for complete context. This may dissolve the granularity dilemma entirely (embed at the semantic unit, display at the readable unit).
  • Selective, per-region LLM triggering. The router decides code | llm | vision per region, not per document. A STRUCTURED doc is mostly code-chunked sections, with the LLM firing only on the spans that earn it. This is what lets cost track quality.
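
A minimal sketch of the per-region routing idea in the last bullet, assuming each region already carries a few precomputed signals. Only the code | llm | vision split and the #49 needs_vision signal come from this doc; the type names, the declaresBoundaries flag, and the minProbeConfidence threshold are hypothetical and stand in for whatever the real router consumes.

```typescript
// Illustrative only: field names and the threshold are assumptions.
type Route = "code" | "llm" | "vision";

interface RegionSignals {
  regionId: string;
  needsVision: boolean;        // e.g. the #49 needs_vision signal
  declaresBoundaries: boolean; // region has authored numbering/indentation
  probeConfidence: number;     // 0..1, hypothetical probe output
}

interface RouterConfig {
  minProbeConfidence: number;  // below this, fall back to the LLM path
}

/** Decide the route for one region, not for the whole document. */
function routeRegion(region: RegionSignals, config: RouterConfig): Route {
  if (region.needsVision) return "vision";
  const confident = region.probeConfidence >= config.minProbeConfidence;
  return region.declaresBoundaries && confident ? "code" : "llm";
}

/** Count routes per document so the LLM share stays visible. */
function routeCounts(regions: RegionSignals[], config: RouterConfig): Record<Route, number> {
  const counts: Record<Route, number> = { code: 0, llm: 0, vision: 0 };
  for (const region of regions) counts[routeRegion(region, config)] += 1;
  return counts;
}
```

Because the decision is a pure function of the region signals and the config, the same document routed under two configs yields two comparable route tallies, which is what lets cost be modeled per config.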

Where the LLM is genuinely irreplaceable — all cases where the document does not declare its own boundaries:

  1. Unstructured prose (novels, proposals) — ideal chunks span paragraphs in ways only semantics sees.
  2. Tables & figures — non-linear meaning; code detects the region, the model renders it to retrievable text (overlaps #49 needs_vision).
  3. Edge repair — a giant authored section with no sub-numbering that's really many ideas (rare).

Where it is not needed (often mistaken as LLM-only): content inside one authored section stays together for free; syntactic continuation ("next paragraph starts mid-sentence, lowercase, no terminal punctuation") is a deterministic merge the pipeline already does. The LLM is reserved for semantic grouping in unstructured text — a far smaller surface than it first appears. Even then, the LLM decides boundaries; code does the mechanics.
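
The syntactic-continuation merge can be expressed as a small pure function. This is a sketch under one reading of the rule (the previous paragraph ends without terminal punctuation and the next one starts with a lowercase letter); the function names are hypothetical and the pipeline's existing merge may use different signals.

```typescript
// Hypothetical helper names; the real pipeline's merge logic may differ.

/** True when `next` looks like the tail of a sentence begun in `prev`. */
function isContinuation(prev: string, next: string): boolean {
  const prevTrimmed = prev.trimEnd();
  const nextTrimmed = next.trimStart();
  if (prevTrimmed === "" || nextTrimmed === "") return false;
  // Previous paragraph has no terminal punctuation (optionally followed by a closing quote/bracket)...
  const endsOpen = !/[.!?:;]["')\]]?$/.test(prevTrimmed);
  // ...and the next one starts with a lowercase ASCII letter.
  const startsLower = /^[a-z]/.test(nextTrimmed);
  return endsOpen && startsLower;
}

/** Merge each continuation paragraph into the paragraph before it. */
function mergeContinuations(paragraphs: string[]): string[] {
  const merged: string[] = [];
  for (const para of paragraphs) {
    const last = merged.length > 0 ? merged[merged.length - 1] : undefined;
    if (last !== undefined && isContinuation(last, para)) {
      merged[merged.length - 1] = `${last.trimEnd()} ${para.trimStart()}`;
    } else {
      merged.push(para);
    }
  }
  return merged;
}
```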


The Objective: Minimize Cost Under a Quality Floor

"Maximize profit at high quality" becomes precise as follows: the pipeline controls cost (LLM spend); revenue is set by pricing (D18); so the objective is to minimize cost subject to quality ≥ a floor.

  • The quality/cost frontier has a knee — a point past which more spend buys almost no quality. That knee is a data fact, not a taste call (the objective is objective, per Dan). We operate at the knee.
  • The quality floor is a hard constraint and must be a number (e.g. retrieval recall@k thresholds, set per question-type). Without it, "minimize cost" trivially → cheapest garbage.
  • Asymmetry principle: chunking cost is one-time (ingestion); chunking quality is forever (every query the doc serves). Below the floor, never trade quality for the ingestion saving — the saving is paid once, bad answers recur indefinitely.

The harness's job: draw the frontier and locate the knee; the floor guards the optimizer from cheating.
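
As a sketch of that decision rule: given evaluated configs and per-type floors, pick the cheapest config that clears every floor, and refuse to optimize when no floor is set. The EvalPoint shape and function names are assumptions for illustration, not an existing harness API.

```typescript
// Hypothetical shapes: one evaluated config = one point on the (quality, cost) plane.
interface EvalPoint {
  configId: string;
  costUsd: number;
  recallByType: Record<string, number>; // recall@k per question-type
}

type Floors = Record<string, number>;

/** A point is admissible only if it clears the floor for every question-type. */
function meetsFloors(point: EvalPoint, floors: Floors): boolean {
  return Object.keys(floors).every(
    (qType) => (point.recallByType[qType] ?? 0) >= (floors[qType] ?? 0),
  );
}

/** Cheapest admissible point; undefined means no config clears the floor. */
function pickOperatingPoint(points: EvalPoint[], floors: Floors): EvalPoint | undefined {
  if (Object.keys(floors).length === 0) {
    throw new Error("No floor, no optimization: refusing to minimize cost unconstrained");
  }
  let best: EvalPoint | undefined;
  for (const point of points) {
    if (!meetsFloors(point, floors)) continue;
    if (best === undefined || point.costUsd < best.costUsd) best = point;
  }
  return best;
}
```

A question-type that has a floor but no measured recall is treated as recall 0 here, so a config cannot pass simply by not being measured on a type.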


Architecture Diagram

            ┌──────── configurable pipeline (the thing under test) ─────────┐
 source ──▶ parse ─▶ structure ─▶ [probe] ─▶ router(config) ─per region─▶ { code | llm | vision } ─▶ chunks ─▶ embed
            └───────────────────────────────────────────────────────────────┘     │ scored against ▼
 ┌──────────────────── Pipeline Eval & Optimization Harness ─────────────────────────────────────────────┐
 │  CORPUS              GOLD (span-anchored)            SCORERS                MODES                       │
 │  ├ synthetic (CI)    ├ L1 boundary sets             ├ boundary P/R/F1      ├ GATE  (CI, <1s, det.)     │
 │  └ real (cache)      ├ L2 region labels             ├ region detection     │   boundaries+regions+route│
 │  + query gold        └ L3 query → answer span       ├ retrieval recall@k   └ EVAL  (on-demand)         │
 │    (by question-type)                               └ cost model/meter         config sweep → (Q,$) ▷ frontier
 └───────────────────────────────────────────────────────────────────────────────────────────────────────┘

System Boundary

Inside: the corpus + manifest, all gold fixtures (boundary / region / query), the scorers, the cost model, the config-sweep runner, both run modes, and reporting.

Outside: the pipeline code itself — the harness grades it, doesn't own it. The production chunk schema, inspector, and retrieval live outside; the harness asserts against their contracts. The pipeline must expose a config surface (below) so the harness can drive it — that's the one requirement the harness imposes outward.

Key Interfaces

| Interface | Type | Consumers |
| --- | --- | --- |
| pipeline(config, doc) → chunks (parameterized, pluggable) | Function | Harness Eval mode; A/B sweeps |
| probeAndSegment(pages) → {metrics, docClass, atoms} | Function (segment.ts) | Harness; the live router |
| Span-anchored gold fixtures (boundary / region / query) | Fixture | L1–L3 scorers |
| score(actual, gold) → metrics + costModel(config, doc) → usd | Function | Gate + Eval suites |
| Composite pipeline fingerprint (per document) | DB column | Reprocessing targeting; staleness reporting |
| --emit-gold / UPDATE_GOLDENS=1 | CLI mode | Curate-by-correction authoring |

The Gold-Standard Model (span-anchored)

The harness today validates stability (snapshots) + a few assertions. It does not yet encode what the right output is. Gold standard is that ground truth — and the critical design move is that gold is anchored to document positions (section_id / char-span), never to chunk IDs. Because chunk IDs differ per config, span-anchoring is what makes the same gold score any chunking — the precondition for A/B-ing configs apples-to-apples.

Three escalating layers:

  1. Boundary set (cheap, deterministic). [(section_id, page, anchor)]: where a chunk should begin. Scored by boundary P/R/F1. Cheap because the segmenter is ~90% right on STRUCTURED docs, so we curate its output rather than author from scratch. Answers granularity (merge/split rules).
  2. Region labels (medium, deterministic). [(page, bbox, type)] for tables/figures. Reuses #49 needs_vision as a candidate signal. Answers table/figure handling.
  3. Retrieval gold (deepest; embeddings, no grouping-LLM). query → answer span(s). A retrieval is correct if any top-k chunk overlaps the gold span. Answers "is code as good as LLM" + granularity-for-retrieval: the real arbiter.
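
For the first layer, a minimal sketch of a boundary P/R/F1 scorer. It assumes a boundary is identified by its section_id alone (the doc settles section_id as the anchor basis; the page field is carried but not compared). The Boundary and BoundaryScore names are illustrative.

```typescript
// Illustrative types; matching on sectionId only is an assumption.
interface Boundary {
  sectionId: string;
  page: number;
}

interface BoundaryScore {
  precision: number;
  recall: number;
  f1: number;
}

function scoreBoundaries(predicted: Boundary[], gold: Boundary[]): BoundaryScore {
  const goldIds = new Set(gold.map((b) => b.sectionId));
  const predictedIds = new Set(predicted.map((b) => b.sectionId));

  // True positives: predicted boundaries that also appear in gold.
  let truePositives = 0;
  predictedIds.forEach((id) => {
    if (goldIds.has(id)) truePositives += 1;
  });

  const precision = predictedIds.size === 0 ? 0 : truePositives / predictedIds.size;
  const recall = goldIds.size === 0 ? 0 : truePositives / goldIds.size;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```

Low precision points at over-splitting (boundaries the author did not declare); low recall points at under-segmentation, the failure that produced page-sized chunks on FIA Section C.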

Authoring principle — curate-by-correction: --emit-gold dumps current output in gold format; a human fixes the known-wrong spots; commit. Marginal cost of a new gold doc stays small; the corpus grows organically as edge cases surface.


Measuring Quality (and why it stays deterministic)

The determinism worry resolves by separating two things often conflated:

| Layer | Measures | Deterministic? | When |
| --- | --- | --- | --- |
| Boundary F1 (vs span gold) | structural correctness (proxy) | ✅ fully | every commit (Gate) |
| Retrieval recall@k (vs answer-span gold) | true retrieval quality | ✅ at span/ID level (same model + text → same top-k) | periodic / per-config sweep |
| Answer-quality (LLM judge) | end-to-end usefulness | ❌ (LLM-judged, nondeterministic) | occasional spot-check, never a gate |

The key: measure whether the right chunk was retrieved (deterministic), not whether the generated answer is good (LLM-judged, nondeterministic). Pin the embedding-model version; float noise doesn't flip top-k. So the real quality arbiter (recall@k) drives the frontier without an LLM in the loop.
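
A sketch of span-level recall@k under stated assumptions: spans are half-open character ranges within a document (how to address a span inside a section is still open as OD2, so this coordinate system is a placeholder), and the retriever is injected as a function returning the spans of the top-k chunks in rank order. All names are hypothetical.

```typescript
// Placeholder coordinates: half-open [start, end) character offsets per document.
interface CharSpan {
  docId: string;
  start: number;
  end: number;
}

interface GoldQuery {
  query: string;
  answerSpans: CharSpan[];
}

// Returns the source spans of the retrieved chunks, best first.
type Retriever = (query: string, k: number) => CharSpan[];

function overlaps(a: CharSpan, b: CharSpan): boolean {
  return a.docId === b.docId && a.start < b.end && b.start < a.end;
}

/** Fraction of gold queries for which any top-k chunk overlaps any gold span. */
function recallAtK(gold: GoldQuery[], retrieve: Retriever, k: number): number {
  if (gold.length === 0) return 0;
  let hits = 0;
  for (const g of gold) {
    const topK = retrieve(g.query, k).slice(0, k);
    const hit = topK.some((chunk) => g.answerSpans.some((span) => overlaps(chunk, span)));
    if (hit) hits += 1;
  }
  return hits / gold.length;
}
```

Nothing here depends on chunk IDs, so the same gold set scores any chunking; no LLM is involved, only the pinned embedding model behind the retriever.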


Query Gold & the Question-Type Taxonomy

Quality is meaningless without queries. We build the query set two ways and organize it by question-type — which turns one fuzzy number into an actionable scorecard:

| Question type | Example | Stresses |
| --- | --- | --- |
| Lookup / factoid | "What's the minimum mass?" | basic recall + granularity |
| Vocabulary-mismatch | "How heavy must the car be?" | embedding robustness (favors small+context) |
| Cross-reference | "Mass limit and the penalty for it" | multi-span → argues for hierarchical retrieval |
| Scoping | "List all aerodynamic component rules" | needs the C3 subtree → tests the hierarchy |
| Definitional | "What counts as bodywork?" | are definitions retrievably chunked |
| Negative / out-of-scope | "Tire pressure rule?" (absent) | precision: not retrieving false relevance |
| Figure/table | "Dimensions in Figure 3?" | the vision path |

  • Seed-from-headings (auto, breadth): every section → a query whose gold span is that section. Free coverage; proves recall.
  • Hand-authored typed questions (depth, ~5–10/doc): the vocab-mismatch and cross-reference cases that actually discriminate between configs.
  • Per-type scoring → per-type floors. Report recall by type (lookups must be ≥0.95; cross-reference ≥0.8). Weak cross-reference → adopt hierarchical; weak vocab-mismatch → contextual prefixes. Far more actionable than one aggregate.
  • Queries are the second growth axis. In prod, a real query that retrieves badly gets categorized, span-labeled, and added as a permanent gold case (possibly a new type). Documents grow the corpus structurally; queries grow it for retrieval.
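
A sketch of the per-type scorecard, assuming each gold query has already been reduced to a (question-type, hit) outcome by the recall@k scorer. The names are illustrative; the two floors in the usage note are the example values quoted above.

```typescript
// Hypothetical outcome record: one row per gold query after retrieval scoring.
interface QueryOutcome {
  questionType: string;
  hit: boolean;
}

/** Recall per question-type instead of one aggregate number. */
function recallByType(outcomes: QueryOutcome[]): Record<string, number> {
  const totals: Record<string, { hits: number; n: number }> = {};
  for (const outcome of outcomes) {
    const tally = totals[outcome.questionType] ?? { hits: 0, n: 0 };
    tally.n += 1;
    if (outcome.hit) tally.hits += 1;
    totals[outcome.questionType] = tally;
  }
  const scorecard: Record<string, number> = {};
  for (const qType of Object.keys(totals)) {
    const tally = totals[qType];
    if (tally !== undefined) scorecard[qType] = tally.hits / tally.n;
  }
  return scorecard;
}

/** Types whose recall is under their floor; an unmeasured type counts as 0. */
function typesBelowFloor(
  scorecard: Record<string, number>,
  floors: Record<string, number>,
): string[] {
  return Object.keys(floors)
    .filter((qType) => (scorecard[qType] ?? 0) < (floors[qType] ?? 0))
    .sort();
}
```

With floors of `{ lookup: 0.95, "cross-reference": 0.8 }`, a run that scores well on lookups but poorly on cross-reference queries reports only the latter, which points at the hierarchical-retrieval experiment rather than at chunking in general.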

Pipeline Configurability & A/B (Eval mode)

For "import different pipeline versions and compare them," the pipeline must be parameterized by a config object that fully determines behavior — routing thresholds, chunk granularity, hierarchical-vs-flat, contextual-prefix on/off, embedding model, chunk-size targets. Then Eval mode sweeps the config matrix over the corpus, scoring each on quality (recall@k, per-type) and cost (modeled deterministically: LLM-routed regions × token estimate × price, plus #49 vision spend), and emits the (quality, cost) frontier. We read off the knee and ship that config. This is the optimization machine: same machinery as the gate, bigger ambition.
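
A sketch of the two pieces Eval mode needs: a deterministic cost model and a Pareto filter over scored configs. The cost formula follows the description above (LLM-routed regions × token estimate × price, plus vision spend); the flat per-region vision price and all type names are assumptions, and the modeled figure would still need calibrating against recorded spend.

```typescript
// Assumed shapes; prices and token estimates are inputs, not measured values.
type RegionRoute = "code" | "llm" | "vision";

interface RoutedRegion {
  route: RegionRoute;
  estTokens: number;
}

interface Prices {
  usdPerLlmToken: number;
  usdPerVisionRegion: number; // simplification: flat price per vision-routed region
}

/** Modeled cost: code-routed regions are free, LLM regions pay per token. */
function modeledCostUsd(regions: RoutedRegion[], prices: Prices): number {
  let usd = 0;
  for (const region of regions) {
    if (region.route === "llm") usd += region.estTokens * prices.usdPerLlmToken;
    else if (region.route === "vision") usd += prices.usdPerVisionRegion;
  }
  return usd;
}

interface ScoredConfig {
  configId: string;
  quality: number; // e.g. recall@k
  costUsd: number;
}

/** a dominates b when it is no worse on both axes and strictly better on one. */
function dominates(a: ScoredConfig, b: ScoredConfig): boolean {
  return (
    a.quality >= b.quality &&
    a.costUsd <= b.costUsd &&
    (a.quality > b.quality || a.costUsd < b.costUsd)
  );
}

/** Non-dominated configs, cheapest first. */
function paretoFrontier(points: ScoredConfig[]): ScoredConfig[] {
  return points
    .filter((p) => !points.some((q) => dominates(q, p)))
    .sort((a, b) => a.costUsd - b.costUsd);
}
```

The filter is quadratic in the number of configs, which is fine for a bounded sweep shortlist.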


Versioning & Reprocessing

Current state (verified): documents.extractor_version and chunks.{extractor_version, embedder_version} are stamped (<prompt-version>/<model>). STRUCTURE_VERSION lives only in cached JSON, not the DB. There is no composite pipeline fingerprint and no segment_version — so we cannot today answer "which docs were ingested under config X."

Design addition — composite pipeline fingerprint per document: parse·structure·segment·extractor·embedder·routing-config-hash. Then "who needs reprocessing after fix X" = WHERE fingerprint predates X AND X's scope applies.
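
A sketch of how such a fingerprint could be built, assuming each stage exposes a version string and the routing config is a JSON-like object. The component order and the "·" separator follow the line above; the SHA-256 choice, the 12-character truncation, and the function names are assumptions.

```typescript
import { createHash } from "node:crypto";

// Assumed: one version string per stage, as already stamped for extractor/embedder.
interface StageVersions {
  parse: string;
  structure: string;
  segment: string;
  extractor: string;
  embedder: string;
}

/** JSON with sorted object keys, so key order never changes the hash. */
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map((v) => stableStringify(v)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
    return `{${body.join(",")}}`;
  }
  return JSON.stringify(value);
}

/** parse·structure·segment·extractor·embedder·routing-config-hash */
function pipelineFingerprint(versions: StageVersions, routingConfig: unknown): string {
  const configHash = createHash("sha256")
    .update(stableStringify(routingConfig))
    .digest("hex")
    .slice(0, 12);
  return [
    versions.parse,
    versions.structure,
    versions.segment,
    versions.extractor,
    versions.embedder,
    configHash,
  ].join("·");
}
```

Keeping the components separable (rather than hashing everything into one opaque digest) is what allows a query like "segment version older than the fix" to run against the stored value.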

Reprocessing is targeted + incremental, never blanket:

  • Targeted — only docs the upgrade actually affects (structured-chunking fix → only STRUCTURED docs; figure fix → only docs with figures).
  • Incremental — reuse cached upstream stages (segmenter-only change → skip parse/OCR, reuse words.json, re-run segment→chunk→embed). Dovetails with the Incremental Re-Ingestion sub-system; stacks with #49/#50.
  • Prioritized — docs currently below the quality floor first, paying tiers first. The floor makes prioritization objective.
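
The three properties above can be sketched as a filter, a stage slice, and a sort. The record shape is hypothetical; in particular `ingestedBeforeFix` stands in for whatever comparison the composite fingerprint ends up supporting.

```typescript
type DocClass = "STRUCTURED" | "FLAT-VERSE" | "PROSE";
type Stage = "parse" | "structure" | "segment" | "chunk" | "embed";

const STAGE_ORDER: Stage[] = ["parse", "structure", "segment", "chunk", "embed"];

// Hypothetical per-document record.
interface DocRecord {
  docId: string;
  docClass: DocClass;
  hasFigures: boolean;
  ingestedBeforeFix: boolean; // placeholder for a fingerprint comparison
  belowFloor: boolean;
  payingTier: boolean;
}

interface Upgrade {
  firstChangedStage: Stage;
  appliesTo: (doc: DocRecord) => boolean; // the upgrade's scope
}

/** Incremental: re-run only from the first changed stage; earlier outputs come from cache. */
function stagesToRerun(upgrade: Upgrade): Stage[] {
  return STAGE_ORDER.slice(STAGE_ORDER.indexOf(upgrade.firstChangedStage));
}

/** Targeted + prioritized: in-scope stale docs, below-floor first, then paying tiers. */
function reprocessingQueue(docs: DocRecord[], upgrade: Upgrade): DocRecord[] {
  return docs
    .filter((doc) => doc.ingestedBeforeFix && upgrade.appliesTo(doc))
    .sort(
      (a, b) =>
        Number(b.belowFloor) - Number(a.belowFloor) ||
        Number(b.payingTier) - Number(a.payingTier) ||
        a.docId.localeCompare(b.docId),
    );
}
```

For a segmenter-only fix scoped to STRUCTURED docs, `stagesToRerun` yields segment, chunk, embed, matching the "skip parse/OCR" case described above.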

In-App Quality Feedback → Corpus (later)

The prod-sourced query-growth loop: a 👎 on a chat answer captures (query, retrieved chunks, doc) → triage → tag with a question-type → add as a permanent gold case. Constraints: explicit consent to use tenant doc + query for pipeline development (D13 territory); a privacy-preserving fallback — even without doc rights, the aggregate signal (👎 + which chunks missed) is useful telemetry and we can synthesize equivalent gold without the sensitive content. Later feature; it's what makes the optimization machine self-improving in production.


Eval Dimensions → Design Questions

| Design question | Measured via | LLM-free? |
| --- | --- | --- |
| Granularity / tiny-merge / huge-split | L1 boundary F1 | ✅ |
| Tables & figures inside structured docs | L2 region detection | ✅ |
| Chunk-record validity (bbox in-bounds, section_id) | schema validators | ✅ |
| Routing / WEAK-band threshold | docClass accuracy over labeled corpus | ✅ |
| Code-vs-LLM; how much the code path replaces | L3 recall@k: code vs LLM chunks, same span gold | ✅ (embeddings) |
| Hierarchical vs flat chunking | L3 recall@k by question-type (esp. cross-reference/scoping) | ✅ (embeddings) |
| Operating config | Eval-mode frontier knee under per-type floors | ✅ (cost modeled) |

Open Decisions (red-team / blue-team targets)

Resolved in the 2026-06-02 red-team (see Decisions Log): gold volume/authoring (OD1 partial), gold-rot-vs-versions, index scope, reprocessing billing (OD11), baseline-first, cost-model calibration, quality-floor calibration (OD5), hierarchical-as-experiment (OD9), sweep embedding cost, no-knee operating rule, probe-confidence fallback, Eval-gates-merge (OD8).

Still open — for the integration epic / blue-team:

  • OD2 — Span-anchor serialization. Anchor basis is settled (section_id). Open: how to address a span within a section (char offset? paragraph index?) for retrieval gold, so it's precise without being brittle to re-parsing.
  • OD6 — Answer-judge cadence. The LLM answer-quality spot-check is explicitly non-gate; open: how often it runs and what triggers it (pre-ship? per-release?).
  • OD7 — Corpus growth + copyright at scale. Golden compaction (in tech debt), digesting large cache-only docs, and the policy for third-party-copyrighted docs as the corpus grows.
  • OD10 — Pipeline config surface. What fields the config object exposes, and how invasive parameterizing the existing pipeline is to expose them (scope estimate needed before Eval mode is real).
  • OD12 — Feedback-capture consent model. The opt-in UX for using a tenant's doc+query in the corpus, and the privacy-preserving fallback (aggregate signal + synthesized equivalents) when rights are absent. Later feature.
  • OD3 — Hold-out discipline. First-curate docs are settled (clean reg / nasty FIA-C / negative prose). Open: the size + rotation of the tuning hold-out set so we measure over-fit, not just in-sample fit.

| Epic | Doc | Status | Summary |
| --- | --- | --- | --- |
| Segmenter integration (hybrid router) | (to be written) | Planned | Wire segment.ts into the live pipeline (per-region routing); harness gates it. |
| Incremental Re-Ingestion | sub-systems/incremental-re-ingestion.md | Existing | The reuse-cached-stages substrate targeted/incremental reprocessing builds on. |
| #49 image-gating | GitHub #49 | Shipped | Per-page needs_vision; the figure signal L2 reuses. |
| #50 cheaper-model spike | GitHub #50 | Planned | Code-driven chunking shrinks the LLM surface; compounds the cost win. |

Cross-Cutting Concerns

| Concern | How This Sub-system Is Affected |
| --- | --- |
| Cost (D16/D18) | Code-driven chunking removes most grouping-LLM spend; the harness proves quality holds before banking it, and prices the frontier. |
| Multi-tenancy / privacy (D13) | Gold is dev-time; real copyrighted docs stay local. In-app feedback needs consent + a privacy-preserving path. |
| LLM-does-semantics / code-does-mechanics | Made testable: everything left of the grouping LLM is deterministic → gradeable offline. |
| Incremental re-ingestion | The reuse-cached-stages mechanism is what makes reprocessing affordable. |
| Local CI/CD for agentic coding | Gate mode runs in the existing ci.sh pre-commit gate; no cloud-CI dependency. |

Decisions Log

| Date | Decision | Rationale | Alternatives Considered |
| --- | --- | --- | --- |
| 2026-06-02 | Harness is LLM-free in the Gate; deterministic stages graded offline | Fast, reproducible, pre-commit-safe | Always include an LLM eval (too slow/costly/nondeterministic for a gate) |
| 2026-06-02 | Code chunking is the default; the LLM is a per-region fallback | Structured authors already encode semantic units in numbering; code is more correct + ~free there | LLM-grouping everything (cost, nondeterminism, no better on structured) |
| 2026-06-02 | Objective = minimize cost subject to a per-type quality floor; operate at the frontier knee | Pipeline controls cost; quality floor prevents cheating into garbage; knee is objective | Pure cost-min (garbage); pure quality-max (unbounded cost) |
| 2026-06-02 | Gold is span-anchored, not chunk-ID-anchored | Lets the same gold score any chunking → config A/B is apples-to-apples | Chunk-ID gold (breaks across configs) |
| 2026-06-02 | Quality measured as span-level retrieval recall@k, not LLM-judged answers | Deterministic and truthful; no LLM in the gate | Answer-judge (nondeterministic); boundary-only (proxy, not truth) |
| 2026-06-02 | Two expectation layers: snapshot (stability) + assertions (correctness); baseline ≠ blessed | Snapshots catch drift; assertions claim only known-correct | Snapshot-only (blesses current); assertion-only (misses drift) |
| 2026-06-02 | Structurability judged by doc-relative signals, never absolute thresholds | Fixed margins broke when a doc's layout shifted right (STEM Competition) | Absolute x-thresholds (fragile) |
| 2026-06-02 (RT) | Gold query volume: ~30+/doc across types via seed-from-headings + LLM-generated hard queries with human-validated answer spans | recall@k on ~10 queries is noise; need volume without hand-authoring everything | Hand-author all (too slow → caps corpus); tiny set (unreliable frontier) |
| 2026-06-02 (RT) | Gold anchored to section_id; on doc re-issue, revalidate only changed sections (a diff, not a re-curation) | FIA docs re-issue monthly (Iss 06 → 18); text/page anchors rot | Frozen per-issue snapshots (multiplies load); accept rot (frontier lies) |
| 2026-06-02 (RT) | Eval both per-doc-isolated (clean config A/B) and whole-corpus (realism check), reported as distinct metrics | recall@k depends on index distractor composition | Whole-corpus only (non-comparable history); isolated only (optimistic) |
| 2026-06-02 (RT) | Quality reprocessing never re-bills: grandfather the customer's chunk count; we absorb the recount | Chunk count is the D18 billing axis; never punish customers for our upgrade | Re-bill new count (trust landmine); freeze-at-ingest (accounting drift) |
| 2026-06-02 (RT) | Baseline-first: current prod config is the first frontier point; every config reported as a delta vs it | Without a baseline 'improvement' is unprovable + a regression vs prod could ship unnoticed | Absolute-only metrics |
| 2026-06-02 (RT) | Modeled cost calibrated against recordedCostUsd with a published error band; recalibrate on pricing/model change | An unvalidated 2× error makes the cost axis (the whole point) lie | Modeled-only (unvalidated); always-real (slow + costs LLM spend per config) |
| 2026-06-02 (RT) | Quality floor = 'no regression vs prod baseline' + a per-type lift, recalibrated with beta feedback | Calibrating from your own corpus is circular; prod baseline is a real reference | Absolute target (arbitrary); defer (optimizer left unconstrained) |
| 2026-06-02 (RT) | Hierarchical (small-to-big) retrieval is an experimental config, promoted to core only if it wins on cross-ref/scoping queries | Hierarchy is already recovered → cheap to test; don't commit schema/retrieval/inspector surface before measurement | Core now (premature surface); defer (leaves cross-ref quality on the table) |
| 2026-06-02 (RT) | Config sweeps use hash-cached embeddings (reuse by chunk-text hash) + bounded config/corpus shortlist; full-corpus sweeps occasional | Only chunks that actually changed re-embed; keeps the optimizer cheap to run | Re-embed all per sweep (scales badly); cheap-model proxy (false winners) |
| 2026-06-02 (RT) | Operating rule = minimize cost subject to the per-type floor; the knee is a heuristic, the floor is the decision rule | A frontier may have no clean knee; the floor always yields a well-defined operating point | Manual judgment (subjective); fixed cost budget (arbitrary) |
| 2026-06-02 (RT) | Probe emits a confidence; low-confidence docs route to the safe LLM path and are flagged for the corpus | Confident misroutes ship garbage chunks; this degrades gracefully + feeds the growth loop | Trust the probe (silent bad chunks); run both paths (doubles cost on the ambiguous docs) |
| 2026-06-02 (RT) | An Eval run gates pipeline-logic merges (baseline-delta + floor check); owner = whoever changes the pipeline (resolves OD8) | Makes the optimizer non-skippable rather than a thing someone has to remember | Scheduled sweeps (decoupled from changes); ad-hoc (never runs) |
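
To illustrate the hash-cached-embeddings decision in the log above, here is a minimal in-memory sketch. It assumes the cache key is a SHA-256 of the pinned model version plus the chunk text, and it uses a synchronous injected embedder for brevity (a real embedding call would be asynchronous and the cache persistent). The names are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Assumed embedder signature: one vector per input text, same order.
type Embedder = (texts: string[]) => number[][];

function makeCachedEmbedder(modelVersion: string, embed: Embedder) {
  const cache = new Map<string, number[]>();
  let misses = 0;

  // The model version is part of the key: a new model must never reuse old vectors.
  const keyFor = (text: string): string =>
    createHash("sha256").update(`${modelVersion}\n${text}`).digest("hex");

  function embedAll(texts: string[]): number[][] {
    // Only texts not seen before (deduplicated) reach the embedder.
    const missing = Array.from(new Set(texts.filter((t) => !cache.has(keyFor(t)))));
    if (missing.length > 0) {
      const vectors = embed(missing);
      missing.forEach((text, i) => {
        const vector = vectors[i];
        if (vector === undefined) throw new Error("embedder returned too few vectors");
        cache.set(keyFor(text), vector);
      });
      misses += missing.length;
    }
    return texts.map((text) => {
      const vector = cache.get(keyFor(text));
      if (vector === undefined) throw new Error("vector missing after cache fill");
      return vector;
    });
  }

  return { embedAll, missCount: () => misses };
}
```

Across a sweep, two configs that produce the same chunk text for a section share one embedding call, so the embedding cost of a config is proportional to the chunks it actually changes.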

Known Issues / Tech Debt

| Issue | Severity | Notes |
| --- | --- | --- |
| No composite pipeline fingerprint / segment_version in DB | Med | Only extractor_version stamped; can't yet target reprocessing by config. Needed before reprocessing is real. |
| Goldens are pretty-printed → noisy multi-thousand-line diffs (FIA C ≈ 1,132 atoms) | Med | Compact JSON and/or digest large cache-only docs. |
| segment.ts not wired into the live pipeline |  | By design: validated in isolation; integration is a separate epic the harness gates. |
| Mixed numbering scheme within one doc (FIA C body C-numbered vs bare-numbered appendix) | Med | Caused a 22.9k-word "section" blob in the spike; needs per-region detection. |
| Pipeline not yet parameterized by a config object | Med | Required for Eval-mode A/B; scope of the change is OD10. |

Sub-system docs define architectural boundaries and product-level capabilities. This one defines the evaluation + optimization surface every chunking/segmentation/routing/reprocessing decision depends on. If removed, we lose the ability to change the pipeline safely or tune it toward the cost/quality frontier.
