Pipeline Evaluation & Optimization Harness — Sub-system Design Doc

An evaluation + optimization platform over the ingestion pipeline (parse → structure → segment → chunk → embed → retrieve), with no LLM in the gate. It scores any pipeline configuration against curated, span-anchored gold-standard output on two axes, quality and cost, and surfaces the Pareto frontier so we can operate at the knee: lowest cost at acceptable quality. The corpus itself grows in two directions over time (documents and queries), turning unforeseen edge cases into permanent regression guards.

Status: DRAFT v2 — for review + red-team/blue-team. Authored 2026-06-02.

Architecture

Overview

Risks & Constraints

Risk	Likelihood	Impact	Mitigation
Gold-standard authoring is a slog and stalls the effort	Med	High	Curate-by-correction (emit current output → hand-fix → commit). Seed query gold from section headings. Start boundary-only, 3 docs.
Heuristics over-fit to the corpus (great in-sample, fail in the wild)	Med	Med	Grow corpus with adversarial edge cases; keep a hold-out set never used for tuning; score per doc-class and per question-type, not just aggregate.
Optimizer cheats into cheap-but-bad configs	Med	High	The objective is min-cost subject to a hard quality floor (per question-type). No floor → no optimization.
Reprocessing cost when an upgrade fixes quality for existing customers	High	Med	Versioned + targeted + incremental reprocessing (reuse cached upstream stages); prioritize docs below the floor + paying tiers.
Retrieval-eval mistaken as nondeterministic	—	—	Resolved: measure span-level recall@k (deterministic), not LLM-judged answer quality. Pin the embedding-model version.
Copyright / repo bloat as the corpus grows	Med	Low	Goldens carry ids/spans/counts only, no body text; source PDFs stay in gitignored `cache/`; compact golden JSON.
In-app feedback corpus uses tenant data without rights	Med	High	Explicit opt-in consent; privacy-preserving fallback (aggregate signal + synthesized equivalents) when rights are absent.

Current Status

Capability	Status
L0 — `segment.ts` probe + section segmenter (classification + atoms)	Shipped (2026-06-02)
L0 — `segment.test.ts` golden harness: snapshot + assertions, 12 docs / 4 classes, <1s, in pre-commit gate	Shipped
L1 — Boundary-set gold (span-anchored) + boundary P/R/F1 scorer	Planned
L2 — Region labels (tables/figures) + detection scorer	Planned
L3 — Retrieval gold (query → answer span) + recall@k scorer (embeddings)	Planned
Query-type taxonomy + per-type scorecards/floors	Planned
Config-parameterized pipeline + A/B frontier sweep (Eval mode)	Planned
Composite pipeline-version fingerprint + targeted/incremental reprocessing	Planned
In-app quality feedback → corpus (consented)	Planned (later)

The Story

This session began as extraction-cost work (#49 — gate page images to figure/sparse pages). Diagnosing it pulled us into chunking: on FIA Section C the chunks came out page-sized. Root cause was the structure stage under-segmenting — it keys headings on font size, but FIA/STEM section headers render at the same height as body text, so every line classifies as body and coalesces into one page-sized atom. The real boundary signal is the document's own indentation + numbering, which the pipeline ignored.

A deterministic spike proved we can recover the full section hierarchy (C1–C18, 910+ sections) from geometry alone, and that documents fall into a few structural classes (STRUCTURED / FLAT-VERSE / PROSE) separable by cheap, doc-relative signals (sequentiality, header coverage, indent gap). That produced segment.ts and a golden-corpus harness.

The pivotal realization: the integration design questions — granularity, table/figure handling, routing, code-vs-LLM — can't be answered by argument, only by measurement. Chatting it through, the idea grew from a regression harness into a pipeline optimization platform: parameterize the pipeline, evaluate configs on quality + cost, and operate at the frontier knee. Chunking quality is the Autri wedge (the inspector's whole value is legible, correct extraction); this is the instrument that drives it, never regresses it, and self-improves as documents and real user queries flow in.

What Is This Sub-system?

A development-time evaluation + optimization platform over the ingestion pipeline. It owns a growing corpus of real + synthetic documents, curated span-anchored gold-standard outputs at escalating fidelity (boundaries → regions → retrieval), a query gold set organized by question-type, and the scorers that grade any pipeline configuration on quality and cost. It runs in two modes: a fast deterministic Gate (pre-commit) and an on-demand Eval that sweeps configs and emits the (quality, cost) frontier. It exists as its own layer because every chunking, segmentation, routing, and reprocessing decision downstream depends on it as the source of truth.

Chunking Philosophy: the LLM is a Fallback, not the Engine

The governing principle, settled by reasoning + the spike:

Deterministic code is the default, and for structured technical docs it is not a compromise — it's more correct. A reg author already encoded the semantic units in the numbering (C4.1 Minimum mass is one rule, deliberately). We read that structure; we don't pay an LLM to re-infer it.
The unit is the authored section/subsection, not the line. "Every line = a chunk" overshoots into fragments: embeddings become underspecified and a single answer smears across chunks (the heading line and the number line separate). The semantic unit — the subsection (regs) or paragraph (prose) — is the target.
The hierarchy is the asset. The segmenter recovers the whole tree (C3 → C3.5 → C3.5.1), which enables small-to-big retrieval: embed fine for precise matching, return the parent section for complete context. This may dissolve the granularity dilemma entirely (embed at the semantic unit, display at the readable unit).
Selective, per-region LLM triggering. The router decides code | llm | vision per region, not per document. A STRUCTURED doc is mostly code-chunked sections, with the LLM firing only on the spans that earn it. This is what lets cost track quality.
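
The small-to-big idea in the list above can be sketched as follows. This is a minimal illustration, not the pipeline's retrieval code: the `Section` shape and the helper names are hypothetical, and it assumes dotted section ids like `C3.5.1` where the parent is everything before the last dot.

```typescript
// Hypothetical shape; the production chunk schema lives outside the harness.
interface Section {
  id: string;
  text: string;
}

// "C3.5.1" -> "C3.5"; a top-level id such as "C3" has no parent.
// Assumption: ids are dot-separated.
function parentId(sectionId: string): string | null {
  const cut = sectionId.lastIndexOf(".");
  return cut === -1 ? null : sectionId.slice(0, cut);
}

// Small-to-big: match on fine-grained sections, return each hit's parent.
// Parents are deduplicated and kept in first-hit order. A hit with no known
// parent is returned as-is.
function expandToParents(hitIds: string[], byId: Map<string, Section>): Section[] {
  const seen = new Set<string>();
  const out: Section[] = [];
  for (const hit of hitIds) {
    const pid = parentId(hit);
    const target = pid !== null && byId.has(pid) ? pid : hit;
    const section = byId.get(target);
    if (section !== undefined && !seen.has(target)) {
      seen.add(target);
      out.push(section);
    }
  }
  return out;
}
```

Two sibling hits (`C3.5.1`, `C3.5.2`) collapse into one returned parent (`C3.5`), which is the "embed at the semantic unit, display at the readable unit" behavior described above. Ids that do not follow the dotted scheme would need a different parent lookup.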

Where the LLM is genuinely irreplaceable — all cases where the document does not declare its own boundaries:

Unstructured prose (novels, proposals) — ideal chunks span paragraphs in ways only semantics sees.
Tables & figures — non-linear meaning; code detects the region, the model renders it to retrievable text (overlaps #49 needs_vision).
Edge repair — a giant authored section with no sub-numbering that's really many ideas (rare).

Where it is not needed (often mistaken as LLM-only): content inside one authored section stays together for free; syntactic continuation ("next paragraph starts mid-sentence, lowercase, no terminal punctuation") is a deterministic merge the pipeline already does. The LLM is reserved for semantic grouping in unstructured text — a far smaller surface than it first appears. Even then, the LLM decides boundaries; code does the mechanics.
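
The syntactic-continuation merge mentioned above can be sketched as a pure function. The rule below is one reading of "starts mid-sentence, lowercase, no terminal punctuation": merge when the previous paragraph does not end in terminal punctuation and the next one starts with a lowercase letter. The function names and the exact punctuation set are assumptions, not the pipeline's actual rule.

```typescript
// Assumed punctuation set that closes a paragraph.
const TERMINAL = ".!?:;";

// True when `next` looks like the continuation of an unfinished `prev`.
function isContinuation(prev: string, next: string): boolean {
  const prevEnd = prev.trimEnd().slice(-1);
  const nextStart = next.trimStart().charAt(0);
  const prevOpen = prevEnd !== "" && TERMINAL.indexOf(prevEnd) === -1;
  // A lowercase letter differs from its uppercase form; digits and symbols do not.
  const startsLower =
    nextStart !== "" &&
    nextStart === nextStart.toLowerCase() &&
    nextStart !== nextStart.toUpperCase();
  return prevOpen && startsLower;
}

// Fold continuation paragraphs into their predecessor, left to right.
function mergeContinuations(paragraphs: string[]): string[] {
  const out: string[] = [];
  for (const p of paragraphs) {
    const last = out[out.length - 1];
    if (last !== undefined && isContinuation(last, p)) {
      out[out.length - 1] = last.trimEnd() + " " + p.trimStart();
    } else {
      out.push(p);
    }
  }
  return out;
}
```

Because the rule looks only at characters, it is deterministic and needs no model call.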

The Objective: Minimize Cost Under a Quality Floor

"Maximize profit at high quality" becomes precise as follows: the pipeline controls cost (LLM spend); revenue is set by pricing (D18); so the objective is to minimize cost subject to quality ≥ a floor.

The quality/cost frontier has a knee — a point past which more spend buys almost no quality. That knee is a data fact, not a taste call (the objective is objective, per Dan). We operate at the knee.
The quality floor is a hard constraint and must be a number (e.g. retrieval recall@k thresholds, set per question-type). Without it, "minimize cost" trivially → cheapest garbage.
Asymmetry principle: chunking cost is one-time (ingestion); chunking quality is forever (every query the doc serves). Below the floor, never trade quality for the ingestion saving — the saving is paid once, bad answers recur indefinitely.

The harness's job: draw the frontier and locate the knee; the floor guards the optimizer from cheating.
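
As a minimal sketch of "minimize cost subject to a hard floor", the selection step could look like the following. `ConfigResult`, `meetsFloors`, and `pickOperatingPoint` are hypothetical names; the recall and cost numbers would come from the scorers and the cost model.

```typescript
// One evaluated pipeline configuration (hypothetical shape).
interface ConfigResult {
  name: string;
  costUsd: number;
  recallByType: Record<string, number>; // question-type -> recall@k
}

// A config is feasible only if every floored question-type meets its floor.
// A type with no measurement counts as recall 0, so it fails the floor.
function meetsFloors(r: ConfigResult, floors: Record<string, number>): boolean {
  return Object.keys(floors).every(
    (type) => (r.recallByType[type] ?? 0) >= (floors[type] ?? 0)
  );
}

// Cheapest feasible config, or null when nothing clears the floors.
// Refuses to run without floors: no floor, no optimization.
function pickOperatingPoint(
  results: ConfigResult[],
  floors: Record<string, number>
): ConfigResult | null {
  if (Object.keys(floors).length === 0) {
    throw new Error("no quality floor set; refusing to minimize cost");
  }
  const feasible = results.filter((r) => meetsFloors(r, floors));
  if (feasible.length === 0) return null;
  return feasible.reduce((best, r) => (r.costUsd < best.costUsd ? r : best));
}
```

Throwing when no floor is supplied makes the "cheapest garbage" failure mode impossible to reach by accident.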

Architecture Diagram

            ┌──────── configurable pipeline (the thing under test) ─────────┐
 source ──▶ parse ─▶ structure ─▶ [probe] ─▶ router(config) ─per region─▶ { code | llm | vision } ─▶ chunks ─▶ embed
            └───────────────────────────────────────────────────────────────┘     │ scored against ▼
 ┌──────────────────── Pipeline Eval & Optimization Harness ─────────────────────────────────────────────┐
 │  CORPUS              GOLD (span-anchored)            SCORERS                MODES                       │
 │  ├ synthetic (CI)    ├ L1 boundary sets             ├ boundary P/R/F1      ├ GATE  (CI, <1s, det.)     │
 │  └ real (cache)      ├ L2 region labels             ├ region detection     │   boundaries+regions+route│
 │  + query gold        └ L3 query → answer span       ├ retrieval recall@k   └ EVAL  (on-demand)         │
 │    (by question-type)                               └ cost model/meter         config sweep → (Q,$) ▷ frontier
 └───────────────────────────────────────────────────────────────────────────────────────────────────────┘

System Boundary

Inside: the corpus + manifest, all gold fixtures (boundary / region / query), the scorers, the cost model, the config-sweep runner, both run modes, and reporting.

Outside: the pipeline code itself — the harness grades it, doesn't own it. The production chunk schema, inspector, and retrieval live outside; the harness asserts against their contracts. The pipeline must expose a config surface (below) so the harness can drive it — that's the one requirement the harness imposes outward.

Key Interfaces

Interface	Type	Consumers
`pipeline(config, doc) → chunks` (parameterized, pluggable)	Function	Harness Eval mode; A/B sweeps
`probeAndSegment(pages) → {metrics, docClass, atoms}`	Function (`segment.ts`)	Harness; the live router
Span-anchored gold fixtures (boundary / region / query)	Fixture	L1–L3 scorers
`score(actual, gold) → metrics` + `costModel(config, doc) → usd`	Function	Gate + Eval suites
Composite pipeline fingerprint (per document)	DB column	Reprocessing targeting; staleness reporting
`--emit-gold` / `UPDATE_GOLDENS=1`	CLI mode	Curate-by-correction authoring

The Gold-Standard Model (span-anchored)

The harness today validates stability (snapshots) + a few assertions. It does not yet encode what the right output is. Gold standard is that ground truth — and the critical design move is that gold is anchored to document positions (section_id / char-span), never to chunk IDs. Because chunk IDs differ per config, span-anchoring is what makes the same gold score any chunking — the precondition for A/B-ing configs apples-to-apples.

Three escalating layers:

Boundary set (cheap, deterministic) — [(section_id, page, anchor)]: where a chunk should begin. Scored by boundary P/R/F1. Cheap because the segmenter is ~90% right on STRUCTURED docs — we curate its output, not author from scratch. Answers granularity (merge/split rules).
Region labels (medium, deterministic) — [(page, bbox, type)] for tables/figures. Reuses #49 needs_vision as a candidate signal. Answers table/figure handling.
Retrieval gold (deepest; embeddings, no grouping-LLM) — query → answer span(s). A retrieval is correct if any top-k chunk overlaps the gold span. Answers "is code as good as LLM" + granularity-for-retrieval — the real arbiter.

Authoring principle — curate-by-correction: --emit-gold dumps current output in gold format; a human fixes the known-wrong spots; commit. Marginal cost of a new gold doc stays small; the corpus grows organically as edge cases surface.
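
A boundary P/R/F1 scorer for the first gold layer might be sketched like this. It assumes a boundary is identified by `(section_id, page)` and that a predicted boundary counts only on an exact match; the `Boundary` type and `boundaryPRF` name are illustrative, and the real scorer may match on the anchor with some tolerance.

```typescript
// Hypothetical boundary record: where a chunk should begin.
interface Boundary {
  sectionId: string;
  page: number;
}

const boundaryKey = (b: Boundary): string => b.sectionId + "@" + b.page;

// Set-based precision / recall / F1 of predicted boundaries against gold.
function boundaryPRF(
  predicted: Boundary[],
  gold: Boundary[]
): { precision: number; recall: number; f1: number } {
  const goldKeys = new Set(gold.map(boundaryKey));
  const predKeys = new Set(predicted.map(boundaryKey));
  let truePositives = 0;
  predKeys.forEach((k) => {
    if (goldKeys.has(k)) truePositives += 1;
  });
  const precision = predKeys.size === 0 ? 0 : truePositives / predKeys.size;
  const recall = goldKeys.size === 0 ? 0 : truePositives / goldKeys.size;
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```

Because both sides are keyed by document position rather than chunk id, the same gold file can score the output of any chunking config.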

Measuring Quality (and why it stays deterministic)

The determinism worry resolves by separating two things often conflated:

Layer	Measures	Deterministic?	When
Boundary F1 (vs span gold)	structural correctness (proxy)	✅ fully	every commit (Gate)
Retrieval recall@k (vs answer-span gold)	true retrieval quality	✅ at span/ID level — same model + text → same top-k	periodic / per-config sweep
Answer-quality (LLM judge)	end-to-end usefulness	❌	occasional spot-check, never a gate

The key: measure whether the right chunk was retrieved (deterministic), not whether the generated answer is good (LLM-judged, nondeterministic). Pin the embedding-model version; float noise doesn't flip top-k. So the real quality arbiter (recall@k) drives the frontier without an LLM in the loop.
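
Span-level recall@k can be sketched as a pure function over ranked chunk spans. The sketch assumes half-open character ranges within one document; how spans are actually serialized is still open (OD2), so `Span`, `hitAtK`, and `recallAtK` are illustrative names only.

```typescript
// Half-open character range [start, end). The addressing scheme is an
// assumption; OD2 leaves span serialization open.
interface Span {
  start: number;
  end: number;
}

// One evaluated query: retrieved chunk spans in rank order, plus gold answer spans.
interface QueryRun {
  ranked: Span[];
  goldSpans: Span[];
}

const overlaps = (a: Span, b: Span): boolean => a.start < b.end && b.start < a.end;

// A retrieval is correct if any of the top-k chunks overlaps any gold span.
function hitAtK(run: QueryRun, k: number): boolean {
  return run.ranked
    .slice(0, k)
    .some((chunk) => run.goldSpans.some((g) => overlaps(chunk, g)));
}

// Fraction of queries with a hit in the top k.
function recallAtK(runs: QueryRun[], k: number): number {
  if (runs.length === 0) return 0;
  return runs.filter((r) => hitAtK(r, k)).length / runs.length;
}
```

Given the same ranked list, the score is a fixed number with no model in the loop. Negative / out-of-scope queries have no gold span and would need a separate precision-style check rather than this function.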

Query Gold & the Question-Type Taxonomy

Quality is meaningless without queries. We build the query set two ways and organize it by question-type — which turns one fuzzy number into an actionable scorecard:

Question type	Example	Stresses
Lookup / factoid	"What's the minimum mass?"	basic recall + granularity
Vocabulary-mismatch	"How heavy must the car be?"	embedding robustness (favors small+context)
Cross-reference	"Mass limit and the penalty for it"	multi-span → argues for hierarchical retrieval
Scoping	"List all aerodynamic component rules"	needs the C3 subtree → tests the hierarchy
Definitional	"What counts as bodywork?"	are definitions retrievably chunked
Negative / out-of-scope	"Tire pressure rule?" (absent)	precision — not retrieving false relevance
Figure/table	"Dimensions in Figure 3?"	the vision path

Seed-from-headings (auto, breadth): every section → a query whose gold span is that section. Free coverage; proves recall.
Hand-authored typed questions (depth, ~5–10/doc): the vocab-mismatch and cross-reference cases that actually discriminate between configs.
Per-type scoring → per-type floors. Report recall by type against a floor for each (e.g. lookups ≥0.95; cross-reference ≥0.8). Weak cross-reference → adopt hierarchical retrieval; weak vocab-mismatch → add contextual prefixes. Far more actionable than one aggregate.
Queries are the second growth axis. In prod, a real query that retrieves badly gets categorized, span-labeled, and added as a permanent gold case (possibly a new type). Documents grow the corpus structurally; queries grow it for retrieval.
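
A per-type scorecard could be computed along these lines. The outcome shape and function names are hypothetical, and the floor values in the usage note are only the illustrative ones quoted above.

```typescript
// One scored query: its question-type and whether retrieval hit the gold span.
interface QueryOutcome {
  type: string;
  hit: boolean;
}

// Recall per question-type instead of one aggregate number.
function recallByType(outcomes: QueryOutcome[]): Record<string, number> {
  const totals: Record<string, { hits: number; n: number }> = {};
  for (const o of outcomes) {
    const t = totals[o.type] ?? (totals[o.type] = { hits: 0, n: 0 });
    t.n += 1;
    if (o.hit) t.hits += 1;
  }
  const out: Record<string, number> = {};
  for (const type of Object.keys(totals)) {
    const t = totals[type];
    if (t !== undefined) out[type] = t.hits / t.n;
  }
  return out;
}

// Question-types whose recall is under their floor, sorted by name.
// A floored type with no queries at all counts as recall 0.
function typesBelowFloor(
  recall: Record<string, number>,
  floors: Record<string, number>
): string[] {
  return Object.keys(floors)
    .filter((type) => (recall[type] ?? 0) < (floors[type] ?? 0))
    .sort();
}
```

With floors of `{ lookup: 0.95, "cross-reference": 0.8 }`, a run that answers every lookup but only three of four cross-reference queries reports `["cross-reference"]`, which points at the hierarchical-retrieval lever rather than at a vague aggregate drop.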

Pipeline Configurability & A/B (Eval mode)

To "import different pipeline versions and compare them," the pipeline must be parameterized by a config object that fully determines behavior: routing thresholds, chunk granularity, hierarchical-vs-flat, contextual-prefix on/off, embedding model, chunk-size targets. Eval mode then sweeps the config matrix over the corpus, scoring each config on quality (recall@k, per-type) and cost (modeled deterministically: LLM-routed regions × token estimate × price, plus #49 vision spend), and emits the (quality, cost) frontier. We read off the knee and ship that config. This is the optimization machine: same machinery as the gate, bigger ambition.
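
The modeled cost and the frontier extraction can be sketched together. The cost formula follows the description above (LLM-routed regions × token estimate × price, plus vision spend); all type and function names are hypothetical, and a single quality number stands in for the per-type scores.

```typescript
// Inputs to the deterministic cost model (hypothetical field names).
interface CostInputs {
  llmRegions: number;      // regions routed to the LLM
  tokensPerRegion: number; // token estimate per region
  usdPerToken: number;     // model price
  visionUsd: number;       // vision spend (#49)
}

const modeledCostUsd = (c: CostInputs): number =>
  c.llmRegions * c.tokensPerRegion * c.usdPerToken + c.visionUsd;

// One swept config, reduced to the two axes.
interface FrontierPoint {
  name: string;
  quality: number; // e.g. recall@k
  costUsd: number;
}

// Keep the non-dominated points. A point is dominated when another point is
// at least as good on both axes and strictly better on at least one.
function paretoFrontier(points: FrontierPoint[]): FrontierPoint[] {
  const dominated = (p: FrontierPoint): boolean =>
    points.some(
      (q) =>
        q.quality >= p.quality &&
        q.costUsd <= p.costUsd &&
        (q.quality > p.quality || q.costUsd < p.costUsd)
    );
  return points.filter((p) => !dominated(p)).sort((a, b) => a.costUsd - b.costUsd);
}
```

Sorting the survivors by cost gives the curve in the order you would read it when looking for the knee.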

Versioning & Reprocessing

Current state (verified): documents.extractor_version and chunks.{extractor_version, embedder_version} are stamped (<prompt-version>/<model>). STRUCTURE_VERSION lives only in cached JSON, not the DB. There is no composite pipeline fingerprint and no segment_version — so we cannot today answer "which docs were ingested under config X."

Design addition — composite pipeline fingerprint per document: parse·structure·segment·extractor·embedder·routing-config-hash. Then "who needs reprocessing after fix X" = WHERE fingerprint predates X AND X's scope applies.

Reprocessing is targeted + incremental, never blanket:

Targeted — only docs the upgrade actually affects (structured-chunking fix → only STRUCTURED docs; figure fix → only docs with figures).
Incremental — reuse cached upstream stages (segmenter-only change → skip parse/OCR, reuse words.json, re-run segment→chunk→embed). Dovetails with the Incremental Re-Ingestion sub-system; stacks with #49/#50.
Prioritized — docs currently below the quality floor first, paying tiers first. The floor makes prioritization objective.
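
A minimal sketch of the composite fingerprint and the targeting query, under stated assumptions: the fingerprint is stored as six version strings, an upgrade names the one component it changes plus a scope predicate, and below-floor status outranks paying tier when ordering the queue. None of these shapes exist in the DB today; every name here is illustrative.

```typescript
// Fingerprint components in the order given above (hypothetical storage shape).
const PARTS = ["parse", "structure", "segment", "extractor", "embedder", "routingConfigHash"] as const;
type Part = (typeof PARTS)[number];
type Fingerprint = Record<Part, string>;

// "parse·structure·segment·extractor·embedder·routing-config-hash"
const serializeFingerprint = (f: Fingerprint): string =>
  PARTS.map((p) => f[p]).join("·");

interface DocRecord {
  id: string;
  docClass: string;
  belowFloor: boolean;
  payingTier: boolean;
  fingerprint: Fingerprint;
}

// An upgrade changes one component and applies only to docs in its scope.
interface Upgrade {
  part: Part;
  newVersion: string;
  applies: (d: DocRecord) => boolean;
}

// Targeted: the doc was not ingested under the new version AND the fix is relevant to it.
const needsReprocessing = (d: DocRecord, u: Upgrade): boolean =>
  d.fingerprint[u.part] !== u.newVersion && u.applies(d);

// Prioritized: below-floor docs first, then paying tiers (tie order is an assumption).
function reprocessingQueue(docs: DocRecord[], u: Upgrade): DocRecord[] {
  const rank = (d: DocRecord): number => (d.belowFloor ? 0 : 2) + (d.payingTier ? 0 : 1);
  return docs.filter((d) => needsReprocessing(d, u)).sort((a, b) => rank(a) - rank(b));
}
```

A structured-chunking fix would set `part: "segment"` and `applies: (d) => d.docClass === "STRUCTURED"`, so PROSE docs and docs already on the new segmenter never enter the queue.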

In-App Quality Feedback → Corpus (later)

The prod-sourced query-growth loop: a 👎 on a chat answer captures (query, retrieved chunks, doc) → triage → tag with a question-type → add as a permanent gold case. Constraints: explicit consent to use tenant doc + query for pipeline development (D13 territory); a privacy-preserving fallback — even without doc rights, the aggregate signal (👎 + which chunks missed) is useful telemetry and we can synthesize equivalent gold without the sensitive content. Later feature; it's what makes the optimization machine self-improving in production.

Eval Dimensions → Design Questions

Design question	Measured via	LLM-free?
Granularity / tiny-merge / huge-split	L1 boundary F1	✅
Tables & figures inside structured docs	L2 region detection	✅
Chunk-record validity (bbox in-bounds, section_id)	schema validators	✅
Routing / WEAK-band threshold	docClass accuracy over labeled corpus	✅
Code-vs-LLM; how much the code path replaces	L3 recall@k: code vs LLM chunks, same span gold	✅ (embeddings)
Hierarchical vs flat chunking	L3 recall@k by question-type (esp. cross-reference/scoping)	✅ (embeddings)
Operating config	Eval-mode frontier knee under per-type floors	✅ (cost modeled)

Open Decisions (red-team / blue-team targets)

Resolved in the 2026-06-02 red-team (see Decisions Log): gold volume/authoring (OD1 partial), gold-rot-vs-versions, index scope, reprocessing billing (OD11), baseline-first, cost-model calibration, quality-floor calibration (OD5), hierarchical-as-experiment (OD9), sweep embedding cost, no-knee operating rule, probe-confidence fallback, Eval-gates-merge (OD8).

Still open — for the integration epic / blue-team:

OD2 — Span-anchor serialization. Anchor basis is settled (section_id). Open: how to address a span within a section (char offset? paragraph index?) for retrieval gold, so it's precise without being brittle to re-parsing.
OD6 — Answer-judge cadence. The LLM answer-quality spot-check is explicitly non-gate; open: how often it runs and what triggers it (pre-ship? per-release?).
OD7 — Corpus growth + copyright at scale. Golden compaction (in tech debt), digesting large cache-only docs, and the policy for third-party-copyrighted docs as the corpus grows.
OD10 — Pipeline config surface. What fields the config object exposes, and how invasive parameterizing the existing pipeline is to expose them (scope estimate needed before Eval mode is real).
OD12 — Feedback-capture consent model. The opt-in UX for using a tenant's doc+query in the corpus, and the privacy-preserving fallback (aggregate signal + synthesized equivalents) when rights are absent. Later feature.
OD3 — Hold-out discipline. First-curate docs are settled (clean reg / nasty FIA-C / negative prose). Open: the size + rotation of the tuning hold-out set so we measure over-fit, not just in-sample fit.

Related Epics

Epic	Doc	Status	Summary
Segmenter integration (hybrid router)	(to be written)	Planned	Wire `segment.ts` into the live pipeline (per-region routing); harness gates it.
Incremental Re-Ingestion	`sub-systems/incremental-re-ingestion.md`	Existing	The reuse-cached-stages substrate targeted/incremental reprocessing builds on.
#49 image-gating	GitHub #49	Shipped	Per-page `needs_vision`; the figure signal L2 reuses.
#50 cheaper-model spike	GitHub #50	Planned	Code-driven chunking shrinks the LLM surface; compounds the cost win.

Cross-Cutting Concerns

Concern	How This Sub-system Is Affected
Cost (D16/D18)	Code-driven chunking removes most grouping-LLM spend; the harness proves quality holds before banking it, and prices the frontier.
Multi-tenancy / privacy (D13)	Gold is dev-time; real copyrighted docs stay local. In-app feedback needs consent + a privacy-preserving path.
LLM-does-semantics / code-does-mechanics	Made testable: everything left of the grouping LLM is deterministic → gradeable offline.
Incremental re-ingestion	The reuse-cached-stages mechanism is what makes reprocessing affordable.
Local CI/CD for agentic coding	Gate mode runs in the existing `ci.sh` pre-commit gate; no cloud-CI dependency.

Decisions Log

Date	Decision	Rationale	Alternatives Considered
2026-06-02	Harness is LLM-free in the Gate; deterministic stages graded offline	Fast, reproducible, pre-commit-safe	Always include an LLM eval (too slow/costly/nondeterministic for a gate)
2026-06-02	Code chunking is the default; the LLM is a per-region fallback	Structured authors already encode semantic units in numbering; code is more correct + ~free there	LLM-grouping everything (cost, nondeterminism, no better on structured)
2026-06-02	Objective = minimize cost subject to a per-type quality floor; operate at the frontier knee	Pipeline controls cost; quality floor prevents cheating into garbage; knee is objective	Pure cost-min (garbage); pure quality-max (unbounded cost)
2026-06-02	Gold is span-anchored, not chunk-ID-anchored	Lets the same gold score any chunking → config A/B is apples-to-apples	Chunk-ID gold (breaks across configs)
2026-06-02	Quality measured as span-level retrieval recall@k, not LLM-judged answers	Deterministic and truthful; no LLM in the gate	Answer-judge (nondeterministic); boundary-only (proxy, not truth)
2026-06-02	Two expectation layers: snapshot (stability) + assertions (correctness); baseline ≠ blessed	Snapshots catch drift; assertions claim only known-correct	Snapshot-only (blesses current); assertion-only (misses drift)
2026-06-02	Structurability judged by doc-relative signals, never absolute thresholds	Fixed margins broke when a doc's layout shifted right (STEM Competition)	Absolute x-thresholds (fragile)
2026-06-02 (RT)	Gold query volume: ~30+/doc across types via seed-from-headings + LLM-generated hard queries with human-validated answer spans	recall@k on ~10 queries is noise; need volume without hand-authoring everything	Hand-author all (too slow → caps corpus); tiny set (unreliable frontier)
2026-06-02 (RT)	Gold anchored to section_id; on doc re-issue, revalidate only changed sections (a diff, not a re-curation)	FIA docs re-issue monthly (Iss 06 → 18); text/page anchors rot	Frozen per-issue snapshots (multiplies load); accept rot (frontier lies)
2026-06-02 (RT)	Eval both per-doc-isolated (clean config A/B) and whole-corpus (realism check), reported as distinct metrics	recall@k depends on index distractor composition	Whole-corpus only (non-comparable history); isolated only (optimistic)
2026-06-02 (RT)	Quality reprocessing never re-bills — grandfather the customer's chunk count; we absorb the recount	Chunk count is the D18 billing axis; never punish customers for our upgrade	Re-bill new count (trust landmine); freeze-at-ingest (accounting drift)
2026-06-02 (RT)	Baseline-first — current prod config is the first frontier point; every config reported as a delta vs it	Without a baseline 'improvement' is unprovable + a regression vs prod could ship unnoticed	Absolute-only metrics
2026-06-02 (RT)	Modeled cost calibrated against `recordedCostUsd` with a published error band; recalibrate on pricing/model change	An unvalidated 2× error makes the cost axis — the whole point — lie	Modeled-only (unvalidated); always-real (slow + costs LLM spend per config)
2026-06-02 (RT)	Quality floor = 'no regression vs prod baseline' + a per-type lift, recalibrated with beta feedback	Calibrating from your own corpus is circular; prod baseline is a real reference	Absolute target (arbitrary); defer (optimizer left unconstrained)
2026-06-02 (RT)	Hierarchical (small-to-big) retrieval is an experimental config, promoted to core only if it wins on cross-ref/scoping queries	Hierarchy is already recovered → cheap to test; don't commit schema/retrieval/inspector surface before measurement	Core now (premature surface); defer (leaves cross-ref quality on the table)
2026-06-02 (RT)	Config sweeps use hash-cached embeddings (reuse by chunk-text hash) + bounded config/corpus shortlist; full-corpus sweeps occasional	Only chunks that actually changed re-embed; keeps the optimizer cheap to run	Re-embed all per sweep (scales badly); cheap-model proxy (false winners)
2026-06-02 (RT)	Operating rule = minimize cost subject to the per-type floor; the knee is a heuristic, the floor is the decision rule	A frontier may have no clean knee; the floor always yields a well-defined operating point	Manual judgment (subjective); fixed cost budget (arbitrary)
2026-06-02 (RT)	Probe emits a confidence; low-confidence docs route to the safe LLM path and are flagged for the corpus	Confident misroutes ship garbage chunks; this degrades gracefully + feeds the growth loop	Trust the probe (silent bad chunks); run both paths (doubles cost on the ambiguous docs)
2026-06-02 (RT)	An Eval run gates pipeline-logic merges (baseline-delta + floor check); owner = whoever changes the pipeline (resolves OD8)	Makes the optimizer non-skippable rather than a thing someone has to remember	Scheduled sweeps (decoupled from changes); ad-hoc (never runs)
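
The hash-cached embeddings decision in the log above can be illustrated with a small wrapper. This is a sketch under assumptions: `embed` stands for whatever embedding call the pipeline uses, and the 32-bit FNV-1a hash is a toy stand-in so the example stays dependency-free. A real cache would more likely use a cryptographic hash of the chunk text plus the pinned model version.

```typescript
type Embed = (text: string) => number[];

// Toy 32-bit FNV-1a hash, used here only to keep the sketch self-contained.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Wrap an embedder so identical chunk text is embedded once across a sweep.
function cachedEmbedder(embed: Embed, hash: (t: string) => string = fnv1a) {
  const cache = new Map<string, number[]>();
  let misses = 0;
  return {
    embed(text: string): number[] {
      const key = hash(text);
      let vec = cache.get(key);
      if (vec === undefined) {
        vec = embed(text); // only changed or unseen chunks reach the model
        cache.set(key, vec);
        misses += 1;
      }
      return vec;
    },
    misses: (): number => misses,
  };
}
```

Across a config sweep, two configs that produce the same chunk text for a section share one embedding call, so the miss counter tracks how many chunks actually changed.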

Known Issues / Tech Debt

Issue	Severity	Notes
No composite pipeline fingerprint / `segment_version` in DB	Med	Only `extractor_version` and `embedder_version` are stamped; can't yet target reprocessing by config. Needed before reprocessing is real.
Goldens are pretty-printed → noisy multi-thousand-line diffs (FIA C ≈ 1,132 atoms)	Med	Compact JSON and/or digest large cache-only docs.
`segment.ts` not wired into the live pipeline	—	By design — validated in isolation; integration is a separate epic the harness gates.
Mixed numbering scheme within one doc (FIA C body `C`-numbered vs bare-numbered appendix)	Med	Caused a 22.9k-word "section" blob in the spike; needs per-region detection.
Pipeline not yet parameterized by a config object	Med	Required for Eval-mode A/B; scope of the change is OD10.

Sub-system docs define architectural boundaries and product-level capabilities. This one defines the evaluation + optimization surface every chunking/segmentation/routing/reprocessing decision depends on. If removed, we lose the ability to change the pipeline safely or tune it toward the cost/quality frontier.
