Pipeline Evaluation & Optimization Harness — Sub-system Design Doc
An LLM-free-in-the-gate evaluation + optimization platform over the ingestion pipeline (parse → structure → segment → chunk → embed → retrieve). It scores any pipeline configuration against curated, span-anchored gold-standard output on two axes — quality and cost — and surfaces the Pareto frontier so we can operate at the knee: lowest cost at acceptable quality. It grows along two axes over time (documents and queries), turning unforeseen edge cases into permanent regression guards.
Status: DRAFT v2 — for review + red-team/blue-team. Authored 2026-06-02.
Architecture
Overview
Risks & Constraints
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Gold-standard authoring is a slog and stalls the effort | Med | High | Curate-by-correction (emit current output → hand-fix → commit). Seed query gold from section headings. Start boundary-only, 3 docs. |
| Heuristics over-fit to the corpus (great in-sample, fail in the wild) | Med | Med | Grow corpus with adversarial edge cases; keep a hold-out set never used for tuning; score per doc-class and per question-type, not just aggregate. |
| Optimizer cheats into cheap-but-bad configs | Med | High | The objective is min-cost subject to a hard quality floor (per question-type). No floor → no optimization. |
| Reprocessing cost when an upgrade fixes quality for existing customers | High | Med | Versioned + targeted + incremental reprocessing (reuse cached upstream stages); prioritize docs below the floor + paying tiers. |
| Retrieval-eval mistaken as nondeterministic | — | — | Resolved: measure span-level recall@k (deterministic), not LLM-judged answer quality. Pin the embedding-model version. |
| Copyright / repo bloat as the corpus grows | Med | Low | Goldens carry ids/spans/counts only, no body text; source PDFs stay in gitignored cache/; compact golden JSON. |
| In-app feedback corpus uses tenant data without rights | Med | High | Explicit opt-in consent; privacy-preserving fallback (aggregate signal + synthesized equivalents) when rights are absent. |
Current Status
| Capability | Status |
|---|---|
| L0 — `segment.ts` probe + section segmenter (classification + atoms) | Shipped (2026-06-02) |
| L0 — `segment.test.ts` golden harness: snapshot + assertions, 12 docs / 4 classes, <1s, in pre-commit gate | Shipped |
| L1 — Boundary-set gold (span-anchored) + boundary P/R/F1 scorer | Planned |
| L2 — Region labels (tables/figures) + detection scorer | Planned |
| L3 — Retrieval gold (query → answer span) + recall@k scorer (embeddings) | Planned |
| Query-type taxonomy + per-type scorecards/floors | Planned |
| Config-parameterized pipeline + A/B frontier sweep (Eval mode) | Planned |
| Composite pipeline-version fingerprint + targeted/incremental reprocessing | Planned |
| In-app quality feedback → corpus (consented) | Planned (later) |
The Story
This session began as extraction-cost work (#49 — gate page images to figure/sparse pages). Diagnosing it pulled us into chunking: on FIA Section C the chunks came out page-sized. Root cause was the structure stage under-segmenting — it keys headings on font size, but FIA/STEM section headers render at the same height as body text, so every line classifies as body and coalesces into one page-sized atom. The real boundary signal is the document's own indentation + numbering, which the pipeline ignored.
A deterministic spike proved we can recover the full section hierarchy (C1–C18, 910+ sections) from geometry alone, and that documents fall into a few structural classes (STRUCTURED / FLAT-VERSE / PROSE) separable by cheap, doc-relative signals (sequentiality, header coverage, indent gap). That produced segment.ts and a golden-corpus harness.
The pivotal realization: the integration design questions — granularity, table/figure handling, routing, code-vs-LLM — can't be answered by argument, only by measurement. Chatting it through, the idea grew from a regression harness into a pipeline optimization platform: parameterize the pipeline, evaluate configs on quality + cost, and operate at the frontier knee. Chunking quality is the Autri wedge (the inspector's whole value is legible, correct extraction); this is the instrument that drives it, never regresses it, and self-improves as documents and real user queries flow in.
What Is This Sub-system?
A development-time evaluation + optimization platform over the ingestion pipeline. It owns a growing corpus of real + synthetic documents, curated span-anchored gold-standard outputs at escalating fidelity (boundaries → regions → retrieval), a query gold set organized by question-type, and the scorers that grade any pipeline configuration on quality and cost. It runs in two modes: a fast deterministic Gate (pre-commit) and an on-demand Eval that sweeps configs and emits the (quality, cost) frontier. It exists as its own layer because every chunking, segmentation, routing, and reprocessing decision downstream depends on it as the source of truth.
Chunking Philosophy: the LLM is a Fallback, not the Engine
The governing principle, settled by reasoning + the spike:
- Deterministic code is the default, and for structured technical docs it is not a compromise — it's more correct. A reg author already encoded the semantic units in the numbering (`C4.1 Minimum mass` is one rule, deliberately). We read that structure; we don't pay an LLM to re-infer it.
- The unit is the authored section/subsection, not the line. "Every line = a chunk" overshoots into fragments: embeddings become underspecified and a single answer smears across chunks (the heading line and the number line separate). The semantic unit — the subsection (regs) or paragraph (prose) — is the target.
- The hierarchy is the asset. The segmenter recovers the whole tree (C3 → C3.5 → C3.5.1), which enables small-to-big retrieval: embed fine for precise matching, return the parent section for complete context. This may dissolve the granularity dilemma entirely (embed at the semantic unit, display at the readable unit).
- Selective, per-region LLM triggering. The router decides `code | llm | vision` per region, not per document. A STRUCTURED doc is mostly code-chunked sections, with the LLM firing only on the spans that earn it. This is what lets cost track quality.
Where the LLM is genuinely irreplaceable — all cases where the document does not declare its own boundaries:
- Unstructured prose (novels, proposals) — ideal chunks span paragraphs in ways only semantics sees.
- Tables & figures — non-linear meaning; code detects the region, the model renders it to retrievable text (overlaps #49 `needs_vision`).
- Edge repair — a giant authored section with no sub-numbering that's really many ideas (rare).
Where it is not needed (often mistaken as LLM-only): content inside one authored section stays together for free; syntactic continuation ("next paragraph starts mid-sentence, lowercase, no terminal punctuation") is a deterministic merge the pipeline already does. The LLM is reserved for semantic grouping in unstructured text — a far smaller surface than it first appears. Even then, the LLM decides boundaries; code does the mechanics.
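The syntactic-continuation merge mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual rule: the `Paragraph` shape and the `isContinuation` / `mergeContinuations` names are hypothetical, and the set of characters treated as terminal punctuation is a guess.

```typescript
// Hypothetical paragraph shape; the real pipeline's atom type may differ.
interface Paragraph {
  text: string;
  page: number;
}

// True when `next` reads as the continuation of `prev`:
// prev has no terminal punctuation and next starts with a lowercase letter.
function isContinuation(prev: Paragraph, next: Paragraph): boolean {
  const prevText = prev.text.replace(/\s+$/, "");
  const nextText = next.text.replace(/^\s+/, "");
  if (prevText.length === 0 || nextText.length === 0) return false;
  // Assumed terminal set: . ! ? : ; optionally followed by a closing quote or bracket.
  const endsSentence = /[.!?:;]["')\]]?$/.test(prevText);
  const startsLowercase = /^[a-z]/.test(nextText);
  return !endsSentence && startsLowercase;
}

// Fold each continuation into the paragraph it continues.
function mergeContinuations(paragraphs: Paragraph[]): Paragraph[] {
  const merged: Paragraph[] = [];
  for (const p of paragraphs) {
    const last = merged[merged.length - 1];
    if (last !== undefined && isContinuation(last, p)) {
      // Join with a single space; keep the page where the paragraph began.
      merged[merged.length - 1] = {
        text: `${last.text.replace(/\s+$/, "")} ${p.text.replace(/^\s+/, "")}`,
        page: last.page,
      };
    } else {
      merged.push({ ...p });
    }
  }
  return merged;
}
```

No model call is involved, so the merge is repeatable and can be graded offline like the other deterministic stages.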
The Objective: Minimize Cost Under a Quality Floor
"Maximize profit at high quality" can be made precise: the pipeline controls cost (LLM spend); revenue is set by pricing (D18); so the objective is to minimize cost subject to quality ≥ a floor.
- The quality/cost frontier has a knee — a point past which more spend buys almost no quality. That knee is a data fact, not a taste call (the objective is objective, per Dan). We operate at the knee.
- The quality floor is a hard constraint and must be a number (e.g. retrieval recall@k thresholds, set per question-type). Without it, "minimize cost" trivially → cheapest garbage.
- Asymmetry principle: chunking cost is one-time (ingestion); chunking quality is forever (every query the doc serves). Below the floor, never trade quality for the ingestion saving — the saving is paid once, bad answers recur indefinitely.
The harness's job: draw the frontier and locate the knee; the floor guards the optimizer from cheating.
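A minimal sketch of that decision rule, assuming each evaluated config is summarized by a modeled cost and a per-question-type recall@k. The `ConfigResult` shape, the function names, and the numbers in the usage example are all hypothetical; only the rule itself (cheapest config that clears every floor) comes from the text above.

```typescript
// Hypothetical summary of one pipeline config evaluated over the corpus.
interface ConfigResult {
  name: string;
  costUsd: number;                      // modeled ingestion cost
  recallByType: Record<string, number>; // recall@k per question-type
}

type Floors = Record<string, number>;

// A config is feasible only if every floored question-type meets its floor.
function meetsFloors(result: ConfigResult, floors: Floors): boolean {
  for (const type of Object.keys(floors)) {
    const floor = floors[type] ?? 0;
    const recall = result.recallByType[type] ?? 0; // an unmeasured type counts as recall 0
    if (recall < floor) return false;
  }
  return true;
}

// Operating rule: the cheapest feasible config, or null when nothing clears the floor.
function chooseOperatingConfig(results: ConfigResult[], floors: Floors): ConfigResult | null {
  let best: ConfigResult | null = null;
  for (const r of results) {
    if (!meetsFloors(r, floors)) continue;
    if (best === null || r.costUsd < best.costUsd) best = r;
  }
  return best;
}

// Usage with made-up numbers, for illustration only.
const floors: Floors = { lookup: 0.95, "cross-reference": 0.8 };
const picked = chooseOperatingConfig(
  [
    { name: "all-llm", costUsd: 4.0, recallByType: { lookup: 0.97, "cross-reference": 0.86 } },
    { name: "hybrid", costUsd: 0.9, recallByType: { lookup: 0.96, "cross-reference": 0.82 } },
    { name: "code-only", costUsd: 0.1, recallByType: { lookup: 0.96, "cross-reference": 0.61 } },
  ],
  floors
);
// picked is the "hybrid" entry: "code-only" is cheaper but fails the cross-reference floor.
```

Returning `null` when no config is feasible mirrors "no floor, no optimization": the optimizer has nothing to ship rather than a cheap config that fails.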
Architecture Diagram
┌──────── configurable pipeline (the thing under test) ─────────┐
source ──▶ parse ─▶ structure ─▶ [probe] ─▶ router(config) ─per region─▶ { code | llm | vision } ─▶ chunks ─▶ embed
└───────────────────────────────────────────────────────────────┘ │ scored against ▼
┌──────────────────── Pipeline Eval & Optimization Harness ─────────────────────────────────────────────┐
│ CORPUS GOLD (span-anchored) SCORERS MODES │
│ ├ synthetic (CI) ├ L1 boundary sets ├ boundary P/R/F1 ├ GATE (CI, <1s, det.) │
│ └ real (cache) ├ L2 region labels ├ region detection │ boundaries+regions+route│
│ + query gold └ L3 query → answer span ├ retrieval recall@k └ EVAL (on-demand) │
│ (by question-type) └ cost model/meter config sweep → (Q,$) ▷ frontier
└───────────────────────────────────────────────────────────────────────────────────────────────────────┘
System Boundary
Inside: the corpus + manifest, all gold fixtures (boundary / region / query), the scorers, the cost model, the config-sweep runner, both run modes, and reporting.
Outside: the pipeline code itself — the harness grades it, doesn't own it. The production chunk schema, inspector, and retrieval live outside; the harness asserts against their contracts. The pipeline must expose a config surface (below) so the harness can drive it — that's the one requirement the harness imposes outward.
Key Interfaces
| Interface | Type | Consumers |
|---|---|---|
| `pipeline(config, doc) → chunks` (parameterized, pluggable) | Function | Harness Eval mode; A/B sweeps |
| `probeAndSegment(pages) → {metrics, docClass, atoms}` | Function (`segment.ts`) | Harness; the live router |
| Span-anchored gold fixtures (boundary / region / query) | Fixture | L1–L3 scorers |
| `score(actual, gold) → metrics` + `costModel(config, doc) → usd` | Function | Gate + Eval suites |
| Composite pipeline fingerprint (per document) | DB column | Reprocessing targeting; staleness reporting |
| `--emit-gold` / `UPDATE_GOLDENS=1` | CLI mode | Curate-by-correction authoring |
The Gold-Standard Model (span-anchored)
The harness today validates stability (snapshots) + a few assertions. It does not yet encode what the right output is. Gold standard is that ground truth — and the critical design move is that gold is anchored to document positions (section_id / char-span), never to chunk IDs. Because chunk IDs differ per config, span-anchoring is what makes the same gold score any chunking — the precondition for A/B-ing configs apples-to-apples.
Three escalating layers:
- Boundary set (cheap, deterministic) — `[(section_id, page, anchor)]`: where a chunk should begin. Scored by boundary P/R/F1. Cheap because the segmenter is ~90% right on STRUCTURED docs — we curate its output, not author from scratch. Answers granularity (merge/split rules).
- Region labels (medium, deterministic) — `[(page, bbox, type)]` for tables/figures. Reuses #49 `needs_vision` as a candidate signal. Answers table/figure handling.
- Retrieval gold (deepest; embeddings, no grouping-LLM) — `query → answer span(s)`. A retrieval is correct if any top-k chunk overlaps the gold span. Answers "is code as good as LLM" + granularity-for-retrieval — the real arbiter.
Authoring principle — curate-by-correction: --emit-gold dumps current output in gold format; a human fixes the known-wrong spots; commit. Marginal cost of a new gold doc stays small; the corpus grows organically as edge cases surface.
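As a sketch of the L1 layer, boundary P/R/F1 over span-anchored gold could look like the following. The `Boundary` field names and the matching rule are assumptions: here a predicted boundary counts as correct when its section_id appears in the gold set, which may be stricter or looser than the scorer eventually built.

```typescript
// Hypothetical boundary record mirroring (section_id, page, anchor).
interface Boundary {
  sectionId: string;
  page: number;
  anchor: string;
}

interface PRF1 {
  precision: number;
  recall: number;
  f1: number;
}

// Assumed matching rule: boundaries match on sectionId only, so the score
// does not depend on how any particular config numbers its chunks.
function scoreBoundaries(actual: Boundary[], gold: Boundary[]): PRF1 {
  const goldIds = new Set<string>();
  for (const g of gold) goldIds.add(g.sectionId);

  const seen = new Set<string>();
  let truePositives = 0;
  for (const a of actual) {
    if (seen.has(a.sectionId)) continue; // count duplicate predictions once
    seen.add(a.sectionId);
    if (goldIds.has(a.sectionId)) truePositives += 1;
  }

  const precision = seen.size === 0 ? 0 : truePositives / seen.size;
  const recall = goldIds.size === 0 ? 0 : truePositives / goldIds.size;
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```

Under curate-by-correction, `gold` would start life as a copy of `actual` that a human then fixes, so a freshly emitted document scores 1.0 until the corrections land.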
Measuring Quality (and why it stays deterministic)
The determinism worry resolves by separating two things often conflated:
| Layer | Measures | Deterministic? | When |
|---|---|---|---|
| Boundary F1 (vs span gold) | structural correctness (proxy) | ✅ fully | every commit (Gate) |
| Retrieval recall@k (vs answer-span gold) | true retrieval quality | ✅ at span/ID level — same model + text → same top-k | periodic / per-config sweep |
| Answer-quality (LLM judge) | end-to-end usefulness | ❌ | occasional spot-check, never a gate |
The key: measure whether the right chunk was retrieved (deterministic), not whether the generated answer is good (LLM-judged, nondeterministic). Pin the embedding-model version; float noise doesn't flip top-k. So the real quality arbiter (recall@k) drives the frontier without an LLM in the loop.
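A minimal sketch of span-level recall@k, assuming retrieved chunks and gold answers are both expressed as document positions. The `Span` encoding (char offsets within a section) is an assumption for illustration, since span addressing is still open (OD2); the type and function names are hypothetical.

```typescript
// A position in the source document. Char offsets within a section are an
// assumed encoding for this sketch; the real addressing scheme is undecided.
interface Span {
  sectionId: string;
  start: number; // inclusive char offset within the section
  end: number;   // exclusive char offset within the section
}

// One gold query with the spans of its retrieved chunks, best match first.
interface RetrievalCase {
  query: string;
  goldSpans: Span[];
  rankedChunkSpans: Span[];
}

function overlaps(a: Span, b: Span): boolean {
  return a.sectionId === b.sectionId && a.start < b.end && b.start < a.end;
}

// Correct if any of the top-k chunks overlaps any gold answer span.
function isHit(c: RetrievalCase, k: number): boolean {
  return c.rankedChunkSpans
    .slice(0, k)
    .some((chunk) => c.goldSpans.some((gold) => overlaps(chunk, gold)));
}

// Fraction of gold queries answered within the top k.
function recallAtK(cases: RetrievalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) => isHit(c, k)).length;
  return hits / cases.length;
}
```

Nothing here calls a model: given a fixed ranking, the score is a pure function of spans, which is what keeps this layer deterministic.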
Query Gold & the Question-Type Taxonomy
Quality is meaningless without queries. We build the query set two ways and organize it by question-type — which turns one fuzzy number into an actionable scorecard:
| Question type | Example | Stresses |
|---|---|---|
| Lookup / factoid | "What's the minimum mass?" | basic recall + granularity |
| Vocabulary-mismatch | "How heavy must the car be?" | embedding robustness (favors small+context) |
| Cross-reference | "Mass limit and the penalty for it" | multi-span → argues for hierarchical retrieval |
| Scoping | "List all aerodynamic component rules" | needs the C3 subtree → tests the hierarchy |
| Definitional | "What counts as bodywork?" | are definitions retrievably chunked |
| Negative / out-of-scope | "Tire pressure rule?" (absent) | precision — not retrieving false relevance |
| Figure/table | "Dimensions in Figure 3?" | the vision path |
- Seed-from-headings (auto, breadth): every section → a query whose gold span is that section. Free coverage; proves recall.
- Hand-authored typed questions (depth, ~5–10/doc): the vocab-mismatch and cross-reference cases that actually discriminate between configs.
- Per-type scoring → per-type floors. Report recall by type (lookups must be ≥0.95; cross-reference ≥0.8). Weak cross-reference → adopt hierarchical; weak vocab-mismatch → contextual prefixes. Far more actionable than one aggregate.
- Queries are the second growth axis. In prod, a real query that retrieves badly gets categorized, span-labeled, and added as a permanent gold case (possibly a new type). Documents grow the corpus structurally; queries grow it for retrieval.
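The per-type scorecard described above can be sketched as a small aggregation over query outcomes. The `QueryOutcome` and `TypeScore` shapes are hypothetical, and treating a type without a floor as report-only is an assumption.

```typescript
// One gold query after retrieval: its question-type and whether top-k hit the gold span.
interface QueryOutcome {
  type: string;
  hit: boolean;
}

interface TypeScore {
  type: string;
  queries: number;
  recall: number;
  floor: number | null; // null when no floor is set for this type
  pass: boolean;
}

// Group outcomes by question-type and compare each type's recall to its floor.
function buildScorecard(
  outcomes: QueryOutcome[],
  floors: Record<string, number>
): TypeScore[] {
  const order: string[] = []; // keep first-seen order for stable reports
  const tally = new Map<string, { hits: number; queries: number }>();
  for (const o of outcomes) {
    let t = tally.get(o.type);
    if (t === undefined) {
      t = { hits: 0, queries: 0 };
      tally.set(o.type, t);
      order.push(o.type);
    }
    t.queries += 1;
    if (o.hit) t.hits += 1;
  }
  return order.map((type) => {
    const t = tally.get(type) ?? { hits: 0, queries: 0 };
    const recall = t.queries === 0 ? 0 : t.hits / t.queries;
    const floor = floors[type] ?? null;
    // Assumption: a type with no floor is reported but never fails the run.
    return { type, queries: t.queries, recall, floor, pass: floor === null || recall >= floor };
  });
}
```

A failing row names the question-type, which is what points at the remedy (hierarchical retrieval for cross-reference, contextual prefixes for vocabulary-mismatch) instead of a single aggregate dip.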
Pipeline Configurability & A/B (Eval mode)
For "import different pipeline versions and compare them," the pipeline must be parameterized by a config object that fully determines behavior — routing thresholds, chunk granularity, hierarchical-vs-flat, contextual-prefix on/off, embedding model, chunk-size targets. Then Eval mode sweeps the config matrix over the corpus, scoring each on quality (recall@k, per-type) and cost (modeled deterministically: LLM-routed regions × token estimate × price, plus #49 vision spend), and emits the (quality, cost) frontier. We read off the knee and ship that config. This is the optimization machine: same machinery as the gate, bigger ambition.
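To make the sweep concrete, here is a hedged sketch of the modeled cost and the frontier extraction. The `RegionRoute`, `Prices`, and `SweepPoint` shapes are hypothetical, and pricing vision as a flat amount per routed region is an assumption; the text above only fixes the general form (LLM-routed regions × token estimate × price, plus vision spend).

```typescript
// Hypothetical routing decision for one region of a document.
interface RegionRoute {
  route: "code" | "llm" | "vision";
  estTokens: number; // token estimate, used only for LLM-routed regions
}

// Assumed price inputs; real prices would come from the calibrated cost model.
interface Prices {
  llmUsdPerToken: number;
  visionUsdPerRegion: number;
}

// Deterministic modeled cost: no model is called to produce this number.
function modelCostUsd(regions: RegionRoute[], prices: Prices): number {
  let usd = 0;
  for (const r of regions) {
    if (r.route === "llm") usd += r.estTokens * prices.llmUsdPerToken;
    else if (r.route === "vision") usd += prices.visionUsdPerRegion;
    // "code" regions are treated as free in this sketch.
  }
  return usd;
}

// One evaluated config: a single quality number and its modeled cost.
interface SweepPoint {
  name: string;
  quality: number;
  costUsd: number;
}

// Keep the configs that no other config beats or ties on both axes
// while being strictly better on at least one; sort cheapest first.
function paretoFrontier(points: SweepPoint[]): SweepPoint[] {
  const frontier = points.filter(
    (p) =>
      !points.some(
        (q) =>
          q.costUsd <= p.costUsd &&
          q.quality >= p.quality &&
          (q.costUsd < p.costUsd || q.quality > p.quality)
      )
  );
  return frontier.sort((a, b) => a.costUsd - b.costUsd);
}
```

The pairwise check is quadratic in the number of configs, which should be acceptable for a bounded config shortlist; a sort-and-scan version would do the same job for larger sweeps.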
Versioning & Reprocessing
Current state (verified): documents.extractor_version and chunks.{extractor_version, embedder_version} are stamped (<prompt-version>/<model>). STRUCTURE_VERSION lives only in cached JSON, not the DB. There is no composite pipeline fingerprint and no segment_version — so we cannot today answer "which docs were ingested under config X."
Design addition — composite pipeline fingerprint per document: parse·structure·segment·extractor·embedder·routing-config-hash. Then "who needs reprocessing after fix X" = WHERE fingerprint predates X AND X's scope applies.
Reprocessing is targeted + incremental, never blanket:
- Targeted — only docs the upgrade actually affects (structured-chunking fix → only STRUCTURED docs; figure fix → only docs with figures).
- Incremental — reuse cached upstream stages (segmenter-only change → skip parse/OCR, reuse `words.json`, re-run segment→chunk→embed). Dovetails with the Incremental Re-Ingestion sub-system; stacks with #49/#50.
- Prioritized — docs currently below the quality floor first, paying tiers first. The floor makes prioritization objective.
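A sketch of the composite fingerprint and the targeting query, under stated assumptions. The record shapes and function names are hypothetical, the hash is a small non-cryptographic FNV-1a variant chosen only to keep the example dependency-free, and "predates X" is simplified to "differs from the fixed fingerprint".

```typescript
// Hypothetical per-document stage versions plus the routing config in force.
interface StageVersions {
  parse: string;
  structure: string;
  segment: string;
  extractor: string;
  embedder: string;
  routingConfig: Record<string, string | number | boolean>;
}

// 32-bit FNV-1a over UTF-16 code units (non-cryptographic, illustration only).
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16).padStart(8, "0");
}

// parse·structure·segment·extractor·embedder·routing-config-hash
function fingerprint(v: StageVersions): string {
  // Sort keys so logically equal configs hash identically.
  const canonical = Object.keys(v.routingConfig)
    .sort()
    .map((k) => `${k}=${String(v.routingConfig[k])}`)
    .join("&");
  return [v.parse, v.structure, v.segment, v.extractor, v.embedder, fnv1a(canonical)].join("·");
}

interface DocRecord {
  id: string;
  docClass: string;
  fingerprint: string;
  belowFloor: boolean;
  payingTier: boolean;
}

interface Upgrade {
  fixedFingerprint: string;                // fingerprint of the pipeline that carries the fix
  appliesTo: (doc: DocRecord) => boolean;  // the fix's scope
}

function priority(d: DocRecord): number {
  return (d.belowFloor ? 2 : 0) + (d.payingTier ? 1 : 0);
}

// Targeted: only stale docs inside the fix's scope. Prioritized: below-floor first, then paying.
function reprocessingQueue(docs: DocRecord[], upgrade: Upgrade): DocRecord[] {
  return docs
    .filter((d) => d.fingerprint !== upgrade.fixedFingerprint && upgrade.appliesTo(d))
    .sort((a, b) => priority(b) - priority(a));
}
```

Because each stage version is a separate field of the fingerprint, comparing two fingerprints field by field also shows which cached upstream stages are still reusable.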
In-App Quality Feedback → Corpus (later)
The prod-sourced query-growth loop: a 👎 on a chat answer captures (query, retrieved chunks, doc) → triage → tag with a question-type → add as a permanent gold case. Constraints: explicit consent to use tenant doc + query for pipeline development (D13 territory); a privacy-preserving fallback — even without doc rights, the aggregate signal (👎 + which chunks missed) is useful telemetry and we can synthesize equivalent gold without the sensitive content. Later feature; it's what makes the optimization machine self-improving in production.
Eval Dimensions → Design Questions
| Design question | Measured via | LLM-free? |
|---|---|---|
| Granularity / tiny-merge / huge-split | L1 boundary F1 | ✅ |
| Tables & figures inside structured docs | L2 region detection | ✅ |
| Chunk-record validity (bbox in-bounds, section_id) | schema validators | ✅ |
| Routing / WEAK-band threshold | docClass accuracy over labeled corpus | ✅ |
| Code-vs-LLM; how much the code path replaces | L3 recall@k: code vs LLM chunks, same span gold | ✅ (embeddings) |
| Hierarchical vs flat chunking | L3 recall@k by question-type (esp. cross-reference/scoping) | ✅ (embeddings) |
| Operating config | Eval-mode frontier knee under per-type floors | ✅ (cost modeled) |
Open Decisions (red-team / blue-team targets)
Resolved in the 2026-06-02 red-team (see Decisions Log): gold volume/authoring (OD1 partial), gold-rot-vs-versions, index scope, reprocessing billing (OD11), baseline-first, cost-model calibration, quality-floor calibration (OD5), hierarchical-as-experiment (OD9), sweep embedding cost, no-knee operating rule, probe-confidence fallback, Eval-gates-merge (OD8).
Still open — for the integration epic / blue-team:
- OD2 — Span-anchor serialization. Anchor basis is settled (section_id). Open: how to address a span within a section (char offset? paragraph index?) for retrieval gold, so it's precise without being brittle to re-parsing.
- OD6 — Answer-judge cadence. The LLM answer-quality spot-check is explicitly non-gate; open: how often it runs and what triggers it (pre-ship? per-release?).
- OD7 — Corpus growth + copyright at scale. Golden compaction (in tech debt), digesting large cache-only docs, and the policy for third-party-copyrighted docs as the corpus grows.
- OD10 — Pipeline config surface. What fields the config object exposes, and how invasive parameterizing the existing pipeline is to expose them (scope estimate needed before Eval mode is real).
- OD12 — Feedback-capture consent model. The opt-in UX for using a tenant's doc+query in the corpus, and the privacy-preserving fallback (aggregate signal + synthesized equivalents) when rights are absent. Later feature.
- OD3 — Hold-out discipline. First-curate docs are settled (clean reg / nasty FIA-C / negative prose). Open: the size + rotation of the tuning hold-out set so we measure over-fit, not just in-sample fit.
Related Epics
| Epic | Doc | Status | Summary |
|---|---|---|---|
| Segmenter integration (hybrid router) | (to be written) | Planned | Wire segment.ts into the live pipeline (per-region routing); harness gates it. |
| Incremental Re-Ingestion | sub-systems/incremental-re-ingestion.md | Existing | The reuse-cached-stages substrate targeted/incremental reprocessing builds on. |
| #49 image-gating | GitHub #49 | Shipped | Per-page needs_vision; the figure signal L2 reuses. |
| #50 cheaper-model spike | GitHub #50 | Planned | Code-driven chunking shrinks the LLM surface; compounds the cost win. |
Cross-Cutting Concerns
| Concern | How This Sub-system Is Affected |
|---|---|
| Cost (D16/D18) | Code-driven chunking removes most grouping-LLM spend; the harness proves quality holds before banking it, and prices the frontier. |
| Multi-tenancy / privacy (D13) | Gold is dev-time; real copyrighted docs stay local. In-app feedback needs consent + a privacy-preserving path. |
| LLM-does-semantics / code-does-mechanics | Made testable: everything left of the grouping LLM is deterministic → gradeable offline. |
| Incremental re-ingestion | The reuse-cached-stages mechanism is what makes reprocessing affordable. |
| Local CI/CD for agentic coding | Gate mode runs in the existing ci.sh pre-commit gate; no cloud-CI dependency. |
Decisions Log
| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-06-02 | Harness is LLM-free in the Gate; deterministic stages graded offline | Fast, reproducible, pre-commit-safe | Always include an LLM eval (too slow/costly/nondeterministic for a gate) |
| 2026-06-02 | Code chunking is the default; the LLM is a per-region fallback | Structured authors already encode semantic units in numbering; code is more correct + ~free there | LLM-grouping everything (cost, nondeterminism, no better on structured) |
| 2026-06-02 | Objective = minimize cost subject to a per-type quality floor; operate at the frontier knee | Pipeline controls cost; quality floor prevents cheating into garbage; knee is objective | Pure cost-min (garbage); pure quality-max (unbounded cost) |
| 2026-06-02 | Gold is span-anchored, not chunk-ID-anchored | Lets the same gold score any chunking → config A/B is apples-to-apples | Chunk-ID gold (breaks across configs) |
| 2026-06-02 | Quality measured as span-level retrieval recall@k, not LLM-judged answers | Deterministic and truthful; no LLM in the gate | Answer-judge (nondeterministic); boundary-only (proxy, not truth) |
| 2026-06-02 | Two expectation layers: snapshot (stability) + assertions (correctness); baseline ≠ blessed | Snapshots catch drift; assertions claim only known-correct | Snapshot-only (blesses current); assertion-only (misses drift) |
| 2026-06-02 | Structurability judged by doc-relative signals, never absolute thresholds | Fixed margins broke when a doc's layout shifted right (STEM Competition) | Absolute x-thresholds (fragile) |
| 2026-06-02 (RT) | Gold query volume: ~30+/doc across types via seed-from-headings + LLM-generated hard queries with human-validated answer spans | recall@k on ~10 queries is noise; need volume without hand-authoring everything | Hand-author all (too slow → caps corpus); tiny set (unreliable frontier) |
| 2026-06-02 (RT) | Gold anchored to section_id; on doc re-issue, revalidate only changed sections (a diff, not a re-curation) | FIA docs re-issue monthly (Iss 06 → 18); text/page anchors rot | Frozen per-issue snapshots (multiplies load); accept rot (frontier lies) |
| 2026-06-02 (RT) | Eval both per-doc-isolated (clean config A/B) and whole-corpus (realism check), reported as distinct metrics | recall@k depends on index distractor composition | Whole-corpus only (non-comparable history); isolated only (optimistic) |
| 2026-06-02 (RT) | Quality reprocessing never re-bills — grandfather the customer's chunk count; we absorb the recount | Chunk count is the D18 billing axis; never punish customers for our upgrade | Re-bill new count (trust landmine); freeze-at-ingest (accounting drift) |
| 2026-06-02 (RT) | Baseline-first — current prod config is the first frontier point; every config reported as a delta vs it | Without a baseline 'improvement' is unprovable + a regression vs prod could ship unnoticed | Absolute-only metrics |
| 2026-06-02 (RT) | Modeled cost calibrated against recordedCostUsd with a published error band; recalibrate on pricing/model change | An unvalidated 2× error makes the cost axis — the whole point — lie | Modeled-only (unvalidated); always-real (slow + costs LLM spend per config) |
| 2026-06-02 (RT) | Quality floor = 'no regression vs prod baseline' + a per-type lift, recalibrated with beta feedback | Calibrating from your own corpus is circular; prod baseline is a real reference | Absolute target (arbitrary); defer (optimizer left unconstrained) |
| 2026-06-02 (RT) | Hierarchical (small-to-big) retrieval is an experimental config, promoted to core only if it wins on cross-ref/scoping queries | Hierarchy is already recovered → cheap to test; don't commit schema/retrieval/inspector surface before measurement | Core now (premature surface); defer (leaves cross-ref quality on the table) |
| 2026-06-02 (RT) | Config sweeps use hash-cached embeddings (reuse by chunk-text hash) + bounded config/corpus shortlist; full-corpus sweeps occasional | Only chunks that actually changed re-embed; keeps the optimizer cheap to run | Re-embed all per sweep (scales badly); cheap-model proxy (false winners) |
| 2026-06-02 (RT) | Operating rule = minimize cost subject to the per-type floor; the knee is a heuristic, the floor is the decision rule | A frontier may have no clean knee; the floor always yields a well-defined operating point | Manual judgment (subjective); fixed cost budget (arbitrary) |
| 2026-06-02 (RT) | Probe emits a confidence; low-confidence docs route to the safe LLM path and are flagged for the corpus | Confident misroutes ship garbage chunks; this degrades gracefully + feeds the growth loop | Trust the probe (silent bad chunks); run both paths (doubles cost on the ambiguous docs) |
| 2026-06-02 (RT) | An Eval run gates pipeline-logic merges (baseline-delta + floor check); owner = whoever changes the pipeline (resolves OD8) | Makes the optimizer non-skippable rather than a thing someone has to remember | Scheduled sweeps (decoupled from changes); ad-hoc (never runs) |
Known Issues / Tech Debt
| Issue | Severity | Notes |
|---|---|---|
| No composite pipeline fingerprint / `segment_version` in DB | Med | Only `extractor_version` and `embedder_version` are stamped; can't yet target reprocessing by config. Needed before reprocessing is real. |
| Goldens are pretty-printed → noisy multi-thousand-line diffs (FIA C ≈ 1,132 atoms) | Med | Compact JSON and/or digest large cache-only docs. |
| `segment.ts` not wired into the live pipeline | — | By design — validated in isolation; integration is a separate epic the harness gates. |
| Mixed numbering scheme within one doc (FIA C body C-numbered vs bare-numbered appendix) | Med | Caused a 22.9k-word "section" blob in the spike; needs per-region detection. |
| Pipeline not yet parameterized by a config object | Med | Required for Eval-mode A/B; scope of the change is OD10. |
Sub-system docs define architectural boundaries and product-level capabilities. This one defines the evaluation + optimization surface every chunking/segmentation/routing/reprocessing decision depends on. If removed, we lose the ability to change the pipeline safely or tune it toward the cost/quality frontier.