Pipeline Eval Harness — v1 Requirements
The scope contract for the first build of the Pipeline Evaluation & Optimization Harness. Produced by the 2026-06-02 red-team → blue-team pass. Grounds the build; everything not listed as v1 is explicitly deferred or out.
Design doc: sub-systems/pipeline-eval-harness.md · Decisions Log there is authoritative for rationale.
Scope at a glance
v1's job: the leanest harness that can (a) tune the segmenter against ground truth and (b) decide code-vs-LLM for the segmenter-integration epic — without building the full optimization machine. The deterministic layers ship first; the frontier sweep waits until the foundation is proven.
| Tier | What | In v1? |
|---|---|---|
| L1 boundary eval | section_id-anchored gold + P/R/F1 scorer | ✅ MUST |
| Curate-by-correction | --emit-gold authoring | ✅ MUST |
| Baseline | current prod chunking measured first | ✅ MUST |
| Corpus | 3 curated docs + hold-out | ✅ MUST |
| L3 retrieval eval | pairwise A/B (not sweep): query gold + recall@k | ✅ v1 (SHOULD→in) |
| Cost axis | reuse recordedCostUsd (exact on real runs) | ✅ v1 (lightweight) |
| Pipeline fingerprint | composite version column + stamp on ingest | ✅ v1 |
| Config-parameterized pipeline + Eval sweep | the optimization machine | ⛔ v2 |
| Modeled cost estimator + calibration | predictive cost for sweeps | ⛔ v2 |
| Hierarchical retrieval | experimental config | ⛔ v2 |
| L2 region/table eval · reprocessing pipeline · in-app feedback | — | ⛔ out (later) |
v1 Requirements (build)
R1 — L1 boundary eval. A boundary scorer comparing segmenter output to curated gold. Gold is section_id-anchored ([(section_id, page)]); scoring reports precision / recall / F1 of chunk-start boundaries. Runs in the CI gate (<a few seconds).
Acceptance: a segmenter change re-scores all corpus docs; F1 deltas are visible per doc.
R2 — Curate-by-correction. An --emit-gold mode that dumps current segmenter output in the gold format for hand-correction; corrected file is committed as gold. Goldens carry ids/spans/counts only — no body text.
Acceptance: adding a new gold doc is a ≤1-command emit + manual fix, no from-scratch authoring.
R3 — Baseline-first. The current production chunking config is measured as frontier point zero; every candidate is reported as a delta vs baseline. Acceptance: "is this better than prod today?" is answerable for boundary F1 and recall@k.
R4 — Corpus. Three curated docs — one clean reg (FIA-B or STEM Competition), one nasty (FIA-C: appendix + tables + mixed scheme), one negative (prose/verse) — plus a tuning hold-out never used to tune. Registered in manifest.json with gold.
Acceptance: the 3 docs + hold-out run green; over-fit is observable (hold-out scored separately).
R5 — L3 retrieval eval (pairwise A/B). Query gold = query → answer span(s), span-anchored to section_id, organized by question-type (lookup, vocab-mismatch, cross-reference, scoping, definitional, negative, figure/table). Authored by seed-from-headings (breadth) + ~10 hand-authored hard queries/doc (depth), LLM-generated candidates with human-validated spans; target ~30+/doc. The scorer embeds two chunkings (current vs candidate), runs queries, reports recall@k per question-type, evaluated under both per-doc-isolated and whole-corpus index scopes. No config-sweep machinery — direct A-vs-B.
Acceptance: one real code-vs-LLM comparison yields per-type recall@k under both index scopes.
R6 — Cost axis (lightweight). Cost reported from the existing recordedCostUsd summed over a real run (LLM path) plus embedding spend (code path). No predictive model.
Acceptance: each A/B comparison emits an exact cost number alongside quality.
R7 — Pipeline version fingerprint. A composite per-document fingerprint (parse·structure·segment·extractor·embedder·routing-config-hash) added as a DB column and stamped on ingest. (Reprocessing consumption of it is v2+; stamping now is cheap future-proofing.)
Acceptance: every newly-ingested doc carries a fingerprint that distinguishes pipeline configs.
Policies & process (adopt now — ~no build)
- PP1 — No re-bill on quality reprocessing. When reprocessing lands (v2+), a quality-driven re-chunk grandfathers the customer's billed chunk count; we absorb the recount. (D18 interaction.)
- PP2 — Eval gates pipeline-logic merges. Any change to chunking / segmentation / routing requires an Eval run (baseline-delta + floor check) before merge; owner = whoever changes the pipeline.
- PP3 — Probe-confidence safe fallback. The probe emits a confidence; low-confidence docs route to the safe LLM path and are flagged for the corpus. (Small router change; lands with the segmenter-integration epic, governed by this contract.)
Quality bar (v1)
- Floor = no regression vs the prod baseline, per question-type, plus a target lift to be set once the baseline is measured (recalibrated with beta feedback). Operating rule: minimize cost subject to the floor — the knee is a heuristic, the floor is the decision.
- recall@k figures are directional until the corpus/query set grows; v1 proves the mechanics + the code-vs-LLM signal, not a final frontier.
Explicitly deferred (v2)
Config-parameterized pipeline + Eval-mode frontier sweep; hash-cached embedding sweep infra; modeled cost estimator + calibration (error band); hierarchical (small-to-big) retrieval as an experimental config. v2 is the "optimization machine"; it's built once v1's deterministic foundation + pairwise arbiter prove which knobs matter.
Out of scope (later, when warranted)
- L2 region/table detection eval — the segmenter doesn't handle tables yet; figures are covered by #49.
- Reprocessing pipeline (targeted + incremental) — needs real customers + a shipped upgrade; pre-beta.
- In-app quality feedback → corpus — needs consent infra; later feature.
- Golden compaction — opportunistic tech-debt fix.
Dependencies & sequencing
- Builds on (shipped):
segment.ts(probe + segmenter),segment.test.ts(L0 snapshot + assertion harness), the corpus manifest + synthetic fixtures. - Gates: the segmenter-integration epic (Tier-1 better-atoms first) — it cannot merge until its boundary F1 ≥ baseline and the pairwise L3 shows no retrieval regression (PP2).
- Sequencing: R1–R4 (deterministic foundation) → validate Tier-1 integration → R5–R6 (pairwise arbiter) → decide Tier-2 (code-driven chunking) → v2 optimization machine.
Carried-open items (for the integration epic / blue-team follow-up)
OD2 (span-anchor serialization within a section), OD3 (hold-out size + rotation), OD6 (answer-judge cadence), OD7 (corpus copyright/compaction at scale), OD10 (pipeline config-surface contents), OD12 (feedback consent model).