Pipeline Eval Harness — v1 Requirements

The scope contract for the first build of the Pipeline Evaluation & Optimization Harness. Produced by the 2026-06-02 red-team → blue-team pass. Grounds the build; everything not listed as v1 is explicitly deferred or out.

Design doc: sub-systems/pipeline-eval-harness.md · Decisions Log there is authoritative for rationale.

Scope at a glance

v1's job: the leanest harness that can (a) tune the segmenter against ground truth and (b) decide code-vs-LLM for the segmenter-integration epic — without building the full optimization machine. The deterministic layers ship first; the frontier sweep waits until the foundation is proven.

Tier	What	In v1?
L1 boundary eval	section_id-anchored gold + P/R/F1 scorer	✅ MUST
Curate-by-correction	`--emit-gold` authoring	✅ MUST
Baseline	current prod chunking measured first	✅ MUST
Corpus	3 curated docs + hold-out	✅ MUST
L3 retrieval eval	pairwise A/B (not sweep): query gold + recall@k	✅ v1 (SHOULD→in)
Cost axis	reuse `recordedCostUsd` (exact on real runs)	✅ v1 (lightweight)
Pipeline fingerprint	composite version column + stamp on ingest	✅ v1
Config-parameterized pipeline + Eval sweep	the optimization machine	⛔ v2
Modeled cost estimator + calibration	predictive cost for sweeps	⛔ v2
Hierarchical retrieval	experimental config	⛔ v2
L2 region/table eval · reprocessing pipeline · in-app feedback	—	⛔ out (later)

v1 Requirements (build)

R1 — L1 boundary eval. A boundary scorer comparing segmenter output to curated gold. Gold is section_id-anchored ([(section_id, page)]); scoring reports precision / recall / F1 of chunk-start boundaries. Runs in the CI gate (<a few seconds). Acceptance: a segmenter change re-scores all corpus docs; F1 deltas are visible per doc.

R2 — Curate-by-correction. An --emit-gold mode that dumps current segmenter output in the gold format for hand-correction; corrected file is committed as gold. Goldens carry ids/spans/counts only — no body text. Acceptance: adding a new gold doc is a ≤1-command emit + manual fix, no from-scratch authoring.

R3 — Baseline-first. The current production chunking config is measured as frontier point zero; every candidate is reported as a delta vs baseline. Acceptance: "is this better than prod today?" is answerable for boundary F1 and recall@k.

R4 — Corpus. Three curated docs — one clean reg (FIA-B or STEM Competition), one nasty (FIA-C: appendix + tables + mixed scheme), one negative (prose/verse) — plus a tuning hold-out never used to tune. Registered in manifest.json with gold. Acceptance: the 3 docs + hold-out run green; over-fit is observable (hold-out scored separately).

R5 — L3 retrieval eval (pairwise A/B). Query gold = query → answer span(s), span-anchored to section_id, organized by question-type (lookup, vocab-mismatch, cross-reference, scoping, definitional, negative, figure/table). Authored by seed-from-headings (breadth) + ~10 hand-authored hard queries/doc (depth), LLM-generated candidates with human-validated spans; target ~30+/doc. The scorer embeds two chunkings (current vs candidate), runs queries, reports recall@k per question-type, evaluated under both per-doc-isolated and whole-corpus index scopes. No config-sweep machinery — direct A-vs-B. Acceptance: one real code-vs-LLM comparison yields per-type recall@k under both index scopes.

R6 — Cost axis (lightweight). Cost reported from the existing recordedCostUsd summed over a real run (LLM path) plus embedding spend (code path). No predictive model. Acceptance: each A/B comparison emits an exact cost number alongside quality.

R7 — Pipeline version fingerprint. A composite per-document fingerprint (parse·structure·segment·extractor·embedder·routing-config-hash) added as a DB column and stamped on ingest. (Reprocessing consumption of it is v2+; stamping now is cheap future-proofing.) Acceptance: every newly-ingested doc carries a fingerprint that distinguishes pipeline configs.

Policies & process (adopt now — ~no build)

PP1 — No re-bill on quality reprocessing. When reprocessing lands (v2+), a quality-driven re-chunk grandfathers the customer's billed chunk count; we absorb the recount. (D18 interaction.)
PP2 — Eval gates pipeline-logic merges. Any change to chunking / segmentation / routing requires an Eval run (baseline-delta + floor check) before merge; owner = whoever changes the pipeline.
PP3 — Probe-confidence safe fallback. The probe emits a confidence; low-confidence docs route to the safe LLM path and are flagged for the corpus. (Small router change; lands with the segmenter-integration epic, governed by this contract.)

Quality bar (v1)

Floor = no regression vs the prod baseline, per question-type, plus a target lift to be set once the baseline is measured (recalibrated with beta feedback). Operating rule: minimize cost subject to the floor — the knee is a heuristic, the floor is the decision.
recall@k figures are directional until the corpus/query set grows; v1 proves the mechanics + the code-vs-LLM signal, not a final frontier.

Explicitly deferred (v2)

Config-parameterized pipeline + Eval-mode frontier sweep; hash-cached embedding sweep infra; modeled cost estimator + calibration (error band); hierarchical (small-to-big) retrieval as an experimental config. v2 is the "optimization machine"; it's built once v1's deterministic foundation + pairwise arbiter prove which knobs matter.

Out of scope (later, when warranted)

L2 region/table detection eval — the segmenter doesn't handle tables yet; figures are covered by #49.
Reprocessing pipeline (targeted + incremental) — needs real customers + a shipped upgrade; pre-beta.
In-app quality feedback → corpus — needs consent infra; later feature.
Golden compaction — opportunistic tech-debt fix.

Dependencies & sequencing

Builds on (shipped): segment.ts (probe + segmenter), segment.test.ts (L0 snapshot + assertion harness), the corpus manifest + synthetic fixtures.
Gates: the segmenter-integration epic (Tier-1 better-atoms first) — it cannot merge until its boundary F1 ≥ baseline and the pairwise L3 shows no retrieval regression (PP2).
Sequencing: R1–R4 (deterministic foundation) → validate Tier-1 integration → R5–R6 (pairwise arbiter) → decide Tier-2 (code-driven chunking) → v2 optimization machine.

Carried-open items (for the integration epic / blue-team follow-up)

OD2 (span-anchor serialization within a section), OD3 (hold-out size + rotation), OD6 (answer-judge cadence), OD7 (corpus copyright/compaction at scale), OD10 (pipeline config-surface contents), OD12 (feedback consent model).

Pipeline Eval Harness — v1 Requirements#

Scope at a glance#

v1 Requirements (build)#

Policies & process (adopt now — ~no build)#

Quality bar (v1)#

Explicitly deferred (v2)#

Out of scope (later, when warranted)#

Dependencies & sequencing#

Carried-open items (for the integration epic / blue-team follow-up)#

Review