Eval Corpus & DoE Process — Sub-system Design Doc

Draft 2026-06-03. Extends [[pipeline-eval-harness]]. Status: DRAFT — for red/blue-team next session. The harness (segment/boundary/retrieval scorers, frozen baseline) exists and works; this doc is about the corpus it scores against and the process for growing it + running experiments — so pipeline decisions generalize instead of overfitting to one doc.

Architecture

Overview

What Is This Sub-system?

The eval corpus (the set of test documents + their gold) and the Design-of-Experiments (DoE) process that runs on top of it. The harness answers "did this pipeline change improve quality?"; this sub-system answers "on which documents, and how do we keep adding more without it becoming a hand-authoring slog?"

Two deliverables:

A format × category coverage matrix of test docs (public-domain), deliberately filled to catch overfitting.
A repeatable onboarding pipeline (parse → chunk → semi-automated query-gold generation) + an experiment registry so DoE results accumulate instead of dying in throwaway scripts.

The Story

We hardened structured-doc chunking against a single doc (SRWF26 Competition Regs). It worked — until we added a second structured doc (Tech Regs) and discovered the recall@1 "win" from a reranking strategy was Comp-specific overfitting: it vanished on Tech (which was already at ceiling). One extra doc flipped a conclusion we were about to ship.

That's the thesis of this sub-system: n=1 lies. Every pipeline decision (chunking, routing, retrieval) must be validated across a spread of documents — different formats, different structural categories — or we tune to quirks. The harness made experiments cheap and deterministic; now we need the corpus to be broad and the gold to be cheap to produce, or breadth stays bottlenecked on hand-authoring.

Current Status

Corpus today: boundary gold on ~9 docs (synthetic, FIA A–F, SRWF Comp/Tech, brehob); retrieval query gold on exactly 2 (Comp 22q, Tech 16q — both structured). Prose (novel) + verse (genesis) are embedded but have no query gold.
Categories proven: STRUCTURED (deterministic section-aware chunker, shipped + default-on). PROSE + VERSE semantic-grouping validated by prototype (adjacency-similarity valleys recover real boundaries — creation-days in Genesis, scene/dialogue beats in the novel) but not built or gold-scored.
Gold generation: fully manual today. seed-queries.ts + emit-gold.ts + the curated flag are the skeleton of a semi-automated path; not yet wired into a process.
Experiments: run ad-hoc in throwaway _*.ts scripts; no registry.

Red-team caveat (2026-06-03) — STRUCTURED is not yet validated to our own ≥2-format bar. Comp + Tech are sibling docs: both PDF, both FIA, both SRWF26. The "2nd doc caught the reranking overfit" win is real but weaker than it looks — two docs of the same format and source. STRUCTURED is n≥2 in count but n=1 in format; it is not format-validated until a non-PDF structured doc scores clean. This is why the US Constitution (md) leads the backlog (D-J).

System Boundary

In scope: the corpus manifest, public-domain test docs + their gold artifacts, the gold-generation process, the experiment registry/conventions, the coverage matrix. Out of scope: the scorers themselves ([[pipeline-eval-harness]]) and the production pipeline ([[design]]).

The Two Axes (the core model)

Document format and document category are orthogonal and drive different halves of the pipeline. Conflating them is the main thing to avoid.

Format = how it's encoded → drives PARSING (text + geometry extraction, front of pipeline). PDF (text-layer / image) · docx · xlsx · csv · html · markdown · epub · plaintext.
Category = how it's semantically organized → drives CHUNKING (unit detection, the work we've been doing).
- STRUCTURED — author-declared numbered hierarchy (regs, contracts, the US Constitution). Units read deterministically from numbering. Shipped.
- FLAT-VERSE — flat author-declared units (scripture book/chapter/verse). Boundaries free; units too small → group up to passages. Prototype.
- PROSE — no declared units (novels, essays, articles). Units inferred via semantic grouping. Prototype.
- TABULAR — rows/columns (xlsx, csv). A genuinely new category with its own unit question (row? row-group? sheet?). Unexplored — open question.

An HTML doc may be PROSE (blog) or STRUCTURED (API docs); a PDF may be any category. So the corpus is a matrix, and we fill cells deliberately. A pipeline decision must hold across a row (category, varying format) and a column (format, varying category) before we trust it.

Coverage Matrix (target — fill deliberately)

	PDF	docx	html	xlsx/csv	epub/md
STRUCTURED	✅ SRWF Comp/Tech, FIA	(contract?)	(API docs?)	—	⬜ US Constitution md
FLAT-VERSE	✅ WEB Bible (genesis…)	—	—	—	⬜ 2nd verse doc
PROSE	—	✅ novel	(PD prose — not Wikipedia)	—	⬜ Gutenberg epub (2nd prose)
TABULAR	—	—	—	🔒 deferred (gold-model risk)	—

✅ = have it · ⬜ = priority gap · 🔒 = deferred · (?) = candidate. The point isn't to fill every cell — it's to have ≥2 docs per category across ≥2 formats before declaring a category "solved."

Backlog order (locked 2026-06-03 — D-J):

US Constitution (markdown, STRUCTURED) — warm-up. Doubles as (a) the first non-PDF structured doc → closes the format-variety hole on the one "solved" category, and (b) the first end-to-end run of the semi-automated gold-gen pipeline below.
2nd PROSE + 2nd VERSE doc — bring both prototype-validated categories to n≥2 so they can be built + gold-scored with confidence.
TABULAR — deferred (D-E); may not fit the current gold model at all.

Copyright note (G12): a generic "wiki article" candidate is not public-domain — Wikipedia is CC-BY-SA. Prefer Project Gutenberg / US-gov / synthetic PD sources for the html-prose cell. Gold stores IDs + page only regardless.

Corpus Backlog — Candidate Documents (brainstormed 2026-06-04)

Public-domain / licensing-safe candidates per category, to fill the matrix over time. Licensing rule (G12): Project Gutenberg / US-gov / synthetic only — no Wikipedia (CC-BY-SA). Compounding goal: prefer candidates that also fill an empty format cell. Gold stores IDs + page only regardless of source.

PROSE

✅ SELECTED (eval-driven-pipeline-validation epic, 2nd prose doc): a NASA History Series monograph in HTML (e.g. This New Ocean: A History of Project Mercury, NASA SP-4201) — born-clean HTML (explicit paragraph markup → dodges the indent-collapse bug), continuous narrative-expository prose, US-gov public domain, fact-dense (names/dates/events → strong retrieval questions), modern register, fills the PROSE×html cell. Trim to a few chapters. Continues the existing NASA thread.
9/11 Commission Report (a chapter) — US-gov PD, modern fact-dense long-form prose; within-chapter is continuous prose.
Darwin, On the Origin of Species (Gutenberg HTML/epub, trimmed) — classic science exposition; fills the epub cell; 19th-c. register.
The Federalist Papers (Gutenberg) — cross-reference-rich essays; caveat: light authored structure (numbered papers) muddies the pure-prose label.

FLAT-VERSE (2nd verse doc — deferred from the current epic; backlog)

Tao Te Ching (Legge, 1891) — tiny aphoristic numbered units; stresses "group up to passages" hardest; very different from Genesis narrative.
Quran (an older PD translation — sura:ayah) — large, very different content.
Psalms (WEB/KJV) — poetic verse, distinct from Genesis's narrative-in-verses.

STRUCTURED (already n≥2; these add subtype/format variety, not urgent)

Bitcoin whitepaper (Nakamoto, 2008; MIT-licensed) — academic/technical-paper subtype: numbered sections + equations + figures. Genuinely structured (numbered sections → deterministic boundaries), NOT prose. A nice subtype contrast with the regs/Constitution.
A CFR part or US Code title — deeply numbered hierarchy, US-gov PD; the platonic structured doc.
Geneva Conventions / a treaty — numbered articles.
A contract or model code (docx) — fills the structured×docx cell.

TABULAR — deferred (D-E, gold-model risk). Candidates if/when it's designed: data.gov CSVs, CIA World Factbook. No matrix pressure until the gold model supports it.

Onboarding a New Test Doc (the gold-gen pipeline)

The bottleneck for breadth. Hand-authoring Tech's 16 queries took real grounding effort; that doesn't scale. Target: author → review, not author-from-scratch. Apply [[feedback_llm_does_semantics_code_does_mechanics]]:

Ingest the doc (parse → chunk → embed) under the current pipeline.
LLM proposes (semantic) — a fresh Opus-family sub-agent is handed the doc's sections/content and a fixed list of query buckets, and drafts N queries per bucket + candidate answerSections. Buckets: lookup, vocab-mismatch, cross-reference, scoping, definitional, and negative/out-of-scope (the last with empty answerSections — see precision requirement).
Code validates (mechanical) — every proposed answerSection ID must exist in the doc's sections table + be reachable; drop/flag invalid ones. (Extends seed-queries.ts.)
Human curates by correction (judgment) — the validated draft is presented back to the curator in the AskUserQuestion review format (the same one we red-team with): batched, curator flags anything off, fixes wrong answer-sections, cuts bad queries, then sets curated: true. Review, not authorship.

A different model family or a blind independent answer-key was considered and not required for now (D-D) — an Opus sub-agent + human-flag-in-review is the agreed rigor bar. Revisit only if a precision/quality problem traces back to gold bias.

Precision requirement (locked, D-D buckets): every onboarded doc's gold MUST include N negative / out-of-scope queries (empty answerSections). Recall-only gold is gameable — recall@10 = 1.0 is trivially won by returning everything; negatives are how we measure that the pipeline doesn't surface the wrong chunk. The harness already supports empty answerSections.

Gold-revalidation step (locked, D-G): gold answer-keys are section_ids. STRUCTURED/VERSE ids are author-declared (stable across re-chunking); PROSE ids are pipeline-generated and will break when the chunker changes. So onboarding is not one-and-done: on any chunker change, re-run code-validation (step 3) of every gold answerSection against the new sections table and flag stale ids for re-curation. This makes gold-rot visible and cheap instead of silent.

This keeps the human on the load-bearing judgment (is this a fair query? is the expected section right?) and offloads the mechanical drafting. The emit-gold curate-by-correction pattern already does this for boundary gold; generalize it to query gold.

Experiment Registry (DoE — steps 4–7)

This session's experiments (retrieval-strategy bake-off, semantic-grouping prototype) lived in throwaway scripts and the conclusions in chat. Formalize:

A lightweight registry (one row per experiment: hypothesis, strategy, docs scored, result, verdict) so results accumulate and we don't re-run settled questions ("reranking is a dead end" should be recorded once).
Convention: an experiment scores across a category row of the matrix, not one doc (the anti-overfit rule, made procedural).

Effect-size floor (locked, D-F). A verdict may not rest on an aggregate-only delta. Two requirements before a result counts:

The metric delta must clear a minimum-effect threshold — small deltas are noise (the reranking "win" was +0.016 MRR ≈ a fraction of one query over 16).
The experiment output must show which individual queries changed rank, not just the aggregate. A verdict with no moved queries to point at is noise.

(Bootstrap confidence intervals were considered and deferred — more machinery than the min-delta + moved-queries rule, which catches the same trap cheaply.)

Verdict snapshot-stamping (locked, D-H). Every registry row records the corpus it was valid against (docs + git SHA). A verdict ("reranking is a dead end") is corpus-relative — the same mechanism that produced it (adding a doc) can flip it back when prose/verse/tabular land. When the corpus grows past a verdict's snapshot, the verdict auto-flags as 'revisit' rather than being trusted forever.

Full-corpus sweep + re-freeze ritual (locked, D-I). Boundary eval is in the per-commit gate; retrieval eval (which actually caught the overfit) is too heavy/DB-bound for per-commit CI and stays Eval-mode. To stop silent retrieval regressions: before merging any pipeline change, run a full-corpus retrieval sweep vs the frozen baseline. Re-freezing baseline-snapshot.json is a deliberate act with a documented trigger (an intended, reviewed pipeline improvement), never an incidental side effect.

Key Interfaces

fixtures/segmentation/manifest.json — the corpus registry (format, category, gold flags). Extend with a category field (STRUCTURED / FLAT-VERSE / PROSE / TABULAR) + the format×format coverage intent. Migration (G11): backfill category on the existing ~9 boundary-gold docs as part of the schema change (a one-time tagging pass).
*.queries.json / *.boundaries.json — per-doc gold (the curated flag gates gating use). Query gold now includes negative/out-of-scope entries (empty answerSections).
Gold-gen CLI (to build): seed-queries.ts → Opus sub-agent propose → code-validate → human curate-in-review-format. Re-runnable as the gold-revalidation step (D-G) on chunker change.
Experiment registry (to build): format TBD (markdown table vs JSON) — OQ#4. Note: the snapshot-stamping requirement (D-H, git SHA per row) leans toward an in-repo JSON artifact the harness appends to (auto-stamps SHA) over a hand-maintained Foundry table.

Decisions Log

D-A (locked, this session): semantic grouping is NOT the universal default. Where the author declared units (STRUCTURED numbering, VERSE numbering) reading them is more correct, deterministic, cheaper (zero model calls), and citation-preserving than inferring them. Semantic grouping is the tool for the absence of structure (PROSE). Unification lives at the shared PageGrouping seam, not at the unit-detector. Similarity may serve as an optional shared size-refinement layer (merge runts / split giants) on top of structure — open (OQ#3).
D-B (locked, this session): n≥2 docs per category before trusting a pipeline decision. Direct lesson from the Comp-overfit episode. Made procedural in the experiment-registry convention. Caveat (red-team): the 2 docs must vary by format/source to count — Comp+Tech are siblings (PDF/FIA), so STRUCTURED is n≥2 in count but n=1 in format.
D-C (locked): format and category are orthogonal axes. Parsing keys on format; chunking keys on category. The corpus is a matrix.
D-D (locked, red-team 2026-06-03): gold generation is Opus-sub-agent-propose → code-validate → human-curate-in-review-format. A fresh Opus-family sub-agent drafts queries by bucket (lookup, vocab-mismatch, cross-reference, scoping, definitional, negative/out-of-scope); code validates answer-section IDs; the curator reviews the draft in the AskUserQuestion format and flags/corrects. Different-model-family / blind answer-key NOT required at this stage — revisit only if gold-bias is shown to matter. Precision sub-decision: every doc's gold includes N negative/out-of-scope queries (empty answerSections) to measure precision, not just recall.
D-E (open, scoped down): TABULAR is deferred, not just unspecified. It may break the gold model itself — a spreadsheet may have no sections and may not pass through the PageGrouping seam, so section_id-anchored gold may not apply. Named as a future category; no matrix pressure until it's designed for real. (OQ#1.)
D-F (locked, red-team): effect-size floor on verdicts. A result counts only if its metric delta clears a minimum-effect threshold AND the experiment shows which individual queries moved. No aggregate-only verdicts. Bootstrap CIs deferred.
D-G (locked, red-team): gold has a revalidation step. On any chunker change, re-run code-validation of every gold answerSection against the new sections table; flag stale ids for re-curation. Prevents silent prose-gold rot.
D-H (locked, red-team): registry verdicts are corpus-stamped. Each row records docs + git SHA; a verdict auto-flags 'revisit' when the corpus grows past its snapshot.
D-I (locked, red-team): retrieval has a sweep + re-freeze ritual. Full-corpus retrieval sweep vs frozen baseline before merging a pipeline change; re-freeze is deliberate with a documented trigger. Retrieval stays out of per-commit CI.
D-J (locked, red-team): corpus backlog order. US Constitution (md, STRUCTURED) warm-up → 2nd PROSE + 2nd VERSE doc → TABULAR deferred. Onboarding cadence is on-demand per experiment (no fixed schedule).

Risks & Constraints

Gold quality is the eval's foundation — LLM-proposed gold risks baking the LLM's biases into the "ground truth." Guard (D-D): Opus sub-agent proposes → code-validates IDs → human curates in the review format, flagging anything off. Judged sufficient for now; the deeper guards (different model family, blind answer-key) are held in reserve if a gold-bias problem surfaces.
Recall-only gold is gameable — recall@10 = 1.0 is trivially won by returning everything. Mitigated by the negative/out-of-scope precision requirement (D-D).
Prose gold rots silently — PROSE answer-sections are pipeline-generated ids that break on re-chunking. Mitigated by the gold-revalidation step (D-G).
Retrieval regressions are ungated — retrieval eval is Eval-mode (DB+OpenAI), not per-commit CI; the dimension that caught the overfit could regress unnoticed. Mitigated by the pre-merge sweep + re-freeze ritual (D-I), not by CI.
Copyright — corpus docs must be public-domain or owned (WEB Bible, Project Gutenberg, US gov docs, synthetic). Wikipedia is CC-BY-SA, not PD — exclude it from the html-prose cell; prefer Gutenberg/US-gov/synthetic. Gold stores IDs + page only, never copyrighted body text (existing discipline).
Matrix combinatorics — format × category × buckets explodes; resist filling every cell. ≥2 docs/category across ≥2 formats is the bar, not exhaustiveness.
Retrieval gold needs embeddings + DB (Eval-mode, never the gate) — onboarding a doc for retrieval eval is heavier than for boundary eval.

Known Issues / Tech Debt

Grader findings logged this session (chunking-quality backlog, not corpus-process per se but surfaced here):

alt-TOC format leaking on Tech ("Article T3: …car 19" — colon+title+pageno, no dotted leader); isTocLine misses it.
Revision-list changelog noise ("LIST OF REVISIONS" front-matter id references) still duplicate-tag.
Intra-digit corruption ("C2.1 5" = C2.15) — 2 residual Comp headers.
Page-aware boundary scorer — scoreBoundaries matches id-only; should match id+page (latent bug that masked the 38-header recovery).

Open Questions (for red/blue-team)

Resolved in red-team 2026-06-03: OQ1 (TABULAR → D-E defer + gold-model flag), OQ2 (gold rigor → D-D), OQ5 (matrix priorities → D-J), OQ6 (precision → required, D-D). Remaining open:

Similarity-as-refinement — is the merge-runts/split-giants layer worth building on top of structure, or YAGNI? (D-A leaves it optional.)
Experiment registry format — markdown table in Foundry, or a JSON artifact in-repo the harness appends to? The snapshot-stamping requirement (D-H, git SHA per row) leans in-repo JSON, but not locked.

Eval Corpus & DoE Process — Sub-system Design Doc#

Architecture#

Overview#

What Is This Sub-system?#

The Story#

Current Status#

System Boundary#

The Two Axes (the core model)#

Coverage Matrix (target — fill deliberately)#

Corpus Backlog — Candidate Documents (brainstormed 2026-06-04)#

Onboarding a New Test Doc (the gold-gen pipeline)#

Experiment Registry (DoE — steps 4–7)#

Key Interfaces#

Decisions Log#

Risks & Constraints#

Known Issues / Tech Debt#

Open Questions (for red/blue-team)#

Review