Eval Corpus & DoE Process — Sub-system Design Doc
Draft 2026-06-03. Extends [[pipeline-eval-harness]]. Status: DRAFT — for red/blue-team next session. The harness (segment/boundary/retrieval scorers, frozen baseline) exists and works; this doc is about the corpus it scores against and the process for growing it + running experiments — so pipeline decisions generalize instead of overfitting to one doc.
Architecture
Overview
What Is This Sub-system?
The eval corpus (the set of test documents + their gold) and the Design-of-Experiments (DoE) process that runs on top of it. The harness answers "did this pipeline change improve quality?"; this sub-system answers "on which documents, and how do we keep adding more without it becoming a hand-authoring slog?"
Two deliverables:
- A format × category coverage matrix of test docs (public-domain), deliberately filled to catch overfitting.
- A repeatable onboarding pipeline (parse → chunk → semi-automated query-gold generation) + an experiment registry so DoE results accumulate instead of dying in throwaway scripts.
The Story
We hardened structured-doc chunking against a single doc (SRWF26 Competition Regs). It worked — until we added a second structured doc (Tech Regs) and discovered the recall@1 "win" from a reranking strategy was Comp-specific overfitting: it vanished on Tech (which was already at ceiling). One extra doc flipped a conclusion we were about to ship.
That's the thesis of this sub-system: n=1 lies. Every pipeline decision (chunking, routing, retrieval) must be validated across a spread of documents — different formats, different structural categories — or we tune to quirks. The harness made experiments cheap and deterministic; now we need the corpus to be broad and the gold to be cheap to produce, or breadth stays bottlenecked on hand-authoring.
Current Status
- Corpus today: boundary gold on ~9 docs (synthetic, FIA A–F, SRWF Comp/Tech, brehob); retrieval query gold on exactly 2 (Comp 22q, Tech 16q — both structured). Prose (novel) + verse (genesis) are embedded but have no query gold.
- Categories proven: STRUCTURED (deterministic section-aware chunker, shipped + default-on). PROSE + VERSE semantic-grouping validated by prototype (adjacency-similarity valleys recover real boundaries — creation-days in Genesis, scene/dialogue beats in the novel) but not built or gold-scored.
- Gold generation: fully manual today. `seed-queries.ts` + `emit-gold.ts` + the `curated` flag are the skeleton of a semi-automated path; not yet wired into a process.
- Experiments: run ad-hoc in throwaway `_*.ts` scripts; no registry.
Red-team caveat (2026-06-03) — STRUCTURED is not yet validated to our own ≥2-format bar. Comp + Tech are sibling docs: both PDF, both FIA, both SRWF26. The "2nd doc caught the reranking overfit" win is real but weaker than it looks — two docs of the same format and source. STRUCTURED is n≥2 in count but n=1 in format; it is not format-validated until a non-PDF structured doc scores clean. This is why the US Constitution (md) leads the backlog (D-J).
System Boundary
In scope: the corpus manifest, public-domain test docs + their gold artifacts, the gold-generation process, the experiment registry/conventions, the coverage matrix. Out of scope: the scorers themselves ([[pipeline-eval-harness]]) and the production pipeline ([[design]]).
The Two Axes (the core model)
Document format and document category are orthogonal and drive different halves of the pipeline. Conflating them is the main thing to avoid.
- Format = how it's encoded → drives PARSING (text + geometry extraction, front of pipeline). PDF (text-layer / image) · docx · xlsx · csv · html · markdown · epub · plaintext.
- Category = how it's semantically organized → drives CHUNKING (unit detection, the work we've been doing).
- STRUCTURED — author-declared numbered hierarchy (regs, contracts, the US Constitution). Units read deterministically from numbering. Shipped.
- FLAT-VERSE — flat author-declared units (scripture book/chapter/verse). Boundaries free; units too small → group up to passages. Prototype.
- PROSE — no declared units (novels, essays, articles). Units inferred via semantic grouping. Prototype.
- TABULAR — rows/columns (xlsx, csv). A genuinely new category with its own unit question (row? row-group? sheet?). Unexplored — open question.
An HTML doc may be PROSE (blog) or STRUCTURED (API docs); a PDF may be any category. So the corpus is a matrix, and we fill cells deliberately. A pipeline decision must hold across a row (category, varying format) and a column (format, varying category) before we trust it.
Coverage Matrix (target — fill deliberately)
| | PDF | docx | html | xlsx/csv | epub/md |
|---|---|---|---|---|---|
| STRUCTURED | ✅ SRWF Comp/Tech, FIA | (contract?) | (API docs?) | — | ⬜ US Constitution md |
| FLAT-VERSE | ✅ WEB Bible (genesis…) | — | — | — | ⬜ 2nd verse doc |
| PROSE | — | ✅ novel | (PD prose — not Wikipedia) | — | ⬜ Gutenberg epub (2nd prose) |
| TABULAR | — | — | — | 🔒 deferred (gold-model risk) | — |
✅ = have it · ⬜ = priority gap · 🔒 = deferred · (?) = candidate. The point isn't to fill every cell — it's to have ≥2 docs per category across ≥2 formats before declaring a category "solved."
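The "≥2 docs per category across ≥2 formats" bar is mechanical enough to check from the manifest. The following is a minimal sketch, assuming each manifest entry carries a `format` and a `category`; the `ManifestDoc` shape, the `id` field, the doc ids, and the function name are all hypothetical.

```typescript
// Hypothetical manifest entry: only `format` and `category` come from this doc;
// `id` and the exact value spellings are assumptions.
type DocCategory = "STRUCTURED" | "FLAT-VERSE" | "PROSE" | "TABULAR";

interface ManifestDoc {
  id: string;
  format: string; // e.g. "pdf", "md", "html"
  category: DocCategory;
}

interface CategoryCoverage {
  category: DocCategory;
  docCount: number;
  formats: string[];
  meetsBar: boolean; // >=2 docs AND >=2 distinct formats
}

function categoryCoverage(docs: ManifestDoc[]): CategoryCoverage[] {
  const byCategory = new Map<DocCategory, ManifestDoc[]>();
  for (const doc of docs) {
    const members = byCategory.get(doc.category) ?? [];
    members.push(doc);
    byCategory.set(doc.category, members);
  }
  return Array.from(byCategory.entries()).map(([category, members]) => {
    const formats = Array.from(new Set(members.map((d) => d.format))).sort();
    return {
      category,
      docCount: members.length,
      formats,
      meetsBar: members.length >= 2 && formats.length >= 2,
    };
  });
}

// Two sibling PDFs: n>=2 in count, n=1 in format.
const siblingDocs: ManifestDoc[] = [
  { id: "srwf-comp", format: "pdf", category: "STRUCTURED" },
  { id: "srwf-tech", format: "pdf", category: "STRUCTURED" },
];
console.log(categoryCoverage(siblingDocs));

// Adding a markdown structured doc is what moves the category over the bar.
const withMarkdown = siblingDocs.concat([
  { id: "us-constitution", format: "md", category: "STRUCTURED" },
]);
console.log(categoryCoverage(withMarkdown));
```

Counting distinct formats separately from doc count is what keeps two sibling docs from passing as a validated category.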
Backlog order (locked 2026-06-03 — D-J):
- US Constitution (markdown, STRUCTURED) — warm-up. Doubles as (a) the first non-PDF structured doc → closes the format-variety hole on the one "solved" category, and (b) the first end-to-end run of the semi-automated gold-gen pipeline below.
- 2nd PROSE + 2nd VERSE doc — bring both prototype-validated categories to n≥2 so they can be built + gold-scored with confidence.
- TABULAR — deferred (D-E); may not fit the current gold model at all.
Copyright note (G12): a generic "wiki article" candidate is not public-domain — Wikipedia is CC-BY-SA. Prefer Project Gutenberg / US-gov / synthetic PD sources for the html-prose cell. Gold stores IDs + page only regardless.
Corpus Backlog — Candidate Documents (brainstormed 2026-06-04)
Public-domain / licensing-safe candidates per category, to fill the matrix over time. Licensing rule (G12): Project Gutenberg / US-gov / synthetic only — no Wikipedia (CC-BY-SA). Compounding goal: prefer candidates that also fill an empty format cell. Gold stores IDs + page only regardless of source.
PROSE
- ✅ SELECTED (eval-driven-pipeline-validation epic, 2nd prose doc): a NASA History Series monograph in HTML (e.g. This New Ocean: A History of Project Mercury, NASA SP-4201) — born-clean HTML (explicit paragraph markup → dodges the indent-collapse bug), continuous narrative-expository prose, US-gov public domain, fact-dense (names/dates/events → strong retrieval questions), modern register, fills the PROSE×html cell. Trim to a few chapters. Continues the existing NASA thread.
- 9/11 Commission Report (a chapter) — US-gov PD, modern fact-dense long-form prose; within-chapter is continuous prose.
- Darwin, On the Origin of Species (Gutenberg HTML/epub, trimmed) — classic science exposition; fills the epub cell; 19th-c. register.
- The Federalist Papers (Gutenberg) — cross-reference-rich essays; caveat: light authored structure (numbered papers) muddies the pure-prose label.
FLAT-VERSE (2nd verse doc — deferred from the current epic; backlog)
- Tao Te Ching (Legge, 1891) — tiny aphoristic numbered units; stresses "group up to passages" hardest; very different from Genesis narrative.
- Quran (an older PD translation — sura:ayah) — large, very different content.
- Psalms (WEB/KJV) — poetic verse, distinct from Genesis's narrative-in-verses.
STRUCTURED (already n≥2; these add subtype/format variety, not urgent)
- Bitcoin whitepaper (Nakamoto, 2008; MIT-licensed) — academic/technical-paper subtype: numbered sections + equations + figures. Genuinely structured (numbered sections → deterministic boundaries), NOT prose. A nice subtype contrast with the regs/Constitution.
- A CFR part or US Code title — deeply numbered hierarchy, US-gov PD; the platonic structured doc.
- Geneva Conventions / a treaty — numbered articles.
- A contract or model code (docx) — fills the structured×docx cell.
TABULAR — deferred (D-E, gold-model risk). Candidates if/when it's designed: data.gov CSVs, CIA World Factbook. No matrix pressure until the gold model supports it.
Onboarding a New Test Doc (the gold-gen pipeline)
The bottleneck for breadth. Hand-authoring Tech's 16 queries took real grounding effort; that doesn't scale. Target: author → review, not author-from-scratch. Apply [[feedback_llm_does_semantics_code_does_mechanics]]:
1. Ingest the doc (parse → chunk → embed) under the current pipeline.
2. LLM proposes (semantic): a fresh Opus-family sub-agent is handed the doc's sections/content and a fixed list of query buckets, and drafts N queries per bucket + candidate `answerSections`. Buckets: lookup, vocab-mismatch, cross-reference, scoping, definitional, and negative/out-of-scope (the last with empty `answerSections`; see precision requirement).
3. Code validates (mechanical): every proposed `answerSection` ID must exist in the doc's `sections` table + be reachable; drop/flag invalid ones. (Extends `seed-queries.ts`.)
4. Human curates by correction (judgment): the validated draft is presented back to the curator in the AskUserQuestion review format (the same one we red-team with): batched, curator flags anything off, fixes wrong answer-sections, cuts bad queries, then sets `curated: true`. Review, not authorship.
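The code-validation step above can be sketched as a pure function over the LLM draft. This is an illustration, not the contents of `seed-queries.ts`: the `DraftQuery` shape, the `issues` field, and the sample ids and query text are assumptions, and only existence is checked here (the "reachable" check is left out).

```typescript
// Field names other than `answerSections` and `curated` are assumptions.
type QueryBucket =
  | "lookup"
  | "vocab-mismatch"
  | "cross-reference"
  | "scoping"
  | "definitional"
  | "negative";

interface DraftQuery {
  text: string;
  bucket: QueryBucket;
  answerSections: string[];
}

interface ValidatedQuery extends DraftQuery {
  curated: boolean; // stays false until the human review pass
  issues: string[]; // empty when the draft passed every mechanical check
}

function validateDraft(draft: DraftQuery[], sectionIds: Set<string>): ValidatedQuery[] {
  return draft.map((q) => {
    const issues: string[] = [];
    const known = q.answerSections.filter((id) => sectionIds.has(id));
    for (const id of q.answerSections) {
      if (!sectionIds.has(id)) issues.push(`unknown section id: ${id}`);
    }
    if (q.bucket === "negative" && q.answerSections.length > 0) {
      issues.push("negative query must have empty answerSections");
    }
    if (q.bucket !== "negative" && known.length === 0) {
      issues.push("no valid answer section left");
    }
    // Unknown ids are dropped; the issue list tells the curator what was removed.
    return { ...q, answerSections: known, curated: false, issues };
  });
}

// Placeholder queries; "C9.99" stands in for an id the LLM made up.
const draftSectionIds = new Set(["C1.1", "C2.15"]);
const validatedDraft = validateDraft(
  [
    { text: "placeholder lookup query", bucket: "lookup", answerSections: ["C2.15", "C9.99"] },
    { text: "placeholder out-of-scope query", bucket: "negative", answerSections: [] },
  ],
  draftSectionIds,
);
console.log(validatedDraft);
```

Keeping `curated` false on every validated query leaves the final sign-off with the human reviewer, matching the review-not-authorship split.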
A different model family or a blind independent answer-key was considered and not required for now (D-D) — an Opus sub-agent + human-flag-in-review is the agreed rigor bar. Revisit only if a precision/quality problem traces back to gold bias.
Precision requirement (locked, D-D buckets): every onboarded doc's gold MUST include N negative / out-of-scope queries (empty answerSections). Recall-only gold is gameable — recall@10 = 1.0 is trivially won by returning everything; negatives are how we measure that the pipeline doesn't surface the wrong chunk. The harness already supports empty answerSections.
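To make the gameability concrete, here is a small sketch that scores a degenerate "return everything" pipeline. It assumes hits carry a score and that a negative query counts as failed when any hit clears a confidence cutoff; that cutoff, the type names, and both function names are hypothetical and not the harness's actual scorer.

```typescript
interface ScoredHit {
  sectionId: string;
  score: number;
}

interface QueryRun {
  answerSections: string[]; // empty means a negative / out-of-scope query
  hits: ScoredHit[]; // pipeline output, best first
}

// Mean fraction of gold answer sections found in the top k, over positive queries only.
function recallAtK(runs: QueryRun[], k: number): number {
  const positives = runs.filter((r) => r.answerSections.length > 0);
  if (positives.length === 0) return 0;
  let total = 0;
  for (const run of positives) {
    const topIds = run.hits.slice(0, k).map((h) => h.sectionId);
    const found = run.answerSections.filter((id) => topIds.indexOf(id) !== -1).length;
    total += found / run.answerSections.length;
  }
  return total / positives.length;
}

// Fraction of negative queries where the pipeline still surfaced a confident hit.
function negativeSurfaceRate(runs: QueryRun[], minScore: number): number {
  const negatives = runs.filter((r) => r.answerSections.length === 0);
  if (negatives.length === 0) return 0;
  const surfaced = negatives.filter((r) => r.hits.some((h) => h.score >= minScore));
  return surfaced.length / negatives.length;
}

// A degenerate pipeline that returns every section with high confidence.
const everything: ScoredHit[] = [
  { sectionId: "C1.1", score: 0.9 },
  { sectionId: "C2.15", score: 0.9 },
];
const greedyRuns: QueryRun[] = [
  { answerSections: ["C2.15"], hits: everything },
  { answerSections: [], hits: everything },
];
console.log(recallAtK(greedyRuns, 10)); // recall alone looks perfect
console.log(negativeSurfaceRate(greedyRuns, 0.5)); // the negative query exposes it
```

Without the second query the greedy pipeline is indistinguishable from a good one, which is the case for requiring negatives in every doc's gold.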
Gold-revalidation step (locked, D-G): gold answer-keys are section_ids. STRUCTURED/VERSE ids are author-declared (stable across re-chunking); PROSE ids are pipeline-generated and will break when the chunker changes. So onboarding is not one-and-done: on any chunker change, re-run code-validation (step 3) of every gold answerSection against the new sections table and flag stale ids for re-curation. This makes gold-rot visible and cheap instead of silent.
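A minimal sketch of the revalidation pass, assuming gold entries are already loaded in memory. The `GoldEntry` shape, the `prose-…` id format, and the choice to reset `curated` on stale entries are assumptions; the source only requires that stale ids be flagged for re-curation.

```typescript
interface GoldEntry {
  queryId: string;
  answerSections: string[];
  curated: boolean;
}

interface StaleEntry {
  queryId: string;
  staleIds: string[];
}

interface RevalidationResult {
  gold: GoldEntry[]; // same entries, with stale ones sent back for review
  stale: StaleEntry[];
}

function revalidateGold(gold: GoldEntry[], currentSectionIds: Set<string>): RevalidationResult {
  const stale: StaleEntry[] = [];
  const updated = gold.map((entry) => {
    const staleIds = entry.answerSections.filter((id) => !currentSectionIds.has(id));
    if (staleIds.length === 0) return entry;
    stale.push({ queryId: entry.queryId, staleIds });
    // Assumption: a stale entry drops back to uncurated until a human re-checks it.
    return { ...entry, curated: false };
  });
  return { gold: updated, stale };
}

// Hypothetical prose ids: the chunker change renumbered "prose-0003" away.
const proseGold: GoldEntry[] = [
  { queryId: "q1", answerSections: ["prose-0001"], curated: true },
  { queryId: "q2", answerSections: ["prose-0003"], curated: true },
];
const afterRechunk = revalidateGold(proseGold, new Set(["prose-0001", "prose-0002"]));
console.log(afterRechunk.stale);
```

Negative queries have empty `answerSections`, so they can never go stale under this check and pass through untouched.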
This keeps the human on the load-bearing judgment (is this a fair query? is the expected section right?) and offloads the mechanical drafting. The emit-gold curate-by-correction pattern already does this for boundary gold; generalize it to query gold.
Experiment Registry (DoE — steps 4–7)
This session's experiments (retrieval-strategy bake-off, semantic-grouping prototype) lived in throwaway scripts and the conclusions in chat. Formalize:
- A lightweight registry (one row per experiment: hypothesis, strategy, docs scored, result, verdict) so results accumulate and we don't re-run settled questions ("reranking is a dead end" should be recorded once).
- Convention: an experiment scores across a category row of the matrix, not one doc (the anti-overfit rule, made procedural).
Effect-size floor (locked, D-F). A verdict may not rest on an aggregate-only delta. Two requirements before a result counts:
- The metric delta must clear a minimum-effect threshold — small deltas are noise (the reranking "win" was +0.016 MRR ≈ a fraction of one query over 16).
- The experiment output must show which individual queries changed rank, not just the aggregate. A verdict with no moved queries to point at is noise.
(Bootstrap confidence intervals were considered and deferred — more machinery than the min-delta + moved-queries rule, which catches the same trap cheaply.)
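The two requirements can be sketched as one check over per-query ranks. As an illustration of how small +0.016 MRR is: a single query moving from rank 4 to rank 2 over 16 queries gives (1/2 − 1/4) / 16 = 0.015625. That move is an invented example of a delta this size, not the actual experiment, and the 0.05 threshold, type names, and `judgeDelta` are likewise hypothetical.

```typescript
interface QueryRank {
  queryId: string;
  rank: number | null; // 1-based rank of the first correct hit; null = miss
}

interface MovedQuery {
  queryId: string;
  before: number | null;
  after: number | null;
}

interface DeltaVerdict {
  mrrDelta: number;
  moved: MovedQuery[];
  counts: boolean; // true only if the delta clears the floor AND queries moved
}

function reciprocalRank(rank: number | null): number {
  return rank === null ? 0 : 1 / rank;
}

function judgeDelta(baseline: QueryRank[], candidate: QueryRank[], minDelta: number): DeltaVerdict {
  const candidateRanks = new Map<string, number | null>();
  for (const c of candidate) candidateRanks.set(c.queryId, c.rank);

  let sum = 0;
  const moved: MovedQuery[] = [];
  for (const b of baseline) {
    const after = candidateRanks.get(b.queryId) ?? null; // absent from candidate = miss
    sum += reciprocalRank(after) - reciprocalRank(b.rank);
    if (after !== b.rank) moved.push({ queryId: b.queryId, before: b.rank, after });
  }
  const mrrDelta = baseline.length === 0 ? 0 : sum / baseline.length;
  return { mrrDelta, moved, counts: Math.abs(mrrDelta) >= minDelta && moved.length > 0 };
}

// 16 queries; a single query moves from rank 4 to rank 2.
const baselineRanks: QueryRank[] = [];
const candidateRanks16: QueryRank[] = [];
for (let i = 1; i <= 16; i++) {
  baselineRanks.push({ queryId: `q${i}`, rank: i === 16 ? 4 : 1 });
  candidateRanks16.push({ queryId: `q${i}`, rank: i === 16 ? 2 : 1 });
}
const sampleVerdict = judgeDelta(baselineRanks, candidateRanks16, 0.05);
console.log(sampleVerdict);
```

Returning the moved queries alongside the aggregate means a reviewer can see at a glance that the whole "win" rests on one query.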
Verdict snapshot-stamping (locked, D-H). Every registry row records the corpus it was valid against (docs + git SHA). A verdict ("reranking is a dead end") is corpus-relative — the same mechanism that produced it (adding a doc) can flip it back when prose/verse/tabular land. When the corpus grows past a verdict's snapshot, the verdict auto-flags as 'revisit' rather than being trusted forever.
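One way the stamping and auto-flag could look, as a sketch. The registry format is still open, so the row shape, field names, doc ids, and `needsRevisit` are assumptions, and `<git-sha>` is a placeholder rather than a real commit.

```typescript
// One possible row shape. Field names are assumptions; the registry format is still open.
interface RegistryRow {
  hypothesis: string;
  strategy: string;
  docsScored: string[];
  result: string;
  verdict: string;
  snapshot: { docs: string[]; gitSha: string }; // corpus the verdict was valid against
}

// A verdict needs a revisit once the corpus contains a doc its snapshot never saw.
function needsRevisit(row: RegistryRow, currentCorpus: string[]): boolean {
  const seen = new Set(row.snapshot.docs);
  return currentCorpus.some((docId) => !seen.has(docId));
}

const rerankRow: RegistryRow = {
  hypothesis: "reranking improves recall@1",
  strategy: "rerank",
  docsScored: ["srwf-comp", "srwf-tech"],
  result: "gain on Comp only; vanished on Tech",
  verdict: "reranking is a dead end",
  snapshot: { docs: ["srwf-comp", "srwf-tech"], gitSha: "<git-sha>" },
};

console.log(needsRevisit(rerankRow, ["srwf-comp", "srwf-tech"]));
console.log(needsRevisit(rerankRow, ["srwf-comp", "srwf-tech", "us-constitution"]));
```

The flag is computed from the snapshot on read rather than stored, so a row cannot go out of date when the corpus grows.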
Full-corpus sweep + re-freeze ritual (locked, D-I). Boundary eval is in the per-commit gate; retrieval eval (which actually caught the overfit) is too heavy/DB-bound for per-commit CI and stays Eval-mode. To stop silent retrieval regressions: before merging any pipeline change, run a full-corpus retrieval sweep vs the frozen baseline. Re-freezing baseline-snapshot.json is a deliberate act with a documented trigger (an intended, reviewed pipeline improvement), never an incidental side effect.
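The pre-merge comparison could be as simple as the sketch below. The layout of `baseline-snapshot.json` is not described here, so the doc-to-metric table, the higher-is-better assumption, the tolerance, and the numbers are all made up for illustration.

```typescript
// Assumed layout: docId -> metric name -> value, where higher is better.
type MetricTable = Record<string, Record<string, number>>;

interface Regression {
  docId: string;
  metric: string;
  baseline: number;
  current: number; // NaN when the sweep did not report the metric at all
}

function findRegressions(frozen: MetricTable, sweep: MetricTable, tolerance: number): Regression[] {
  const regressions: Regression[] = [];
  for (const docId of Object.keys(frozen)) {
    const frozenMetrics = frozen[docId] ?? {};
    const sweepMetrics = sweep[docId] ?? {};
    for (const metric of Object.keys(frozenMetrics)) {
      const base = frozenMetrics[metric] ?? 0;
      const current = sweepMetrics[metric];
      // A metric missing from the sweep is reported rather than silently skipped.
      if (current === undefined || current < base - tolerance) {
        regressions.push({ docId, metric, baseline: base, current: current ?? Number.NaN });
      }
    }
  }
  return regressions;
}

// Made-up numbers: Tech drops, Comp holds.
const frozenBaseline: MetricTable = {
  "srwf-comp": { recallAt1: 0.9 },
  "srwf-tech": { recallAt1: 1.0 },
};
const fullSweep: MetricTable = {
  "srwf-comp": { recallAt1: 0.9 },
  "srwf-tech": { recallAt1: 0.8 },
};
console.log(findRegressions(frozenBaseline, fullSweep, 0.001));
```

An empty result is the condition for merging; re-freezing stays a separate, deliberate step and is not something this check performs.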
Key Interfaces
- `fixtures/segmentation/manifest.json`: the corpus registry (format, category, gold flags). Extend with a `category` field (STRUCTURED / FLAT-VERSE / PROSE / TABULAR) + the format×category coverage intent. Migration (G11): backfill `category` on the existing ~9 boundary-gold docs as part of the schema change (a one-time tagging pass).
- `*.queries.json` / `*.boundaries.json`: per-doc gold (the `curated` flag controls whether a file may be used for gating). Query gold now includes negative/out-of-scope entries (empty `answerSections`).
- Gold-gen CLI (to build): `seed-queries.ts` → Opus sub-agent propose → code-validate → human curate-in-review-format. Re-runnable as the gold-revalidation step (D-G) on chunker change.
- Experiment registry (to build): format TBD (markdown table vs JSON), see OQ#4. Note: the snapshot-stamping requirement (D-H, git SHA per row) leans toward an in-repo JSON artifact the harness appends to (auto-stamps SHA) over a hand-maintained Foundry table.
Decisions Log
- D-A (locked, this session): semantic grouping is NOT the universal default. Where the author declared units (STRUCTURED numbering, VERSE numbering), reading them is more correct, deterministic, cheaper (zero model calls), and citation-preserving than inferring them. Semantic grouping is the tool for the absence of structure (PROSE). Unification lives at the shared `PageGrouping` seam, not at the unit-detector. Similarity may serve as an optional shared size-refinement layer (merge runts / split giants) on top of structure; this is still open (OQ#3).
- D-B (locked, this session): n≥2 docs per category before trusting a pipeline decision. Direct lesson from the Comp-overfit episode. Made procedural in the experiment-registry convention. Caveat (red-team): the 2 docs must vary by format/source to count. Comp+Tech are siblings (PDF/FIA), so STRUCTURED is n≥2 in count but n=1 in format.
- D-C (locked): format and category are orthogonal axes. Parsing keys on format; chunking keys on category. The corpus is a matrix.
- D-D (locked, red-team 2026-06-03): gold generation is Opus-sub-agent-propose → code-validate → human-curate-in-review-format. A fresh Opus-family sub-agent drafts queries by bucket (lookup, vocab-mismatch, cross-reference, scoping, definitional, negative/out-of-scope); code validates answer-section IDs; the curator reviews the draft in the AskUserQuestion format and flags/corrects. Different-model-family / blind answer-key NOT required at this stage; revisit only if gold-bias is shown to matter. Precision sub-decision: every doc's gold includes N negative/out-of-scope queries (empty `answerSections`) to measure precision, not just recall.
- D-E (open, scoped down): TABULAR is deferred, not just unspecified. It may break the gold model itself: a spreadsheet may have no sections and may not pass through the `PageGrouping` seam, so `section_id`-anchored gold may not apply. Named as a future category; no matrix pressure until it's designed for real. (OQ#1.)
- D-F (locked, red-team): effect-size floor on verdicts. A result counts only if its metric delta clears a minimum-effect threshold AND the experiment shows which individual queries moved. No aggregate-only verdicts. Bootstrap CIs deferred.
- D-G (locked, red-team): gold has a revalidation step. On any chunker change, re-run code-validation of every gold `answerSection` against the new `sections` table; flag stale ids for re-curation. Prevents silent prose-gold rot.
- D-H (locked, red-team): registry verdicts are corpus-stamped. Each row records docs + git SHA; a verdict auto-flags 'revisit' when the corpus grows past its snapshot.
- D-I (locked, red-team): retrieval has a sweep + re-freeze ritual. Full-corpus retrieval sweep vs frozen baseline before merging a pipeline change; re-freeze is deliberate with a documented trigger. Retrieval stays out of per-commit CI.
- D-J (locked, red-team): corpus backlog order. US Constitution (md, STRUCTURED) warm-up → 2nd PROSE + 2nd VERSE doc → TABULAR deferred. Onboarding cadence is on-demand per experiment (no fixed schedule).
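For reference, the adjacency-similarity valley idea behind the PROSE/VERSE semantic-grouping prototype (the tool D-A reserves for unstructured text) can be sketched as below. This is a generic illustration of the technique, not the prototype's code: the cosine measure, the `maxSimilarity` cutoff, the local-minimum rule, and the toy 2-D vectors are all assumptions.

```typescript
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Returns the indices of units that start a new group: a boundary is placed
// wherever the similarity between neighbours is a local minimum below the cutoff.
function valleyBoundaries(embeddings: number[][], maxSimilarity: number): number[] {
  const sims: number[] = [];
  for (let i = 0; i + 1 < embeddings.length; i++) {
    sims.push(cosine(embeddings[i] ?? [], embeddings[i + 1] ?? []));
  }
  const boundaries: number[] = [];
  for (let i = 0; i < sims.length; i++) {
    const here = sims[i] ?? 1;
    const left = i > 0 ? sims[i - 1] ?? 1 : Number.POSITIVE_INFINITY;
    const right = i + 1 < sims.length ? sims[i + 1] ?? 1 : Number.POSITIVE_INFINITY;
    if (here < maxSimilarity && here <= left && here <= right) boundaries.push(i + 1);
  }
  return boundaries;
}

// Toy vectors: units 0-1 point one way, units 2-3 another.
const toyEmbeddings = [
  [1, 0],
  [1, 0.1],
  [0, 1],
  [0.1, 1],
];
console.log(valleyBoundaries(toyEmbeddings, 0.5));
```

Because this path needs an embedding per unit, it costs model calls that reading declared numbering does not, which is the cost argument D-A makes for keeping it off the STRUCTURED and VERSE paths.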
Risks & Constraints
- Gold quality is the eval's foundation — LLM-proposed gold risks baking the LLM's biases into the "ground truth." Guard (D-D): Opus sub-agent proposes → code-validates IDs → human curates in the review format, flagging anything off. Judged sufficient for now; the deeper guards (different model family, blind answer-key) are held in reserve if a gold-bias problem surfaces.
- Recall-only gold is gameable: `recall@10 = 1.0` is trivially won by returning everything. Mitigated by the negative/out-of-scope precision requirement (D-D).
- Prose gold rots silently: PROSE answer-sections are pipeline-generated ids that break on re-chunking. Mitigated by the gold-revalidation step (D-G).
- Retrieval regressions are ungated — retrieval eval is Eval-mode (DB+OpenAI), not per-commit CI; the dimension that caught the overfit could regress unnoticed. Mitigated by the pre-merge sweep + re-freeze ritual (D-I), not by CI.
- Copyright — corpus docs must be public-domain or owned (WEB Bible, Project Gutenberg, US gov docs, synthetic). Wikipedia is CC-BY-SA, not PD — exclude it from the html-prose cell; prefer Gutenberg/US-gov/synthetic. Gold stores IDs + page only, never copyrighted body text (existing discipline).
- Matrix combinatorics — format × category × buckets explodes; resist filling every cell. ≥2 docs/category across ≥2 formats is the bar, not exhaustiveness.
- Retrieval gold needs embeddings + DB (Eval-mode, never the gate) — onboarding a doc for retrieval eval is heavier than for boundary eval.
Known Issues / Tech Debt
Grader findings logged this session (chunking-quality backlog, not corpus-process per se but surfaced here):
- alt-TOC format leaking on Tech (`"Article T3: …car 19"`: colon+title+pageno, no dotted leader); `isTocLine` misses it.
- Revision-list changelog noise ("LIST OF REVISIONS" front-matter id references) is still duplicate-tagged.
- Intra-digit corruption (`"C2.1 5"` = C2.15): 2 residual Comp headers.
- Page-aware boundary scorer: `scoreBoundaries` matches id-only; should match id+page (latent bug that masked the 38-header recovery).
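The difference between id-only and id+page matching can be shown with a small stand-in. This is not `scoreBoundaries` itself, whose signature is not given here; `matchBoundaries`, the `Boundary` shape, and the page numbers are hypothetical.

```typescript
interface Boundary {
  id: string;
  page: number;
}

interface BoundaryScore {
  matched: number;
  missed: Boundary[]; // in gold, not predicted
  spurious: Boundary[]; // predicted, not in gold
}

function matchBoundaries(gold: Boundary[], predicted: Boundary[], pageAware: boolean): BoundaryScore {
  // With pageAware off, a header id found on the wrong page still counts as a hit.
  const key = (b: Boundary): string => (pageAware ? `${b.id}@${b.page}` : b.id);
  const goldKeys = new Set(gold.map(key));
  const predictedKeys = new Set(predicted.map(key));
  return {
    matched: gold.filter((b) => predictedKeys.has(key(b))).length,
    missed: gold.filter((b) => !predictedKeys.has(key(b))),
    spurious: predicted.filter((b) => !goldKeys.has(key(b))),
  };
}

// Made-up pages: the id was detected, but on an early page (for example a TOC line)
// rather than where the real header sits.
const goldHeaders: Boundary[] = [{ id: "C2.15", page: 12 }];
const predictedHeaders: Boundary[] = [{ id: "C2.15", page: 3 }];

console.log(matchBoundaries(goldHeaders, predictedHeaders, false));
console.log(matchBoundaries(goldHeaders, predictedHeaders, true));
```

Id-only matching reports a clean hit here while the page-aware variant reports one miss and one spurious boundary, which is how a wrong-page detection can hide a real header failure.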
Open Questions (for red/blue-team)
Resolved in red-team 2026-06-03: OQ1 (TABULAR → D-E defer + gold-model flag), OQ2 (gold rigor → D-D), OQ5 (matrix priorities → D-J), OQ6 (precision → required, D-D). Remaining open:
- OQ#3, similarity-as-refinement: is the merge-runts/split-giants layer worth building on top of structure, or YAGNI? (D-A leaves it optional.)
- OQ#4, experiment registry format: markdown table in Foundry, or a JSON artifact in-repo the harness appends to? The snapshot-stamping requirement (D-H, git SHA per row) leans in-repo JSON, but not locked.