Foundry

Requirements Contract: Eval-Driven Pipeline Validation

The scoped contract for the [[eval-driven-pipeline-validation]] epic, produced by blue-team on 2026-06-04 after the red-team pass. This is the authority on what is in and out of scope; the epic doc holds the full design rationale. Where the epic's draft story list disagrees with this contract, the contract wins.

Status: SCOPED — ready for /hl:ship story breakdown.


Purpose

In one line: make the eval measure how retrieval actually behaves — each index, the hybrid combination, and the real end-to-end answer — then use it once to decide whether the proposed unified-chunking design is worth building.

This epic delivers a tight, faithful test harness and runs its first real experiment with it. The experiment is deliberately kept in scope for three reasons: it is the payoff (a real product decision), it is a live end-to-end smoke test that proves the harness works, and it produces a takeaway that makes the product better — not just tooling for later.


In Scope (Must Have)

These are the committed deliverables. Each has a plain acceptance check.

1. Corpus breadth — a second prose document. Onboard one born-clean expository prose document (real paragraph breaks, so it does not hit the known indent-paragraph-collapse bug that lives in the other epic) and author its retrieval questions through the standard process (a sub-agent drafts questions across question types, code checks each answer location exists, a human reviews and corrects). Also backfill the category label on the document manifest and audit the corpus against the coverage plan. Done when: the new prose document is ingested with a reviewed question set (including a few deliberately-unanswerable questions), and the manifest audit shows what each category covers.

2. Per-index and hybrid scorer. Generalize the retrieval scorer from vector-only so it runs each index on its own (vector for semantic search, keyword for full-text search, and direct lookup) and also runs the hybrid combination. The headline question it answers: how much does vector search cover on its own, and when it misses, do keyword search and direct lookup pick up the slack? Done when: a per-index and hybrid scorecard, broken out by question type, runs across the corpus and quantifies vector's coverage plus where the other two indexes recover its misses. Read the scorecard as a coverage ceiling, not as router behavior.

3. Precision fix for unanswerable questions. Add the missing similarity threshold (retrieval-db.ts:44) so negative / out-of-scope questions are scoreable. Without this, the hybrid comparison cannot tell "keyword recovered the answer" from "keyword returned everything" — which is the whole point of the comparison. Done when: the scorecard reports precision on negative questions, not just recall.

4. Light experiment registry. A simple stamp recording which corpus, which git commit, and which config produced each scorecard. No agent-driver or repeat-count fields; those belonged to the automated agentic harness, which was cut. Done when: every scorecard can be traced back to the corpus snapshot, commit, and config that produced it.

5. Prose-question revalidation hook. For prose, the correct answer location for each question is generated by the pipeline, not authored — so re-chunking can silently rot it. This hook re-checks every prose answer location after a re-chunk and flags stale ones. It is the prerequisite that makes the experiment's re-chunked scores trustworthy. Done when: running it after a re-chunk reports which prose answer locations are still valid and which need re-curation.

6. Candidate chunkers runnable for evaluation. Make the proposed semantic-grouping chunker (already prototyped) and a new sentence-windowing baseline chunker runnable offline (chunk, embed, score) behind an evaluation flag. This is instrument glue — not wiring either chunker into production. Done when: both candidate chunkers can produce chunks for the corpus offline and feed the scorer.

7. The experiment — three-way comparison (the payoff + smoke test). Run the bake-off: currently-deployed chunking vs. the proposed semantic-grouping vs. the windowing baseline. Score all three on the deterministic per-index and hybrid scorecard, and read the real answers with the manual comparison (item 8). Running this end-to-end also serves as the harness smoke test. Done when: there is a written verdict on whether — and which parts of — the unified-chunking design earn implementation, with the supporting scorecards and a note that the run exercised the full harness end-to-end.

8. Document the manual qualitative comparison. Write down the existing workflow: boot the feature branch on localhost and the deployed product app, ask the same question list, read the answers side by side. Include the note that pointing this at the wedge / main-chat-LLM-with-tools path (via the local doc-search tools in Claude Desktop or Cursor) is how the "does the chat LLM need a separate router?" hypothesis gets tested. This stays a by-hand practice for now. Done when: the workflow is documented clearly enough to run cold, including the two driver paths.
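Items 2 and 3 can be pictured together with a small sketch. It is illustrative only: every type and function name below is hypothetical and none comes from the codebase. It assumes similarity scores are comparable across indexes so a single threshold applies, it models the hybrid as a plain union of the three indexes (a coverage ceiling, not a router), and it treats "precision on negative questions" as the share of unanswerable questions for which nothing survives the threshold. The real scorer may define any of these differently.

```typescript
// All names here are hypothetical and exist only for illustration.
type IndexName = "vector" | "keyword" | "lookup";
type Column = IndexName | "hybrid";

interface Hit {
  chunkId: string;
  similarity: number; // assumed comparable across indexes
}

interface Question {
  id: string;
  expectedChunkIds: string[]; // empty means deliberately unanswerable
}

type ResultsByQuestion = Record<string, Record<IndexName, Hit[]>>;

interface Scorecard {
  recall: Record<Column, number>;            // answerable questions where an expected chunk came back
  negativePrecision: Record<Column, number>; // unanswerable questions where nothing came back
  recoveredVectorMisses: number;             // vector missed, keyword or lookup hit
}

const INDEXES: IndexName[] = ["vector", "keyword", "lookup"];
const COLUMNS: Column[] = ["vector", "keyword", "lookup", "hybrid"];

function zeros(): Record<Column, number> {
  return { vector: 0, keyword: 0, lookup: 0, hybrid: 0 };
}

function scoreCorpus(questions: Question[], results: ResultsByQuestion, threshold: number): Scorecard {
  const hits = zeros();
  const cleanNegatives = zeros();
  let answerable = 0;
  let negatives = 0;
  let recoveredVectorMisses = 0;

  for (const q of questions) {
    const perQuestion = results[q.id];
    const returned: Record<Column, string[]> = { vector: [], keyword: [], lookup: [], hybrid: [] };
    for (const name of INDEXES) {
      const raw: Hit[] = perQuestion ? perQuestion[name] : [];
      // The threshold separates "found the answer" from "returned everything".
      returned[name] = raw.filter((h) => h.similarity >= threshold).map((h) => h.chunkId);
      // Hybrid is modeled as the union of all three indexes.
      returned.hybrid = returned.hybrid.concat(returned[name]);
    }

    if (q.expectedChunkIds.length === 0) {
      negatives += 1;
      for (const col of COLUMNS) if (returned[col].length === 0) cleanNegatives[col] += 1;
      continue;
    }

    answerable += 1;
    const found = (col: Column): boolean =>
      returned[col].some((id) => q.expectedChunkIds.indexOf(id) !== -1);
    for (const col of COLUMNS) if (found(col)) hits[col] += 1;
    if (!found("vector") && (found("keyword") || found("lookup"))) recoveredVectorMisses += 1;
  }

  const ratio = (n: number, d: number): number => (d === 0 ? 0 : n / d);
  const recall = zeros();
  const negativePrecision = zeros();
  for (const col of COLUMNS) {
    recall[col] = ratio(hits[col], answerable);
    negativePrecision[col] = ratio(cleanNegatives[col], negatives);
  }
  return { recall, negativePrecision, recoveredVectorMisses };
}

// Tiny made-up corpus: one vector hit, one vector miss that keyword recovers, one unanswerable question.
const demoQuestions: Question[] = [
  { id: "q1", expectedChunkIds: ["c1"] },
  { id: "q2", expectedChunkIds: ["c2"] },
  { id: "q3", expectedChunkIds: [] },
];
const demoResults: ResultsByQuestion = {
  q1: { vector: [{ chunkId: "c1", similarity: 0.9 }], keyword: [], lookup: [] },
  q2: { vector: [{ chunkId: "c9", similarity: 0.8 }], keyword: [{ chunkId: "c2", similarity: 0.7 }], lookup: [] },
  q3: { vector: [{ chunkId: "c5", similarity: 0.2 }], keyword: [], lookup: [] },
};
const demoCard = scoreCorpus(demoQuestions, demoResults, 0.5);
const demoCardNoThreshold = scoreCorpus(demoQuestions, demoResults, 0);
```

With the threshold at 0.5 the weak vector hit on the unanswerable question is dropped, so that question scores as correctly empty. With the threshold at 0 the same hit is kept and the vector and hybrid columns fail it, which is the ambiguity item 3 removes. A per-question-type breakdown would call the same function once per group of questions.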


Explicitly Out of Scope

Cut or relocated, with the reason — so nobody re-adds them by accident.

| Item | Disposition | Why |
| --- | --- | --- |
| Second verse document + its questions | Deferred | Verse is already validated end-to-end; it is not where the quality pain is. Add when something needs it. |
| Agent tool-choice variance spike | Cut | Its only job was to size the automated agentic harness, which we cut. |
| Automated agentic-retrieval scored harness | Cut | A one-time architecture decision does not need a repeatable scored agentic metric; the manual localhost-vs-prod comparison covers the end-to-end read. |
| Pinned main-LLM proxy, N-repeat spread, retrieval-log coverage verification | Cut | All were mechanics of the automated agentic harness. |
| A repeatable script for the qualitative comparison | Deferred (nice-to-have) | The by-hand workflow is enough for now; revisit only if it is run constantly. |
| Wiring any candidate chunker into production | Out (separate epic) | This epic measures; the unified-chunking epic implements, gated on this epic's verdict. |
| Indent-aware paragraph splitting (the collapse bug fix) | Out (other epic) | Sidestepped here by choosing a born-clean prose document. |
| Tabular documents as a corpus category | Deferred | Not designed for yet. |

Sequencing

The work falls into three stages; the third depends on the first two.

  1. Build the instrument: corpus (item 1) + per-index/hybrid scorer with the precision fix (items 2, 3) + the light registry (item 4).
  2. Prepare for re-chunking: make the candidate chunkers runnable (item 6) + the revalidation hook (item 5).
  3. Run the experiment: the three-way comparison (item 7) read alongside the documented manual comparison (item 8) → a verdict.
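Stage 3 leans on the revalidation hook from stage 2 (item 5), so it helps to picture what that check might look like. The sketch below is an assumption-heavy illustration, not the planned implementation: the types, the function name, and the idea that each prose answer location stores a chunk id plus a short answer quote are all hypothetical. It flags a location as stale when the chunk id no longer exists after a re-chunk, or when the chunk no longer contains the quoted answer text.

```typescript
// Hypothetical shapes; the real pipeline may store answer locations differently.
interface AnswerLocation {
  questionId: string;
  chunkId: string;
  answerQuote: string; // assumed: a short verbatim span that must appear in the chunk
}

interface Chunk {
  id: string;
  text: string;
}

interface RevalidationReport {
  valid: string[];                                  // question ids whose location still holds
  stale: { questionId: string; reason: string }[];  // question ids needing re-curation
}

function revalidateProseLocations(locations: AnswerLocation[], rechunked: Chunk[]): RevalidationReport {
  // Index the re-chunked output by id for direct lookup.
  const byId: Record<string, Chunk | undefined> = {};
  for (const chunk of rechunked) byId[chunk.id] = chunk;

  const report: RevalidationReport = { valid: [], stale: [] };
  for (const loc of locations) {
    const chunk = byId[loc.chunkId];
    if (!chunk) {
      report.stale.push({ questionId: loc.questionId, reason: "chunk id no longer exists" });
    } else if (chunk.text.indexOf(loc.answerQuote) === -1) {
      report.stale.push({ questionId: loc.questionId, reason: "chunk no longer contains the answer text" });
    } else {
      report.valid.push(loc.questionId);
    }
  }
  return report;
}

// Made-up example: one location survives the re-chunk, two do not.
const demoLocations: AnswerLocation[] = [
  { questionId: "q1", chunkId: "p-3", answerQuote: "alpha beta" },
  { questionId: "q2", chunkId: "p-7", answerQuote: "gamma" },
  { questionId: "q3", chunkId: "p-9", answerQuote: "delta" },
];
const demoRechunked: Chunk[] = [
  { id: "p-3", text: "intro alpha beta outro" },
  { id: "p-7", text: "the boundary moved, so the answer left this chunk" },
];
const demoReport = revalidateProseLocations(demoLocations, demoRechunked);
```

In this example q1 stays valid, q2 is flagged because its text moved out of the chunk, and q3 is flagged because its chunk id disappeared. One option is to have the experiment run refuse to score a re-chunked corpus while the stale list is non-empty.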

Definition of Done (epic)

The epic is done when: the per-index and hybrid scorecard runs across the corpus and quantifies vector coverage plus recovery by keyword and lookup; the three-way experiment has produced a written verdict on the unified-chunking design; and that experiment run has served as a clean end-to-end smoke test of the harness. The two new harness capabilities (per-index/hybrid eval, and the documented qualitative comparison) then graduate back into [[pipeline-eval-harness]].


Blue-teamed 2026-06-04. Supersedes the epic's draft story table for scope. Next: /hl:ship story breakdown.
