EPIC: Eval-Driven Pipeline Validation
Make the eval measure what the product actually runs — multiple indexes plus the agentic router, across a broad corpus — then use it to evaluate the [[unified-chunking-markdown]] design (currently-deployed vs proposed) on the real path. Builds on [[pipeline-eval-harness]] (the scorers) and [[eval-corpus-and-doe]] (the corpus and experiment process).
Status: SCOPED — red-teamed and blue-teamed 2026-06-04; ready for /hl:ship story breakdown. Final scope lives in the [[eval-driven-pipeline-validation-requirements]] contract. The two genuinely-new harness pieces (per-index/hybrid eval, and the documented qualitative comparison) graduate back into [[pipeline-eval-harness]] once settled.
Testing Strategy
This epic is test/eval infrastructure, so its "tests" are the scorers themselves plus guards that show they're trustworthy.
Design
The core of the epic: the eval model and how the experiment uses it.
Context
Where this epic sits relative to the rest of the system — what depends on it, what it depends on, and the state of the world today.
Overview
This section frames what the epic is for, the problem it solves, and what is explicitly out of scope.
Decisions Log
| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-06-04 | The eval must measure the path the product runs, not one isolated index. | Today the harness scores vector search only; production runs an agentic multi-index router. We've been optimizing chunking against a metric the product doesn't use. | Keep vector-only recall (cheap but unfaithful) |
| 2026-06-04 | Three eval layers: (a) per-index + hybrid recall (deterministic workhorse), (b) end-to-end agentic recall, (c) answer-quality (qualitative spot-check, never a gate). | Each answers a different question at a different cost; (c) stays non-gate per [[pipeline-eval-harness]]. | One LLM-judge metric (nondeterministic, rejected); vector-only (unfaithful) |
| 2026-06-04 | This epic builds the instrument AND runs the first experiment with it. It does NOT wire any chunker into production. | Building a measuring tool without using it is half the job; the experiment is the payoff and a live end-to-end smoke test. | Defer the experiment to a later epic (instrument never validated against a real decision) |
| 2026-06-04 (red-team) | Per-index eval is reframed as a hybrid A/B: vector alone vs. vector combined with keyword and lookup. | The real question is "how much does vector search cover, and when it misses, do keyword search and direct lookup recover it?" A lone combined number doesn't answer that; the A/B does. | Score only each index in isolation (no recovery story); score only one lumped combination (can't attribute the lift) |
| 2026-06-04 (red-team) | Second prose document = a born-clean expository document (real paragraph breaks), not the NASA file. | NASA's indent-only paragraphs trip the paragraph-collapse bug (which lives in the other epic) and would couple this epic's corpus work to it. NASA stays the geometry/bbox test document. | Keep NASA and accept the cross-epic dependency; pull paragraph-splitting into this epic (violates the Non-Goal) |
| 2026-06-04 (blue-team) | Cut the automated agentic scored harness; lean on the existing manual localhost-vs-prod comparison for the end-to-end read. | A one-time architecture decision doesn't need a repeatable scored agentic metric; the by-hand comparison covers it. The variance spike (which only sized that harness) goes with it. | Build the automated harness (cost not justified for a one-off decision) |
| 2026-06-04 (blue-team) | The deployed-vs-proposed experiment stays IN this epic, as a lean three-way run (deployed / proposed grouping / windowing baseline) scored on the deterministic hybrid scorecard plus the manual read. | It's the payoff, it's a live smoke test of the harness, and it yields a real product-improvement decision. Affordable now that the heavy agentic machinery is cut. | Relocate it to the unified-chunking epic (instrument never closes the loop) |
| 2026-06-04 (blue-team) | The prose-question revalidation hook is in scope as the prerequisite that makes the experiment's re-chunked scores trustworthy. | Prose answer locations are pipeline-generated, so re-chunking silently rots them; the experiment re-chunks. | Skip it (re-chunked scores quietly wrong); relocate it (the experiment that needs it lives here) |
| 2026-06-04 (blue-team) | The precision fix (similarity threshold) is in scope inside the hybrid scorer. | Without it the hybrid A/B can't tell "keyword recovered the answer" from "keyword returned everything" — the whole point of the comparison. | Ship recall-only (can't back the "recovers without flooding" claim) |
| 2026-06-04 (blue-team) | Second verse document is deferred. | Verse is already validated end-to-end; it's not where the quality pain is. | Onboard it now (corpus breadth for its own sake) |
Goals & Non-Goals
Goals:
- Upgrade the harness to score the real retrieval system: each index on its own (vector / keyword / lookup) and the hybrid combination of them.
- Quantify how much vector search covers on its own, and where keyword search and direct lookup recover its misses — especially the prose gaps pure-vector showed (entity-name flooding, diffuse facts).
- Build corpus breadth (a second prose document) and the supporting maintenance (manifest cleanup, prose-question revalidation).
- Run the deployed-vs-proposed experiment on the unified-chunking design and produce a written verdict — which doubles as a live end-to-end smoke test of the harness.
Non-Goals:
- Wiring any chunker into production (separate epic, gated on this epic's verdict). Carve-out: making candidate chunkers runnable for evaluation (offline) IS in scope.
- An automated, scored agentic harness — cut; the manual localhost-vs-prod comparison covers the end-to-end read.
- LLM answer-judging as a gate — the qualitative comparison is a documented read, never a gate.
- Tabular documents as a category (deferred).
- Indent-aware paragraph splitting — that bug fix lives in the unified-chunking epic. We sidestep it here by choosing a born-clean second prose document.
Problem Statement
Our retrieval numbers (recall@10 of 0.80 on prose and 1.0 on structured documents) come from vector search only, but production runs an agentic multi-index router with three tools (vector search, keyword search, direct section lookup). So the metric driving our chunking decisions does not measure the system the product actually uses. Worse, the [[unified-chunking-markdown]] design's central question (does semantic grouping beat sentence-windowing?) is an A/B that can't be answered without (1) a broad corpus and (2) an eval that measures the real combination. We need the instrument before we can trust any chunking decision, including whether to build the unified-chunking design at all.
What Is This Epic?
The epic that makes the pipeline eval faithful and broad, then runs its first real experiment with it. It upgrades the scorer to measure each index plus the hybrid combination, grows the corpus, and then runs the three-way chunking bake-off to decide whether the unified-chunking design earns implementation. It turns the next chunking decision from argument into measurement — and proves the harness works end-to-end by using it.
Dependents
- [[unified-chunking-markdown]] — its implement/don't-implement decision is the output of this epic's experiment.
- All future chunking and routing decisions ([[pipeline-eval-harness]] frontier work) — gated on a faithful metric.
Dependencies
- [[pipeline-eval-harness]] (scorers, baseline, recall core) — exists.
- [[eval-corpus-and-doe]] (corpus matrix, question-generation pipeline, experiment registry) — exists, partly executed.
- @autri/retrieval exposing vectorSearch / ftsSearch / lookupSection / route() — exists.
- retrieval_log with per-tool source attribution — exists (the router already records which tool returned each chunk), which lowers the per-index measurement cost.
Current State
- Eval: ingestion/eval/retrieval-db.ts calls vector search only. No keyword, lookup, or hybrid measurement.
- Product retrieval — two surfaces:
  - Web app (app/app/docs/[id]/query/actions.ts) → route() → spawns the claude CLI (Haiku, Max-billed) with the doc-search tools attached; that sub-agent picks and composes the three tools and returns an answer plus the combined hits (source-attributed via retrieval_log).
  - Wedge (mcp-servers/doc-search) → the three searches are exposed as tools; the main chat LLM (Claude Desktop / Cursor / Copilot) picks them directly — no Haiku middle layer.
- Corpus: structured docs format-validated (Constitution markdown plus FIA/STEM PDF). Prose has novel questions (recall 0.80). Verse validated end-to-end (Genesis, geometry-preserving, no LLM cost). Retrieval questions exist on roughly three documents.
Affected Systems
| System / Layer | How It's Affected |
|---|---|
| ingestion/eval/* (harness) | New scorers: per-index + hybrid recall, the precision fix, the registry stamp, the revalidation hook |
| @autri/retrieval (route, the three searches) | Consumed as-is |
| retrieval_log | Read for per-index source attribution (already populated) |
| Corpus and questions (fixtures/segmentation/golden) | A born-clean second prose document added |
| semantic-chunk.ts + a new windowing baseline | Made runnable offline for the experiment (not wired to production) |
| [[unified-chunking-markdown]] | Becomes the system under test in the experiment |
The Eval Model (one layer we build, one we already have)
The eval separates into layers, each answering a different question at a different cost and cadence.
| Layer | What it runs | Scored on | Deterministic? | Status |
|---|---|---|---|---|
| (a) per-index + hybrid recall | each index alone (vector / keyword / lookup), plus vector combined with the others | span overlap vs. gold (existing rule) | Yes (pinned embedder) | build (story 2); the workhorse |
| (c) qualitative comparison | the full real answer, branch vs. prod, read by a human | a side-by-side read of the answers | No | already have; document it (story 8) |
The automated, scored agentic layer was cut — for a one-time decision the manual (c) comparison covers the end-to-end read.
Coverage vs. behavior. Per-index (a) is a coverage ceiling: "could this index, run on its own, surface the gold answer." It is not router behavior — the live router only invokes a given index for certain question shapes, so isolated keyword recall can overstate what the system does. The end-to-end behavior is what the manual (c) comparison shows. So (a) tells us where each index has coverage to give; (c) tells us how the real answers actually differ.
The Hybrid Scorer (the headline capability)
Generalize the retrieval scorer (retrieval-db.ts) from vector-only to run each index on its own — vector (semantic), keyword (full-text), and direct lookup — reporting recall and mean-reciprocal-rank for each. Then run the hybrid A/B: vector alone vs. vector combined with the other two. The question this answers, in plain terms: how much does vector search cover on its own, and when it misses something, do keyword search and direct lookup pick up the slack? It is deterministic and cheap, so it can be swept on every chunking config. Output: a per-index + hybrid scorecard, broken out by question type, that quantifies where vector falls short on prose and whether the other two indexes recover it.
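A minimal sketch of what the per-index + hybrid scorecard could compute, under stated assumptions. The types (`Span`, `ScoredQuestion`), the overlap rule, the hit-rate definition of recall (a question counts as recalled when any top-k hit overlaps a gold span), and the round-robin merge policy are all hypothetical; the real scorer in retrieval-db.ts and its existing span-overlap rule may differ.

```typescript
// Hypothetical shapes for illustration; the real types in retrieval-db.ts may differ.
interface Span { docId: string; start: number; end: number } // half-open [start, end)
type IndexName = "vector" | "keyword" | "lookup";
interface ScoredQuestion {
  id: string;
  gold: Span[];                    // gold answer locations
  hits: Record<IndexName, Span[]>; // ranked hits, best first, one list per index
}
interface Score { recall: number; mrr: number }

// Assumed relevance rule: a hit counts when it overlaps any gold span in the same document.
function overlaps(a: Span, b: Span): boolean {
  return a.docId === b.docId && a.start < b.end && b.start < a.end;
}

// 1-based rank of the first relevant hit within the top k, or 0 when there is none.
function firstRelevantRank(hits: Span[], gold: Span[], k: number): number {
  return hits.slice(0, k).findIndex((h) => gold.some((g) => overlaps(h, g))) + 1;
}

function score(questions: ScoredQuestion[], pick: (q: ScoredQuestion) => Span[], k: number): Score {
  let found = 0;
  let reciprocalRankSum = 0;
  for (const q of questions) {
    const rank = firstRelevantRank(pick(q), q.gold, k);
    if (rank > 0) {
      found += 1;
      reciprocalRankSum += 1 / rank;
    }
  }
  const n = Math.max(questions.length, 1);
  return { recall: found / n, mrr: reciprocalRankSum / n };
}

// Assumed merge policy: round-robin across indexes, dropping duplicate spans.
function interleave(lists: Span[][]): Span[] {
  const out: Span[] = [];
  const seen: Record<string, boolean> = {};
  const longest = Math.max(0, ...lists.map((l) => l.length));
  for (let i = 0; i < longest; i++) {
    for (const list of lists) {
      const h = list[i];
      if (!h) continue;
      const key = `${h.docId}:${h.start}:${h.end}`;
      if (!seen[key]) {
        seen[key] = true;
        out.push(h);
      }
    }
  }
  return out;
}

function hybridScorecard(questions: ScoredQuestion[], k: number): Record<IndexName | "hybrid", Score> {
  return {
    vector: score(questions, (q) => q.hits.vector, k),
    keyword: score(questions, (q) => q.hits.keyword, k),
    lookup: score(questions, (q) => q.hits.lookup, k),
    hybrid: score(questions, (q) => interleave([q.hits.vector, q.hits.keyword, q.hits.lookup]), k),
  };
}

// Two synthetic questions: vector finds the first, only keyword finds the second.
const exampleQuestions: ScoredQuestion[] = [
  { id: "q1", gold: [{ docId: "A", start: 0, end: 100 }],
    hits: { vector: [{ docId: "A", start: 50, end: 150 }], keyword: [], lookup: [] } },
  { id: "q2", gold: [{ docId: "A", start: 500, end: 600 }],
    hits: { vector: [{ docId: "A", start: 0, end: 100 }],
            keyword: [{ docId: "A", start: 550, end: 650 }], lookup: [] } },
];
const exampleCard = hybridScorecard(exampleQuestions, 10);
```

In this synthetic case vector alone recalls one of two questions and the hybrid recalls both, which is the "does keyword pick up the slack" reading the A/B is meant to give. Breaking the same scores out by question type would be a matter of grouping `questions` before calling `hybridScorecard`.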
The scorer also gets the precision fix (a similarity threshold, retrieval-db.ts:44) so that negative / out-of-scope questions are scoreable — without it, the A/B can't distinguish "keyword recovered the answer" from "keyword returned everything."
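As a rough illustration of why a threshold makes negative questions scoreable: once low-similarity hits are dropped, an out-of-scope question can pass by returning nothing. The `ThresholdHit` shape, the assumption that similarity is "higher is closer", and the function names below are hypothetical, and how keyword or lookup hits would get a comparable score is left open here.

```typescript
// Hypothetical hit shape; assumes a similarity score where higher means closer.
interface ThresholdHit { chunkId: string; similarity: number }

// Keep only hits at or above the threshold.
function applyThreshold(hits: ThresholdHit[], minSimilarity: number): ThresholdHit[] {
  return hits.filter((h) => h.similarity >= minSimilarity);
}

// A negative (out-of-scope) question passes only when nothing survives the threshold.
function negativePasses(hits: ThresholdHit[], minSimilarity: number): boolean {
  return applyThreshold(hits, minSimilarity).length === 0;
}

// Share of negative questions that still returned something ("flooding").
function falsePositiveRate(negatives: ThresholdHit[][], minSimilarity: number): number {
  if (negatives.length === 0) return 0;
  const flooded = negatives.filter((hits) => !negativePasses(hits, minSimilarity)).length;
  return flooded / negatives.length;
}

const weakHits: ThresholdHit[] = [
  { chunkId: "c1", similarity: 0.2 },
  { chunkId: "c2", similarity: 0.31 },
];
```

With `weakHits`, a threshold of 0.5 lets the negative question pass, while a threshold of 0.3 does not, because one hit survives.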
The Experiment (the payoff + smoke test)
Make the candidate chunkers runnable offline — the proposed semantic-grouping chunker (semantic-chunk.ts, prototyped) and a new sentence-windowing baseline — behind an evaluation flag (instrument glue, not production wiring). Then run the three-way bake-off on the corpus: currently-deployed vs. proposed grouping vs. windowing baseline, scored on the hybrid scorecard and read with the manual comparison. Run the prose-question revalidation hook first, since re-chunking can rot prose answer locations.
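The sentence-windowing baseline does not exist yet, so the following is only a sketch of the general technique: split text into sentences, then emit overlapping windows of a fixed number of sentences. The splitter is deliberately naive (it will mis-split abbreviations such as "e.g."), and the names `WindowChunk`, `splitSentences`, and `windowChunks` and the `size` / `stride` parameters are hypothetical.

```typescript
// Hypothetical chunk shape: text plus character offsets into the source, half-open.
interface WindowChunk { text: string; start: number; end: number }

// Naive splitter: a sentence ends at ., ! or ? followed by whitespace or end of text.
function splitSentences(text: string): WindowChunk[] {
  const out: WindowChunk[] = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const atEnd = i === text.length - 1;
    const isTerminator = ".!?".indexOf(text.charAt(i)) >= 0;
    const nextIsSpace = atEnd || /\s/.test(text.charAt(i + 1));
    if ((isTerminator && nextIsSpace) || atEnd) {
      const raw = text.slice(start, i + 1);
      const lead = raw.search(/\S/); // skip leading whitespace; -1 means blank
      if (lead >= 0) {
        const trimmed = raw.slice(lead).replace(/\s+$/, "");
        out.push({ text: trimmed, start: start + lead, end: start + lead + trimmed.length });
      }
      start = i + 1;
    }
  }
  return out;
}

// Emit windows of `size` sentences, advancing by `stride` sentences each step.
function windowChunks(text: string, size: number, stride: number): WindowChunk[] {
  if (size < 1 || stride < 1) throw new Error("size and stride must be >= 1");
  const sentences = splitSentences(text);
  const chunks: WindowChunk[] = [];
  for (let i = 0; i < sentences.length; i += stride) {
    const first = sentences[i];
    const last = sentences[Math.min(i + size, sentences.length) - 1];
    if (!first || !last) break;
    chunks.push({ text: text.slice(first.start, last.end), start: first.start, end: last.end });
    if (i + size >= sentences.length) break; // the final window already reaches the end
  }
  return chunks;
}

const windowed = windowChunks("One. Two. Three. Four.", 2, 1);
```

Keeping character offsets on each chunk matters for the experiment: a span-overlap scoring rule needs chunk positions in the source document, whichever chunker produced them.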
The output is a written verdict on whether — and which parts of — the unified-chunking design earn implementation. Running this end-to-end also serves as the harness's smoke test: if corpus → re-chunk → score → read works cleanly, the instrument is proven.
Data Model Changes
- A light experiment-registry stamp records which corpus and git commit (plus config) produced each scorecard. Likely the in-repo JSON artifact.
- No production schema change.
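One possible shape for the stamp, sketched under assumptions: the field names (`corpusId`, `gitCommit`, `config`, `createdAt`) and the `stampScorecard` helper are hypothetical, and the actual registry format may already be defined in [[eval-corpus-and-doe]].

```typescript
// Hypothetical stamp shape; field names are illustrative only.
interface ExperimentStamp {
  corpusId: string;   // which corpus version produced the scores
  gitCommit: string;  // abbreviated or full commit hash
  config: Record<string, string | number | boolean>;
  createdAt: string;  // ISO 8601 timestamp
}

interface StampedScorecard<T> {
  stamp: ExperimentStamp;
  scorecard: T;
}

function stampScorecard<T>(
  scorecard: T,
  corpusId: string,
  gitCommit: string,
  config: Record<string, string | number | boolean>,
  now: Date = new Date(),
): StampedScorecard<T> {
  // Reject anything that does not look like a hex commit hash.
  if (!/^[0-9a-f]{7,40}$/.test(gitCommit)) {
    throw new Error(`not a git commit hash: ${gitCommit}`);
  }
  return {
    stamp: { corpusId, gitCommit, config, createdAt: now.toISOString() },
    scorecard,
  };
}

const stamped = stampScorecard(
  { hybridRecall: 1 },
  "corpus-example",
  "abc1234",
  { chunker: "windowing", k: 10 },
  new Date("2026-06-04T00:00:00Z"),
);
// Serialized form, suitable for writing to an in-repo JSON file.
const stampedJson = JSON.stringify(stamped, null, 2);
```

Passing the commit and corpus in explicitly, rather than reading them from the environment inside the helper, keeps the stamp easy to unit-test offline.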
Edge Cases & Gotchas
| Scenario | Expected Behavior | Why It's Tricky |
|---|---|---|
| Keyword search appears to "win" by returning everything | The scorecard must include negative / out-of-scope questions to measure precision (needs the threshold fix) | Recall-only per-index scoring is gameable |
| A per-index "win" read as router behavior | Treat (a) as a coverage ceiling; the real behavior is the manual comparison | Isolated keyword results overstate — the live router only invokes keyword search for some question shapes |
| Prose questions drift when the experiment re-chunks | Run the revalidation hook before scoring the proposed and windowing configs | Prose answer locations are pipeline-generated, not authored |
| The manual comparison only tests the current router path | To test the "no router needed" hypothesis, point the comparison at the wedge / main-chat-LLM path (local doc-search tools in Claude Desktop / Cursor) | The web-app path runs the Haiku router; the wedge path is the no-router driver |
| The experiment ties (grouping ≈ windowing) | Report the tie as the verdict — it means windowing wins on simplicity and the grouping complexity isn't earned | A tie is a real, valuable result, not a failure |
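For the prose-drift scenario, a revalidation hook might look roughly like the sketch below. It assumes each prose question stores a verbatim answer quote alongside the chunk ids it was last found in; that storage model, and the names `ProseQuestion`, `RechunkedChunk`, and `revalidate`, are hypothetical. Anything the hook cannot relocate is flagged for human review rather than silently rewritten.

```typescript
// Hypothetical shapes: the real question and chunk records may differ.
interface RechunkedChunk { id: string; text: string }
interface ProseQuestion { id: string; answerQuote: string; goldChunkIds: string[] }
interface Revalidation {
  questionId: string;
  status: "unchanged" | "relocated" | "needs-review";
  goldChunkIds: string[];
}

// Collapse whitespace and case so reflowed text still matches.
function normalizeText(s: string): string {
  return s.replace(/\s+/g, " ").trim().toLowerCase();
}

function revalidate(questions: ProseQuestion[], chunks: RechunkedChunk[]): Revalidation[] {
  const normalized = chunks.map((c) => ({ id: c.id, text: normalizeText(c.text) }));
  return questions.map((q): Revalidation => {
    const needle = normalizeText(q.answerQuote);
    const ids = needle.length === 0
      ? []
      : normalized.filter((c) => c.text.indexOf(needle) >= 0).map((c) => c.id);
    if (ids.length === 0) {
      // The quote is missing or straddles a chunk boundary: a human must look.
      return { questionId: q.id, status: "needs-review", goldChunkIds: [] };
    }
    const same =
      ids.length === q.goldChunkIds.length &&
      ids.every((id) => q.goldChunkIds.indexOf(id) >= 0);
    return { questionId: q.id, status: same ? "unchanged" : "relocated", goldChunkIds: ids };
  });
}

const revalidated = revalidate(
  [
    { id: "q1", answerQuote: "signed in  1848", goldChunkIds: ["old7"] },
    { id: "q2", answerQuote: "a phrase that is gone", goldChunkIds: ["old9"] },
  ],
  [
    { id: "n1", text: "The treaty was signed in 1848." },
    { id: "n2", text: "Unrelated text." },
  ],
);
```

Here q1 is relocated to chunk n1 and q2 is flagged as needs-review, so the experiment would score the re-chunked configs only against locations that were re-confirmed.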
Test Layers
| Layer | Applies? | Notes |
|---|---|---|
| Unit tests | Yes | Pure scorer cores (per-index + hybrid recall) tested offline with synthetic gold — like retrieval.test.ts |
| Integration (evaluation mode) | Yes | The hybrid scorer run against the live DB; never the per-commit gate |
| Determinism guard | Yes | The hybrid scorer must be reproducible (pinned embedder) |
| Qualitative comparison | Spot-check | The manual branch-vs-prod read; documented, not gated |
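A determinism guard can be as simple as running the scorer more than once and comparing a canonical serialization of the results. The sketch below assumes a synchronous `run` callback for brevity (a scorer that hits the live DB would need an async variant), and the helper names are hypothetical.

```typescript
// Serialize with sorted object keys so key order cannot cause a false mismatch.
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) {
    return "[" + value.map(stableStringify).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + stableStringify(obj[k]))
      .join(",");
    return "{" + body + "}";
  }
  return JSON.stringify(value);
}

// Run the scorer `times` times and throw if any run differs from the first.
function assertDeterministic<T>(run: () => T, times: number = 2): T {
  const first = run();
  const expected = stableStringify(first);
  for (let i = 1; i < times; i++) {
    if (stableStringify(run()) !== expected) {
      throw new Error(`scorer is not deterministic: run ${i + 1} differs from run 1`);
    }
  }
  return first;
}

const stableResult = assertDeterministic(() => ({ recall: 0.8, mrr: 0.5 }), 3);
```

Comparing serialized scorecards rather than individual numbers means a new field added to the scorecard is covered by the guard without further changes.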
Verification Rules
- The hybrid scorer stays deterministic and may gate; the qualitative comparison never gates.
- Every scorecard includes negative / out-of-scope questions.
- The experiment verdict is corpus- and commit-stamped.
- Question changes are reviewed by a human (correct-as-you-curate).
Stories
Scope settled by blue-team — see the [[eval-driven-pipeline-validation-requirements]] contract for full acceptance criteria and the out-of-scope list. Final scoped stories:
| Story | Summary | Status | PR |
|---|---|---|---|
| 1 | Corpus: onboard a born-clean 2nd prose document + its retrieval questions; backfill manifest category + audit coverage | | |
| 2 | Per-index + hybrid A/B scorer (vector alone vs. vector + keyword + lookup) + per-question-type coverage scorecard | | |
| 3 | Precision fix: similarity threshold (retrieval-db.ts:44) so negative / out-of-scope questions are scoreable | | |
| 4 | Light experiment-registry stamp (corpus + git commit + config → scorecard) | | |
| 5 | Prose-question revalidation hook (re-check answer locations after a re-chunk) | | |
| 6 | Make candidate chunkers runnable offline (proposed semantic-grouping + new windowing baseline) | | |
| 7 | Run the three-way experiment (deployed / grouping / windowing) → written verdict + harness smoke test | | |
| 8 | Document the manual localhost-vs-prod qualitative comparison (both driver paths) | | |
Sequencing: build the instrument (1–4) → prepare for re-chunking (5, 6) → run the experiment (7, read with 8). Acceptance criteria get finalized during /hl:ship Step 1.
Cut/relocated (do not re-add): 2nd verse document (deferred), variance spike (cut), automated agentic scored harness (cut), repeatable qualitative-comparison tooling (deferred), production wiring of any chunker (separate epic). Full list in the requirements contract.
Open Questions / Red-Team Targets
Directional steer (2026-06-04, Dan): the chatting agent's tool selection should be the main chat LLM's job, not a separate Haiku sub-router. Once a question is asked, the main agent doing the chatting should choose among vector search / keyword search / direct lookup itself (the wedge pattern) — we should not interpose a sub-agent that picks tools on its behalf. This epic tests that hypothesis via the manual comparison pointed at the wedge path; the actual web-app migration off route() is a separate downstream decision.
Resolved by red-team / blue-team (2026-06-04)
- Is an automated agentic eval worth building? No — cut for a one-time decision; the manual comparison covers the end-to-end read.
- Does the hybrid combination need its own scoring? Yes. It's the headline capability (story 2): vector alone vs. vector + keyword + lookup.
- Does the experiment belong in this epic? Yes — it's the payoff and a live smoke test; kept in as a lean three-way run.
Still open (small, for /hl:ship or later)
- Web-app migration off the Haiku router to the main-LLM-picks-tools pattern — a separate decision, informed by this epic's wedge-path comparison.
- Whether the qualitative comparison ever becomes a repeatable script — by-hand for now; revisit only if run constantly.
Known Issues / Tech Debt
| Issue | Severity | Notes |
|---|---|---|
| Negative / precision questions are unscoreable (no similarity threshold) | Medium | retrieval-db.ts:44; fixed by story 3 |
| Prose questions are chunker-specific | Medium | The revalidation hook (story 5) must run before the experiment scores a re-chunked config |
| The sentence-windowing baseline chunker doesn't exist yet | Medium | A new build inside story 6; it's the third arm of the experiment |
| The two retrieval surfaces (web-app router vs. wedge) can diverge | — | By design; the manual comparison can be pointed at either driver |
Epic doc — refined collaboratively, red-teamed and blue-teamed 2026-06-04. Final scope is the [[eval-driven-pipeline-validation-requirements]] contract. The new harness capabilities graduate into [[pipeline-eval-harness]] once settled. Next: /hl:ship story breakdown.