
EPIC: Eval-Driven Pipeline Validation

Make the eval measure what the product actually runs — multiple indexes plus the agentic router, across a broad corpus — then use it to evaluate the [[unified-chunking-markdown]] design (currently-deployed vs proposed) on the real path. Builds on [[pipeline-eval-harness]] (the scorers) and [[eval-corpus-and-doe]] (the corpus and experiment process).

Status: SCOPED — red-teamed and blue-teamed 2026-06-04; ready for /hl:ship story breakdown. Final scope lives in the [[eval-driven-pipeline-validation-requirements]] contract. The two genuinely-new harness pieces (per-index/hybrid eval, and the documented qualitative comparison) graduate back into [[pipeline-eval-harness]] once settled.


Testing Strategy

This epic is test/eval infrastructure, so its "tests" are the scorers themselves plus guards that they're trustworthy.

Design

The core of the epic: the eval model and how the experiment uses it.

Context

Where this epic sits relative to the rest of the system — what depends on it, what it depends on, and the state of the world today.

Overview

This section frames what the epic is for, the problem it solves, and what is explicitly out of scope.

Decisions Log

| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-06-04 | The eval must measure the path the product runs, not one isolated index. | Today the harness scores vector search only; production runs an agentic multi-index router. We've been optimizing chunking against a metric the product doesn't use. | Keep vector-only recall (cheap but unfaithful) |
| 2026-06-04 | Three eval layers: (a) per-index + hybrid recall (deterministic workhorse), (b) end-to-end agentic recall, (c) answer-quality (qualitative spot-check, never a gate). | Each answers a different question at a different cost; (c) stays non-gate per [[pipeline-eval-harness]]. | One LLM-judge metric (nondeterministic, rejected); vector-only (unfaithful) |
| 2026-06-04 | This epic builds the instrument AND runs the first experiment with it. It does NOT wire any chunker into production. | Building a measuring tool without using it is half the job; the experiment is the payoff and a live end-to-end smoke test. | Defer the experiment to a later epic (instrument never validated against a real decision) |
| 2026-06-04 (red-team) | Per-index eval is reframed as a hybrid A/B: vector alone vs. vector combined with keyword and lookup. | The real question is "how much does vector search cover, and when it misses, do keyword search and direct lookup recover it?" A lone combined number doesn't answer that; the A/B does. | Score only each index in isolation (no recovery story); score only one lumped combination (can't attribute the lift) |
| 2026-06-04 (red-team) | Second prose document = a born-clean expository document (real paragraph breaks), not the NASA file. | NASA's indent-only paragraphs trip the paragraph-collapse bug (which lives in the other epic) and would couple this epic's corpus work to it. NASA stays the geometry/bbox test document. | Keep NASA and accept the cross-epic dependency; pull paragraph-splitting into this epic (violates the Non-Goal) |
| 2026-06-04 (blue-team) | Cut the automated agentic scored harness; lean on the existing manual localhost-vs-prod comparison for the end-to-end read. | A one-time architecture decision doesn't need a repeatable scored agentic metric; the by-hand comparison covers it. The variance spike (which only sized that harness) goes with it. | Build the automated harness (cost not justified for a one-off decision) |
| 2026-06-04 (blue-team) | The deployed-vs-proposed experiment stays IN this epic, as a lean three-way run (deployed / proposed grouping / windowing baseline) scored on the deterministic hybrid scorecard plus the manual read. | It's the payoff, it's a live smoke test of the harness, and it yields a real product-improvement decision. Affordable now that the heavy agentic machinery is cut. | Relocate it to the unified-chunking epic (instrument never closes the loop) |
| 2026-06-04 (blue-team) | The prose-question revalidation hook is in scope as the prerequisite that makes the experiment's re-chunked scores trustworthy. | Prose answer locations are pipeline-generated, so re-chunking silently rots them; the experiment re-chunks. | Skip it (re-chunked scores quietly wrong); relocate it (the experiment that needs it lives here) |
| 2026-06-04 (blue-team) | The precision fix (similarity threshold) is in scope inside the hybrid scorer. | Without it the hybrid A/B can't tell "keyword recovered the answer" from "keyword returned everything", which is the whole point of the comparison. | Ship recall-only (can't back the "recovers without flooding" claim) |
| 2026-06-04 (blue-team) | Second verse document is deferred. | Verse is already validated end-to-end; it's not where the quality pain is. | Onboard it now (corpus breadth for its own sake) |

Goals & Non-Goals

Goals:

  • Upgrade the harness to score the real retrieval system: each index on its own (vector / keyword / lookup) and the hybrid combination of them.
  • Quantify how much vector search covers on its own, and where keyword search and direct lookup recover its misses — especially the prose gaps pure-vector showed (entity-name flooding, diffuse facts).
  • Build corpus breadth (a second prose document) and the supporting maintenance (manifest cleanup, prose-question revalidation).
  • Run the deployed-vs-proposed experiment on the unified-chunking design and produce a written verdict — which doubles as a live end-to-end smoke test of the harness.

Non-Goals:

  • Wiring any chunker into production (separate epic, gated on this epic's verdict). Carve-out: making candidate chunkers runnable for evaluation (offline) IS in scope.
  • An automated, scored agentic harness — cut; the manual localhost-vs-prod comparison covers the end-to-end read.
  • LLM answer-judging as a gate — the qualitative comparison is a documented read, never a gate.
  • Tabular documents as a category (deferred).
  • Indent-aware paragraph splitting — that bug fix lives in the unified-chunking epic. We sidestep it here by choosing a born-clean second prose document.

Problem Statement

Our retrieval numbers (prose 0.80, structured 1.0 recall@10) come from vector search only, but production runs an agentic multi-index router with three tools (vector search, keyword search, direct section lookup). So the metric driving our chunking decisions does not measure the system the product actually uses. Worse, the [[unified-chunking-markdown]] design's central question (does semantic grouping beat sentence-windowing?) is an A/B that can't be answered without (1) a broad corpus and (2) an eval that measures the real combination. We need the instrument before we can trust any chunking decision, including whether to build the unified-chunking design at all.

What Is This Epic?

The epic that makes the pipeline eval faithful and broad, then runs its first real experiment with it. It upgrades the scorer to measure each index plus the hybrid combination, grows the corpus, and then runs the three-way chunking bake-off to decide whether the unified-chunking design earns implementation. It turns the next chunking decision from argument into measurement — and proves the harness works end-to-end by using it.


Dependents

  • [[unified-chunking-markdown]] — its implement/don't-implement decision is the output of this epic's experiment.
  • All future chunking and routing decisions ([[pipeline-eval-harness]] frontier work) — gated on a faithful metric.

Dependencies

  • [[pipeline-eval-harness]] (scorers, baseline, recall core) — exists.
  • [[eval-corpus-and-doe]] (corpus matrix, question-generation pipeline, experiment registry) — exists, partly executed.
  • @autri/retrieval exposing vectorSearch / ftsSearch / lookupSection / route() — exists.
  • retrieval_log with per-tool source attribution — exists (the router already records which tool returned each chunk), which lowers the per-index measurement cost.
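Because the router already records which tool returned each chunk, per-index attribution can be read back from the log instead of re-running every search. The sketch below shows one way to group logged hits by question and tool. The row shape (`questionId`, `chunkId`, `source`) and the helper name `hitsByTool` are hypothetical, for illustration only; the real retrieval_log columns may differ.

```typescript
// Hypothetical row shape: the actual retrieval_log schema is not shown in this doc.
type ToolName = "vector" | "keyword" | "lookup";

interface RetrievalLogRow {
  questionId: string;
  chunkId: string;
  source: ToolName; // which tool returned this chunk
}

// Group logged chunk ids by question, then by the tool that returned them.
function hitsByTool(rows: RetrievalLogRow[]): Map<string, Record<ToolName, string[]>> {
  const grouped = new Map<string, Record<ToolName, string[]>>();
  for (const row of rows) {
    let perTool = grouped.get(row.questionId);
    if (!perTool) {
      perTool = { vector: [], keyword: [], lookup: [] };
      grouped.set(row.questionId, perTool);
    }
    perTool[row.source].push(row.chunkId);
  }
  return grouped;
}

// Synthetic rows, not real log data.
const sampleLogRows: RetrievalLogRow[] = [
  { questionId: "q1", chunkId: "c1", source: "vector" },
  { questionId: "q1", chunkId: "c7", source: "keyword" },
  { questionId: "q2", chunkId: "c3", source: "lookup" },
];
const groupedHits = hitsByTool(sampleLogRows);
console.log(groupedHits.get("q1"));
```

Grouping by question first keeps the output aligned with how a scorecard is read: one question, three columns of hits.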

Current State

  • Eval: ingestion/eval/retrieval-db.ts calls vector search only. No keyword, lookup, or hybrid measurement.
  • Product retrieval — two surfaces:
    • Web app (app/app/docs/[id]/query/actions.ts) → route() → spawns the claude CLI (Haiku, Max-billed) with the doc-search tools attached; that sub-agent picks and composes the three tools and returns an answer plus the combined hits (source-attributed via retrieval_log).
    • Wedge (mcp-servers/doc-search) → the three searches are exposed as tools; the main chat LLM (Claude Desktop / Cursor / Copilot) picks them directly — no Haiku middle layer.
  • Corpus: structured docs format-validated (Constitution markdown plus FIA/STEM PDF). Prose has novel questions (recall 0.80). Verse validated end-to-end (Genesis, geometry-preserving, no LLM cost). Retrieval questions exist on roughly three documents.

Affected Systems

| System / Layer | How It's Affected |
|---|---|
| ingestion/eval/* (harness) | New scorers: per-index + hybrid recall, the precision fix, the registry stamp, the revalidation hook |
| @autri/retrieval (route, the three searches) | Consumed as-is |
| retrieval_log | Read for per-index source attribution (already populated) |
| Corpus and questions (fixtures/segmentation/golden) | A born-clean second prose document added |
| semantic-chunk.ts + a new windowing baseline | Made runnable offline for the experiment (not wired to production) |
| [[unified-chunking-markdown]] | Becomes the system under test in the experiment |

The Eval Model (one layer we build, one we already have)

The eval separates into layers, each answering a different question at a different cost and cadence.

| Layer | What it runs | Scored on | Deterministic? | Status |
|---|---|---|---|---|
| (a) per-index + hybrid recall | each index alone (vector / keyword / lookup), plus vector combined with the others | span overlap vs. gold (existing rule) | Yes (pinned embedder) | build (story 2); the workhorse |
| (c) qualitative comparison | the full real answer, branch vs. prod, read by a human | a side-by-side read of the answers | No | already have; document it (story 8) |

The automated, scored agentic layer, (b) in the original three-layer plan, was cut: for a one-time decision the manual (c) comparison covers the end-to-end read.

Coverage vs. behavior. Per-index (a) is a coverage ceiling: "could this index, run on its own, surface the gold answer." It is not router behavior — the live router only invokes a given index for certain question shapes, so isolated keyword recall can overstate what the system does. The end-to-end behavior is what the manual (c) comparison shows. So (a) tells us where each index has coverage to give; (c) tells us how the real answers actually differ.

The Hybrid Scorer (the headline capability)

Generalize the retrieval scorer (retrieval-db.ts) from vector-only to run each index on its own — vector (semantic), keyword (full-text), and direct lookup — reporting recall and mean-reciprocal-rank for each. Then run the hybrid A/B: vector alone vs. vector combined with the other two. The question this answers, in plain terms: how much does vector search cover on its own, and when it misses something, do keyword search and direct lookup pick up the slack? It is deterministic and cheap, so it can be swept on every chunking config. Output: a per-index + hybrid scorecard, broken out by question type, that quantifies where vector falls short on prose and whether the other two indexes recover it.
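A minimal sketch of what the per-index + hybrid scorecard could compute is below. It makes several simplifying assumptions that are not taken from the harness: gold answers are matched by chunk id rather than by the existing span-overlap rule, the hybrid arm counts a hit when any of the three indexes surfaces a gold chunk in its own top k, and every name (`EvalQuestion`, `buildScorecard`, and so on) is hypothetical. The per-question-type breakdown is omitted for brevity.

```typescript
type IndexName = "vector" | "keyword" | "lookup";
const INDEXES: IndexName[] = ["vector", "keyword", "lookup"];

// Hypothetical input shape: each index's ranked chunk ids, best first.
interface EvalQuestion {
  id: string;
  goldChunkIds: string[];
  ranked: Record<IndexName, string[]>;
}

// 1 if any gold chunk appears in the top k, else 0.
function hitAtK(ranked: string[], gold: string[], k: number): number {
  return ranked.slice(0, k).some((id) => gold.indexOf(id) !== -1) ? 1 : 0;
}

// Reciprocal of the 1-based rank of the first gold chunk in the top k; 0 if none.
function reciprocalRank(ranked: string[], gold: string[], k: number): number {
  const i = ranked.slice(0, k).findIndex((id) => gold.indexOf(id) !== -1);
  return i === -1 ? 0 : 1 / (i + 1);
}

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

function buildScorecard(questions: EvalQuestion[], k: number) {
  const perIndex = {} as Record<IndexName, { recall: number; mrr: number }>;
  for (const ix of INDEXES) {
    perIndex[ix] = {
      recall: mean(questions.map((q) => hitAtK(q.ranked[ix], q.goldChunkIds, k))),
      mrr: mean(questions.map((q) => reciprocalRank(q.ranked[ix], q.goldChunkIds, k))),
    };
  }
  // Hybrid arm (assumed policy): a hit if ANY index finds gold in its own top k.
  const hybridRecall = mean(
    questions.map((q) =>
      INDEXES.some((ix) => hitAtK(q.ranked[ix], q.goldChunkIds, k) === 1) ? 1 : 0
    )
  );
  // The A/B read: how much the other two indexes add on top of vector alone.
  return { perIndex, hybridRecall, lift: hybridRecall - perIndex.vector.recall };
}

// Synthetic gold, not real corpus results.
const sampleQuestions: EvalQuestion[] = [
  { id: "q1", goldChunkIds: ["c1"], ranked: { vector: ["c1", "c2"], keyword: ["c9"], lookup: [] } },
  { id: "q2", goldChunkIds: ["c5"], ranked: { vector: ["c2", "c3"], keyword: ["c4", "c5"], lookup: [] } },
];
const card = buildScorecard(sampleQuestions, 10);
console.log(card.perIndex.vector.recall, card.hybridRecall, card.lift); // 0.5 1 0.5
```

In the synthetic example vector misses q2 and keyword recovers it, so the lift of 0.5 is exactly the "do the other indexes pick up the slack" number. A union-style hybrid is a coverage measure only; how the real combination ranks merged results is a separate design choice.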

The scorer also gets the precision fix (a similarity threshold, retrieval-db.ts:44) so that negative / out-of-scope questions are scoreable — without it, the A/B can't distinguish "keyword recovered the answer" from "keyword returned everything."
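One way to picture the precision fix: drop hits below a similarity floor, then treat a negative question as passed only when nothing survives. This is a sketch under stated assumptions, not the code at retrieval-db.ts:44. The `RetrievalHit` shape, the helper names, the assumption that similarity lies in [0, 1] with higher meaning closer, and the 0.75 floor are all hypothetical.

```typescript
// Hypothetical hit shape; similarity assumed to be in [0, 1], higher = closer.
interface RetrievalHit {
  chunkId: string;
  similarity: number;
}

// Keep only hits at or above the similarity floor.
function applyThreshold(hits: RetrievalHit[], minSimilarity: number): RetrievalHit[] {
  return hits.filter((h) => h.similarity >= minSimilarity);
}

// A negative (out-of-scope) question passes only when no hit survives the floor.
function negativeQuestionPasses(hits: RetrievalHit[], minSimilarity: number): boolean {
  return applyThreshold(hits, minSimilarity).length === 0;
}

// Synthetic hits for an out-of-scope question: everything is a weak match.
const weakHits: RetrievalHit[] = [
  { chunkId: "c4", similarity: 0.41 },
  { chunkId: "c8", similarity: 0.38 },
];

const EXAMPLE_FLOOR = 0.75; // illustrative value only, not a tuned threshold
console.log(negativeQuestionPasses(weakHits, EXAMPLE_FLOOR)); // true
console.log(negativeQuestionPasses(weakHits, 0)); // false: with no floor, every hit counts
```

Without a floor, an index that returns something for every query can never pass a negative question, which is why recall alone cannot separate recovery from flooding.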

The Experiment (the payoff + smoke test)

Make the candidate chunkers runnable offline — the proposed semantic-grouping chunker (semantic-chunk.ts, prototyped) and a new sentence-windowing baseline — behind an evaluation flag (instrument glue, not production wiring). Then run the three-way bake-off on the corpus: currently-deployed vs. proposed grouping vs. windowing baseline, scored on the hybrid scorecard and read with the manual comparison. Run the prose-question revalidation hook first, since re-chunking can rot prose answer locations.

The output is a written verdict on whether — and which parts of — the unified-chunking design earn implementation. Running this end-to-end also serves as the harness's smoke test: if corpus → re-chunk → score → read works cleanly, the instrument is proven.
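The revalidation step that precedes scoring could look roughly like the sketch below: for each prose question, check whether its gold answer text still appears inside some chunk of the re-chunked document, and flag the rest for human review. The shapes (`ProseQuestion`, `EvalChunk`) and the text-containment check are assumptions for illustration; the real hook may compare spans or offsets instead.

```typescript
// Hypothetical shapes for illustration only.
interface ProseQuestion {
  id: string;
  goldText: string; // the answer passage the question was generated from
}

interface EvalChunk {
  id: string;
  text: string;
}

// Collapse whitespace and case so reflowed text still matches.
function normalizeText(s: string): string {
  return s.replace(/\s+/g, " ").trim().toLowerCase();
}

// Re-locate each gold answer in the new chunks; unlocatable questions are stale.
function revalidate(questions: ProseQuestion[], chunks: EvalChunk[]) {
  const valid: { questionId: string; chunkIds: string[] }[] = [];
  const stale: string[] = [];
  for (const q of questions) {
    const needle = normalizeText(q.goldText);
    const chunkIds = chunks
      .filter((c) => normalizeText(c.text).indexOf(needle) !== -1)
      .map((c) => c.id);
    if (chunkIds.length > 0) {
      valid.push({ questionId: q.id, chunkIds });
    } else {
      stale.push(q.id);
    }
  }
  return { valid, stale };
}

// Synthetic data, not corpus content.
const rechunked: EvalChunk[] = [
  { id: "a", text: "The  treaty was signed\nin 1848." },
  { id: "b", text: "An unrelated paragraph." },
];
const proseQuestions: ProseQuestion[] = [
  { id: "p1", goldText: "treaty was signed in 1848" },
  { id: "p2", goldText: "signed in 1901" },
];
const revalidation = revalidate(proseQuestions, rechunked);
console.log(revalidation.stale); // [ 'p2' ]
```

A gold passage that a new chunker splits across two chunks would also land in `stale` under this check, which is a reasonable outcome here: question changes are reviewed by a human rather than repaired automatically.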

Data Model Changes

  • A light experiment-registry stamp records which corpus, git commit, and config produced each scorecard. Likely stored as an in-repo JSON artifact.
  • No production schema change.
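As an illustration of how small the stamp can be, the sketch below wraps a scorecard with its provenance and serializes it to JSON. The field names (`corpusId`, `gitCommit`, `config`), the validation rules, and the sample values are all hypothetical; the actual artifact format is still to be decided.

```typescript
// Hypothetical stamp shape; the real registry fields are not settled.
interface ScorecardStamp {
  corpusId: string;
  gitCommit: string;
  config: { [key: string]: string | number | boolean };
}

interface StampedScorecard<T> {
  stamp: ScorecardStamp;
  scorecard: T;
}

// Refuse to produce an artifact whose provenance is obviously unusable.
function stampScorecard<T>(scorecard: T, stamp: ScorecardStamp): StampedScorecard<T> {
  if (!/^[0-9a-f]{7,40}$/.test(stamp.gitCommit)) {
    throw new Error("gitCommit must be a 7-40 character lowercase hex SHA");
  }
  if (stamp.corpusId.trim() === "") {
    throw new Error("corpusId must not be empty");
  }
  return { stamp, scorecard };
}

// Placeholder values, not results from a real run.
const stamped = stampScorecard(
  { hybridRecall: 1 },
  { corpusId: "corpus-example", gitCommit: "0a1b2c3", config: { k: 10, chunker: "deployed" } }
);
const artifactJson = JSON.stringify(stamped, null, 2);
console.log(artifactJson);
```

Keeping the stamp beside the scorecard in one JSON document means a verdict can always be traced to the corpus and commit that produced it, with no database involved.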

Edge Cases & Gotchas

| Scenario | Expected Behavior | Why It's Tricky |
|---|---|---|
| Keyword search appears to "win" by returning everything | The scorecard must include negative / out-of-scope questions to measure precision (needs the threshold fix) | Recall-only per-index is gameable |
| A per-index "win" read as router behavior | Treat (a) as a coverage ceiling; the real behavior is the manual comparison | Isolated keyword results overstate what the system does, because the live router only invokes keyword search for some question shapes |
| Prose questions drift when the experiment re-chunks | Run the revalidation hook before scoring the proposed and windowing configs | Prose answer locations are pipeline-generated, not authored |
| The manual comparison only tests the current router path | To test the "no router needed" hypothesis, point the comparison at the wedge / main-chat-LLM path (local doc-search tools in Claude Desktop / Cursor) | The web-app path runs the Haiku router; the wedge path is the no-router driver |
| The experiment ties (grouping ≈ windowing) | Report the tie as the verdict: it means windowing wins on simplicity and the grouping complexity isn't earned | A tie is a real, valuable result, not a failure |

Test Layers

| Layer | Applies? | Notes |
|---|---|---|
| Unit tests | Yes | Pure scorer cores (per-index + hybrid recall) tested offline with synthetic gold, like retrieval.test.ts |
| Integration (evaluation mode) | Yes | The hybrid scorer run against the live DB; never the per-commit gate |
| Determinism guard | Yes | The hybrid scorer must be reproducible (pinned embedder) |
| Qualitative comparison | Spot-check | The manual branch-vs-prod read; documented, not gated |
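The determinism guard can be as simple as running the scorer more than once and failing if the serialized output differs. The sketch below assumes a synchronous scorer function and compares `JSON.stringify` output; the helper name `assertDeterministic` is hypothetical, and a real scorer that hits the database would need an async variant.

```typescript
// Run the scorer `times` times and throw if any run's output differs from the first.
function assertDeterministic<T>(run: () => T, times: number = 2): T {
  const first = run();
  const baseline = JSON.stringify(first);
  for (let i = 1; i < times; i++) {
    if (JSON.stringify(run()) !== baseline) {
      throw new Error("scorer output changed on run " + (i + 1));
    }
  }
  return first;
}

// A stand-in scorer with fixed output (synthetic numbers, not real results).
const stableScorer = () => ({ recall: 0.5, mrr: 0.25 });
console.log(assertDeterministic(stableScorer, 3));

// A stand-in scorer that drifts between runs, as an unpinned embedder might.
let driftCounter = 0;
const driftingScorer = () => ({ recall: 0.5 + 0.01 * driftCounter++ });
try {
  assertDeterministic(driftingScorer);
} catch (err) {
  console.log((err as Error).message); // scorer output changed on run 2
}
```

Comparing serialized output is strict on purpose: any reordering of results or change in a float counts as drift, which is the behavior a gate needs.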

Verification Rules

  1. The hybrid scorer stays deterministic and may gate; the qualitative comparison never gates.
  2. Every scorecard includes negative / out-of-scope questions.
  3. The experiment verdict is corpus- and commit-stamped.
  4. Question changes are reviewed by a human (correct-as-you-curate).

Stories

Scope settled by blue-team — see the [[eval-driven-pipeline-validation-requirements]] contract for full acceptance criteria and the out-of-scope list. Final scoped stories:

| Story | Summary | Status | PR |
|---|---|---|---|
| 1 | Corpus: onboard a born-clean 2nd prose document + its retrieval questions; backfill manifest category + audit coverage | | |
| 2 | Per-index + hybrid A/B scorer (vector alone vs. vector + keyword + lookup) + per-question-type coverage scorecard | | |
| 3 | Precision fix: similarity threshold (retrieval-db.ts:44) so negative / out-of-scope questions are scoreable | | |
| 4 | Light experiment-registry stamp (corpus + git commit + config → scorecard) | | |
| 5 | Prose-question revalidation hook (re-check answer locations after a re-chunk) | | |
| 6 | Make candidate chunkers runnable offline (proposed semantic-grouping + new windowing baseline) | | |
| 7 | Run the three-way experiment (deployed / grouping / windowing) → written verdict + harness smoke test | | |
| 8 | Document the manual localhost-vs-prod qualitative comparison (both driver paths) | | |

Sequencing: build the instrument (1–4) → prepare for re-chunking (5, 6) → run the experiment (7, read with 8). Acceptance criteria get finalized during /hl:ship Step 1.

Cut/relocated (do not re-add): 2nd verse document (deferred), variance spike (cut), automated agentic scored harness (cut), repeatable qualitative-comparison tooling (deferred), production wiring of any chunker (separate epic). Full list in the requirements contract.


Open Questions / Red-Team Targets

Directional steer (2026-06-04, Dan): the chatting agent's tool selection should be the main chat LLM's job, not a separate Haiku sub-router. Once a question is asked, the main agent doing the chatting should choose among vector search / keyword search / direct lookup itself (the wedge pattern) — we should not interpose a sub-agent that picks tools on its behalf. This epic tests that hypothesis via the manual comparison pointed at the wedge path; the actual web-app migration off route() is a separate downstream decision.

Resolved by red-team / blue-team (2026-06-04)

  • Is an automated agentic eval worth building? No — cut for a one-time decision; the manual comparison covers the end-to-end read.
  • Does the hybrid combination need its own scoring? Yes. It's the headline capability (story 2): vector alone vs. vector + keyword + lookup.
  • Does the experiment belong in this epic? Yes — it's the payoff and a live smoke test; kept in as a lean three-way run.

Still open (small, for /hl:ship or later)

  • Web-app migration off the Haiku router to the main-LLM-picks-tools pattern — a separate decision, informed by this epic's wedge-path comparison.
  • Whether the qualitative comparison ever becomes a repeatable script — by-hand for now; revisit only if run constantly.

Known Issues / Tech Debt

| Issue | Severity | Notes |
|---|---|---|
| Negative / precision questions are unscoreable (no similarity threshold) | Medium | retrieval-db.ts:44; fixed by story 3 |
| Prose questions are chunker-specific | Medium | The revalidation hook (story 5) must run before the experiment scores a re-chunked config |
| The sentence-windowing baseline chunker doesn't exist yet | Medium | A new build inside story 6; it's the third arm of the experiment |
| The two retrieval surfaces (web-app router vs. wedge) can diverge | | By design; the manual comparison can be pointed at either driver |

Epic doc — refined collaboratively, red-teamed and blue-teamed 2026-06-04. Final scope is the [[eval-driven-pipeline-validation-requirements]] contract. The new harness capabilities graduate into [[pipeline-eval-harness]] once settled. Next: /hl:ship story breakdown.
