Epic: Source Legibility & Bbox Trust

Created 2026-06-16 from the local-app shakedown (FIA Section-C technical regs PDF). Refined + red/blue-teamed 2026-06-16: scope locked at W1 + W2 + W3; decisions (D1–D8) + story split below. Not on the Brehob go-live critical path — see Roadmap Fit.

Why this epic

The inspector's promise is trust through legibility: a user can see exactly where in the source document each retrieved chunk came from, so they can vet quality before trusting an answer. Today that promise holds for clean PDFs but breaks in two places:

Converted docs (docx/xlsx) have no source view at all — they ingest as text with no rendered page, so the inspector shows the chunk text but a blank source pane. There is no "where did this come from" signal for these types.
Dense PDFs produce noisy/oversized bounding boxes. Surfaced on the FIA regs: list/table-heavy clauses get grouped into one chunk (one had 72 bbox regions), so a "section" looks as if the whole page is covered in boxes.

Two acute fixes already shipped (see below); this epic captures the deeper, deterministic work to make source legibility solid across all doc types, plus a way to measure bbox quality so it can't silently regress.

Already shipped (2026-06-16, context — not part of the remaining scope)

DocPanel active-only overlay — the chat source viewer now draws only the clicked citation's region(s), not every chunk cited from that page. Immediately de-clutters dense pages. (app/components/chat/DocPanel.tsx.)
Layout-aware bbox column clustering — clusterUnionByPageAndColumn now groups a chunk's regions into columns by actual horizontal overlap instead of a fixed page-midline (which spuriously split wide single-column docs into side-by-side boxes). Single-column → one box; genuine two-column (Bible) → two. Pure + unit-tested in the gate. Forward-looking: existing docs need a re-ingest to get the tighter boxes. (ingestion/extractor/extractor.ts.)
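The overlap-based column grouping described above can be sketched as follows. This is a minimal illustration, not the shipped clusterUnionByPageAndColumn: the Region shape and the function names are assumptions, and the real implementation in ingestion/extractor/extractor.ts may differ.

```typescript
// Hypothetical region shape: normalized 0-1 page coordinates (see D1).
interface Region { page: number; x: number; y: number; w: number; h: number }

// True when the two regions share any horizontal extent.
function overlapsHorizontally(a: Region, b: Region): boolean {
  return a.x < b.x + b.w && b.x < a.x + a.w;
}

// Smallest rectangle covering both regions (same page assumed).
function unionOf(a: Region, b: Region): Region {
  const x = Math.min(a.x, b.x);
  const y = Math.min(a.y, b.y);
  const right = Math.max(a.x + a.w, b.x + b.w);
  const bottom = Math.max(a.y + a.h, b.y + b.h);
  return { page: a.page, x, y, w: right - x, h: bottom - y };
}

// Groups regions into one box per (page, column), where a "column" is a set
// of regions whose horizontal extents overlap. No page midline is involved.
// Sorting by left edge means only the most recent cluster can overlap the
// next region, so a single pass is enough.
function clusterByHorizontalOverlap(regions: Region[]): Region[] {
  const sorted = [...regions].sort((a, b) => a.page - b.page || a.x - b.x);
  const clusters: Region[] = [];
  for (const r of sorted) {
    const last = clusters[clusters.length - 1];
    if (last && last.page === r.page && overlapsHorizontally(last, r)) {
      clusters[clusters.length - 1] = unionOf(last, r);
    } else {
      clusters.push({ ...r });
    }
  }
  return clusters;
}
```

With this shape, two wide paragraphs on a single-column page collapse into one box, while left-column and right-column regions of a two-column page stay separate because their x-extents never intersect.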

Decisions (locked — red-team 2026-06-16)

Red-teamed against the real repo (chunk source model, the markdown/paragraph pipeline, the eval harness, the inspector render), not the doc's own assertions. Two factual claims were corrected (see F1/F2 in W1). Scope: W1 + W2 + W3 all this round — W2 was reconsidered in, because measuring bad boxes (W3) without fixing the cause (W2) is only half the closed loop.

D1 — Source-locator model: two optional locators, keyed by the existing sourceType. Keep bbox (pixel rectangles, normalized 0–1 {page,x,y,w,h}) for PDFs; add a text-range locator (sourceText + charStart/charEnd) for converted docs. The inspector branches on sourceType, which is already on every citation (it already forks today — blank pane for non-PDF). No discriminated-union refactor: bbox already flows as opaque unknown, so a union would buy modeling taste at the cost of cross-cutting churn (retrieval + app + DB).
D2 — W1 highlight = preformatted text + highlight span (v1). Render the stored markdown as preformatted/monospaced text and highlight the char-range directly. Trivially correct on offsets; sidesteps the real problem that a raw char-offset into markdown source does NOT map cleanly onto react-markdown's rendered DOM (naive substring-and-wrap breaks syntax spanning the boundary — tables, bold). Pretty-rendered markdown is a deferred fidelity upgrade. Mirrors this epic's own "deterministic-first, render-to-image-later" instinct.
D3 — W1 storage = persist at ingest, against the FULL markdown. Store the normalized markdown once (per page/chapter/sheet) and stamp each chunk's char-range at write time. The offset math already exists (eval/span.ts: buildSourceModel + spanForParagraphKeys) and every grouping already carries paragraph_ids — but (a) nothing is persisted (the markdown is cache-only; paragraph_ids aren't on chunks) and (b) the eval's buildSourceModel is body-only (drops headings/footnotes), so reusing it as-is would compute offsets into the wrong string. Persisting the full markdown at ingest avoids read-time reconstruction AND the body-only trap. Cost: a schema change + a re-ingest of existing converted docs.
D4 — W3 metrics are invariants / self-consistency checks, NOT corpus-calibrated baselines. We do not calibrate thresholds off the corpus distribution — a systematically-bad corpus would bake the bug in as "normal." Each check asks "is this locator geometrically consistent with its own chunk": out-of-bounds (coords outside 0–1) → zero tolerance; cross-chunk overlap (two chunks claiming the same pixels) → should be ~0, flag any meaningful overlap (epsilon for rounding); tightness/fill (box area vs the actual word-ink inside it) → flag loose boxes covering whitespace (the over-grouping symptom); fragmentation/region-count (one chunk, many scattered regions — the FIA 72-region case) → the over-grouping smell. The few numeric thresholds (fill floor, overlap epsilon) come from geometric first principles, sanity-checked by eyeballing known cases (FIA 72-region should trip; a clean body paragraph shouldn't). Corpus = examples to look at, never the statistical ground truth.
D5 — W3 covers BOTH locator kinds. Pixel checks (above) AND text-range checks (range in-bounds, range-length vs chunk-content-length, no cross-chunk range overlap). W1 ships text-ranges this round, so the eval covers them from day one.
D6 — W3 fill signal = stamp the true fill-ratio at ingest. The precise box-vs-word-ink ratio needs the raw per-word boxes, but a chunk stores only the clustered union regions, not the underlying word boxes. Compute the ratio at ingest where the word boxes are still in hand and store it; the eval reads it. Rides D3's ingest-stamping work.
D7 — W3 is a warning report first, not a hard gate. Print it as a scorecard dimension (flagged-chunk list + per-issue tallies); don't fail the build. Trust layer, off the critical path; promote to a hard CI gate once thresholds prove stable across the corpus.
D8 — W2 detection = spike first. A wave-1 spike on the PDF-vision path to detect a splittable list/table from layout signals (runs of short paragraphs, bullet/number prefixes, x-aligned word columns) on the real FIA doc, BEFORE committing the chunker change — de-risks the heuristic and the over-chunking risk. Mirrors the S2 (markdown) table-chunking spike that worked. W3's scorecard verifies the effect.
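To make D1 concrete, here is a minimal sketch of "two optional locators, keyed by sourceType". The type names, the sourceType values, and chooseRenderMode are all hypothetical; only the bbox fields ({page,x,y,w,h}) and charStart/charEnd come from the decision text. The text locator is modeled as a list of ranges, which is an assumption of this sketch.

```typescript
// Assumed sourceType values; the real set lives in the repo.
type SourceType = "pdf" | "docx" | "xlsx";

// Pixel locator (D1): normalized 0-1 rectangle on a rendered page.
interface BboxRegion { page: number; x: number; y: number; w: number; h: number }

// Text-range locator (D1): character offsets into the stored source text.
interface TextRange { charStart: number; charEnd: number }

// Hypothetical citation shape: both locators optional, sourceType always present.
interface CitationLocators {
  sourceType: SourceType;
  bbox?: BboxRegion[];
  textRanges?: TextRange[];
}

type RenderMode =
  | "image-with-boxes"
  | "image-only"
  | "text-with-highlight"
  | "text-only";

// Picks how the inspector would render a citation, branching on sourceType
// and degrading gracefully when a locator is absent.
function chooseRenderMode(c: CitationLocators): RenderMode {
  if (c.sourceType === "pdf") {
    return c.bbox && c.bbox.length > 0 ? "image-with-boxes" : "image-only";
  }
  return c.textRanges && c.textRanges.length > 0
    ? "text-with-highlight"
    : "text-only";
}
```

Because both locators are optional fields rather than a discriminated union, existing code that passes bbox through as an opaque value keeps compiling; only the inspector needs to know about the branch.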

Decisions — S3 red-team (2026-06-17, converted-doc source view)

Locked at the S3 red-team, grounded against the real repo (parse-docx.ts / parse-xlsx.ts, structure.ts, extractor.ts writeExtraction, the chunks schema, DocPanel.tsx, the retrieval enrich layer). These extend D1–D3 with build-level specifics for S3 + S4.

D9 — anchor on the paragraph stream, not "markdown" (corrects the W1 premise; see F3). docx has no markdown intermediate (paragraph JSON only); xlsx alone produces <slug>.md. The one artifact present for both types is the paragraph stream, so the canonical per-page source is built from it — one code path for both.
D10 — ONE canonical per-page source string, built AND measured in the same write-time pass. Three non-matching "source strings" exist in the repo today: pages.parsed_text (single-newline join, headings in, boilerplate out), eval/span.ts buildSourceModel (body-only, double-newline), and the xlsx <slug>.md (real markdown). Any offset computed against one mis-highlights when another is rendered. Resolution: a single function builds the canonical string AND stamps every chunk's char-range into it in one pass at ingest; persist both; the inspector renders exactly that string. Never recompute offsets at read time. (Chosen over "xlsx uses its .md, docx reconstructs" = two coordinate systems, and over "store a snippet only" = no page context.)
D11 — highlight the chunk's ACTUAL covered paragraph spans (a list of ranges), not a single min..max range. spanForParagraphKeys returns min..max, which visibly over-highlights a non-contiguous grouping (paints paragraphs the chunk doesn't contain). The covered ids are in hand at write time, so storing the real spans is cheap and always correct — fixes L2 at the source, not merely as an eval flag (S4).
D12 — deliver the source the way PDFs deliver page images: small locator inline, heavy source fetched on demand. PDFs ride bbox coordinates inline on the hit and fetch the page PNG on demand via /api/cache/[...path]. Mirror it: the char-range (small) rides inline on the hit/citation; the full per-page canonical source is written to the cache store at ingest (like the page PNGs and the xlsx .md) and fetched on demand through the existing /api/cache route — zero new routes, no chat-payload bloat. (Chosen over a new DB-backed route, and over inlining the full source on every hit.)
L1 confirmed IN S3 scope (not deferred): xlsx row chunks already exist (S2), so the row-chunk char-range — it must land on its own table row — is a build-and-verify item this story, tested on the real Slate pricing sheet, not a hypothetical.
Inspector fallback (pre-re-ingest): the text-range locator is optional end-to-end; a converted chunk with no locator yet renders the source with no highlight (or the current state), never a hard blank. Re-ingest of existing converted docs is non-urgent and rides the bundled deploy.
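D10 and D11 together amount to "build the string and measure it in the same pass, and store real spans rather than min..max". A minimal sketch under stated assumptions: Paragraph, Grouping, Span, and buildCanonicalSource are hypothetical names, the "\n\n" separator is a placeholder, and offsets are JavaScript string indices (UTF-16 code units).

```typescript
interface Paragraph { id: string; text: string }
interface Grouping { chunkId: string; paragraphIds: string[] }
interface Span { charStart: number; charEnd: number } // end is exclusive

// Builds the per-page source string and, in the same pass, records where
// each paragraph landed. Chunk spans are then derived from those offsets,
// so every offset indexes exactly the string that will be rendered.
function buildCanonicalSource(
  paragraphs: Paragraph[],
  groupings: Grouping[],
  separator = "\n\n", // assumed separator, not the repo's actual choice
): { source: string; spansByChunk: Map<string, Span[]> } {
  const indexById = new Map<string, number>();
  const offsets: Span[] = [];
  let source = "";
  paragraphs.forEach((p, i) => {
    if (i > 0) source += separator;
    const charStart = source.length;
    source += p.text;
    offsets.push({ charStart, charEnd: source.length });
    indexById.set(p.id, i);
  });

  const spansByChunk = new Map<string, Span[]>();
  for (const g of groupings) {
    // Resolve ids to positions, drop unknown ids and duplicates, sort.
    const indices = Array.from(
      new Set(
        g.paragraphIds
          .map((id) => indexById.get(id))
          .filter((i): i is number => i !== undefined),
      ),
    ).sort((a, b) => a - b);

    const spans: Span[] = [];
    let prev = -2;
    for (const i of indices) {
      const o = offsets[i];
      if (!o) continue;
      const lastSpan = spans[spans.length - 1];
      // Adjacent paragraphs extend the current span (separator included);
      // a gap starts a new span, so skipped paragraphs are never painted.
      if (lastSpan && i === prev + 1) lastSpan.charEnd = o.charEnd;
      else spans.push({ charStart: o.charStart, charEnd: o.charEnd });
      prev = i;
    }
    spansByChunk.set(g.chunkId, spans);
  }
  return { source, spansByChunk };
}
```

A grouping that covers paragraphs 1 and 3 but not 2 yields two spans here, whereas a min..max range would also highlight paragraph 2.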

Story split & sequencing

Four stories, dependency-ordered (locked at red-team). W3-pixel leads so the measurement instrument exists before the fix; W1 + W3-text follow. (Status 2026-06-17: S1 done; S2 redirected from list/table-granularity to figure-bbox precision and substantially done — see the W2 annotations; S3 is next, S4 follows.)

S1 — W3 pixel-sanity scorecard. Gold-free evaluateStructuralSanityKb in the eval harness (pure-scorer + DB-half + run-retrieval.ts print, mirroring evaluateStructuredFilterKb). Pixel invariants from D4 over chunks.bbox. Independent of W1/W2 — immediate regression guard (would have caught both 6/16 bugs). Sequencing note: S1 ships the region-only invariants (out-of-bounds, cross-chunk overlap, region-count, area-vs-content proxy) with no ingest change; the precise fill-ratio stamp (D6) rides the first story that touches ingest (S2 or S3) and tightens the tightness check then.
S2 — W2 PDF list/table granularity. Spike the detection heuristic (D8) on the FIA doc, then split over-grouped lists/tables on the vision path. The PDF-vision analogue of the S2 markdown table-awareness (markdown path today). Verified by S1's scorecard (region count ↓, fill ↑) — closed loop. (Redirected at the 6/17 cross-corpus read: no list/table over-grouping exists; the real debt was figure-bbox over-sizing — see the W2 annotations.)
S3 — W1 converted-doc source view (acceptance refined at the S3 red-team — D9–D12). (1) At ingest, build the canonical per-page source string from the paragraph stream and stamp each chunk's covered paragraph spans into it in the same writeExtraction pass; persist the string to the cache store and the spans on the chunk (schema change). (2) Thread the text-range locator through the retrieval enrich layer → EnrichedHit → Citation, keyed by sourceType (D1). (3) DocPanel gains a real sourceType branch (today it forks on hasImage): PDF → image + bbox; converted → fetch the per-page source via /api/cache and highlight the covered spans (D2/D11/D12). (4) Re-ingest existing converted docs; the locator is optional — absent → render the source with no highlight, never a hard blank. (5) Verify a row chunk's range lands on its own row on the real Slate sheet (L1). Verified by S4's scorecard.
S4 — W3 text-range checks. Extend the scorecard with the text-range invariants (D5) once W1 emits locators: range in-bounds (end ≤ source length); range covers the chunk's paragraphs (NOT exact length-match — chunk content ≠ source substring due to separators + stripHeadingPrefix); no cross-chunk range overlap. The covered-spans model (D11) should make over-cover structurally impossible — this stays as the guard.
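Two of the S4 invariants (range in-bounds, no cross-chunk overlap) are simple enough to sketch directly. The names ChunkSpans, RangeIssue, and checkTextRanges are hypothetical, and the "range covers the chunk's paragraphs" check is left out because it depends on the paragraph map held at ingest.

```typescript
interface Span { charStart: number; charEnd: number } // end is exclusive
interface ChunkSpans { chunkId: string; spans: Span[] }
interface RangeIssue {
  kind: "out-of-bounds" | "cross-chunk-overlap";
  chunkId: string;
  detail: string;
}

// Gold-free checks: each asks only whether a locator is consistent with the
// source it points into and with its sibling chunks.
function checkTextRanges(sourceLength: number, chunks: ChunkSpans[]): RangeIssue[] {
  const issues: RangeIssue[] = [];

  // Invariant 1: every span is a non-empty range inside the source.
  for (const c of chunks) {
    for (const s of c.spans) {
      if (s.charStart < 0 || s.charEnd > sourceLength || s.charStart >= s.charEnd) {
        issues.push({
          kind: "out-of-bounds",
          chunkId: c.chunkId,
          detail: `[${s.charStart}, ${s.charEnd}) vs source length ${sourceLength}`,
        });
      }
    }
  }

  // Invariant 2: no two chunks claim the same characters.
  // Pairwise comparison is quadratic but fine for a per-page report.
  for (let i = 0; i < chunks.length; i++) {
    for (let j = i + 1; j < chunks.length; j++) {
      const a = chunks[i];
      const b = chunks[j];
      if (!a || !b) continue;
      const clash = a.spans.some((sa) =>
        b.spans.some((sb) => sa.charStart < sb.charEnd && sb.charStart < sa.charEnd),
      );
      if (clash) {
        issues.push({
          kind: "cross-chunk-overlap",
          chunkId: a.chunkId,
          detail: `overlaps ${b.chunkId}`,
        });
      }
    }
  }
  return issues;
}
```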

Red-team each story's acceptance criteria when it's picked up — the above is epic-level scoping, not story-level.

W1. Source view for converted docs (docx/xlsx) — markdown + char-span highlight

Every converted doc reaches the chunker as a paragraph stream. The deterministic chunker splits it, and both routes (code and prose/LLM) reference their paragraphs by paragraph_ids: the prose/LLM route picks which IDs group together, and code computes the rest.

Approach (locked D1–D3, refined by the S3 red-team 2026-06-17 → D9–D12): at ingest, build ONE canonical per-page source string from the paragraph stream and stamp each chunk's covered paragraph spans into that exact string in the same writeExtraction pass; persist both. The inspector renders that string as preformatted text and highlights the chunk's covered spans (D2/D11). The "bbox" for a converted doc becomes a text-range locator keyed by the existing sourceType (D1).
F1 (correction, 6/16): NOT "no new infra" — it needs a schema change (store the per-page source + per-chunk range) and a re-ingest of existing converted docs. The accurate claim is no new rendering infra / no LibreOffice.
F2 (correction, 6/16): the earlier "code-route knows markdown spans; llm-route maps via paragraph IDs" is wrong — both routes map via paragraph_ids; the mechanism is symmetric.
F3 (correction, S3 red-team 6/17): the earlier "every docx/xlsx normalizes to a markdown representation during ingest" is wrong against the repo. Only xlsx produces a markdown artifact — parse-xlsx.ts writes <slug>.md to the cache store today, already "so the inspector can show the converted source." docx emits per-chapter paragraph JSON directly (parse-docx.ts); there is no markdown intermediate. The one artifact present for both types is the paragraph stream, so W1 anchors there, not on "markdown." → D9.
What already exists vs missing: paragraph_ids are on every grouping, and writeExtraction already holds them + the paragraph map when it builds bboxes (extractor.ts) — so a char-range can be stamped in the same loop. pages.parsed_text is already written per page at ingest (a paragraph-text join) — a near-miss for the canonical string, but its exact form is one of three non-matching strings (see D10). Missing: a single offset-stable canonical string + the per-chunk spans, persisted. Do NOT reuse eval/span.ts buildSourceModel for display — it is body-only (drops headings/footnotes) with a different separator, so its offsets index a different string than the one rendered (the D3 body-only trap; D10 generalizes it to all three strings).
Rejected (for now): render docx/xlsx to a page image (LibreOffice) + pixel bboxes. Heavier infra; spreadsheets don't paginate cleanly. Revisit only as a fidelity upgrade if a doc's original visual layout carries meaning.
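The "preformatted text + highlight span" render (D2/D11) reduces to splitting the stored string into highlighted and plain segments, which a component can then map to marked and unmarked text inside a preformatted block. A minimal sketch; Segment and segmentForHighlight are hypothetical names, not existing DocPanel code.

```typescript
interface Span { charStart: number; charEnd: number } // end is exclusive
interface Segment { text: string; highlighted: boolean }

// Splits the source into consecutive segments. Concatenating the segment
// texts always reproduces the source exactly, so offsets cannot drift.
function segmentForHighlight(source: string, spans: Span[]): Segment[] {
  const clamped = spans
    .map((s) => ({
      charStart: Math.max(0, s.charStart),
      charEnd: Math.min(source.length, s.charEnd),
    }))
    .filter((s) => s.charStart < s.charEnd)
    .sort((a, b) => a.charStart - b.charStart);

  const segments: Segment[] = [];
  let cursor = 0;
  for (const s of clamped) {
    // Skip any part of this span that an earlier span already emitted.
    const start = Math.max(s.charStart, cursor);
    if (start >= s.charEnd) continue;
    if (start > cursor) {
      segments.push({ text: source.slice(cursor, start), highlighted: false });
    }
    segments.push({ text: source.slice(start, s.charEnd), highlighted: true });
    cursor = s.charEnd;
  }
  if (cursor < source.length) {
    segments.push({ text: source.slice(cursor), highlighted: false });
  }
  return segments;
}
```

An empty span list yields a single unhighlighted segment, which matches the fallback of rendering the source with no highlight when a chunk has no locator yet.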

W2. PDF list/table chunk granularity

In scope this round (reconsidered in at red-team — measuring bad boxes without fixing the cause is half a loop). Long enumerated lists / tables on the PDF-vision path get lumped into one chunk (FIA C18.6 = 72 regions). This is now a three-way problem:

legibility — 72 scattered boxes make a "section" look like the whole page is covered;
retrieval — a 72-item chunk is too coarse to rank well;
ingestion-blocking (new — observed live 2026-06-17, GitHub #68) — a large table over-grouped into a single chunk exceeds the embedding model's input-token limit (~8192). OpenAI rejects the whole 64-chunk embed batch atomically, so finalize never sees all-embedded and retries every ~10 min forever → finalize DLQ → the doc never ingests. This escalates W2 from a quality concern to a correctness one for table-heavy technical docs.
Approach (locked, D8): spike the detection heuristic first. A wave-1 spike on the vision path — detect a splittable list/table from layout signals (runs of short paragraphs, bullet/number prefixes, x-aligned word columns) on the real FIA doc — before committing the chunker change. De-risks re-introducing over-chunking elsewhere; mirrors the S2 markdown table-chunking spike.
Then split over-grouped lists/tables into finer chunks on the vision path — the PDF-vision analogue of the markdown table-awareness (which covers only the markdown path today).
Verified by W3: the structural-sanity scorecard confirms the fix (region count ↓, fill ↑, no chunk over the embedding-token limit) — closed loop.
Scope boundary (GitHub #68): #68 splits the fix into two layers — layer 1, defensive embed robustness (ingestion/embed.ts: isolate/truncate the oversized input so the batch + doc still finalize) is the immediate beta-unblock, tracked in #68 outside this epic; layer 2 is this W2 (stop producing oversized chunks — the root cause). Don't pull the embed.ts fix into this epic; the W3 token-size canary is the shared early-warning for both.
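As a starting point for the D8 spike, two of the named layout signals (runs of short paragraphs, bullet/number prefixes) can be prototyped on paragraph text alone. This is only a sketch: the prefix patterns, minRun, and maxChars are placeholder assumptions to be tuned on the real FIA doc, and the x-aligned word-column signal is not covered because it needs word boxes.

```typescript
// Assumed prefix patterns: "-", "*", "•", "1.", "2)", "(3)", "a)", "(b)".
const LIST_PREFIX = /^\s*(?:[-*•]|\(?\d+[.)]|\(?[a-z]\))\s+/i;

interface ListRun { start: number; end: number } // inclusive paragraph indices

// Finds runs of at least `minRun` consecutive paragraphs that are short and
// start with a list-like prefix. Such a run is a candidate for splitting
// into finer chunks instead of one over-grouped chunk.
function findListRuns(
  paragraphs: string[],
  minRun = 3, // placeholder threshold
  maxChars = 200, // placeholder threshold
): ListRun[] {
  const isItem = (text: string) =>
    text.length <= maxChars && LIST_PREFIX.test(text);

  const runs: ListRun[] = [];
  let runStart = -1;
  // Iterate one past the end so a run that reaches the last paragraph closes.
  for (let i = 0; i <= paragraphs.length; i++) {
    const p = paragraphs[i];
    const item = p !== undefined && isItem(p);
    if (item && runStart < 0) runStart = i;
    if (!item && runStart >= 0) {
      if (i - runStart >= minRun) runs.push({ start: runStart, end: i - 1 });
      runStart = -1;
    }
  }
  return runs;
}
```

Requiring a minimum run length is one way to limit the over-chunking risk: an isolated bullet or a numbered heading does not trigger a split.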

W3. Bbox quality eval in the harness

Make bbox + text-range quality measurable so it can't silently regress (the trust layer needs a gate, not vibes). Slots into the eval harness as a new gold-free dimension (evaluateStructuralSanityKb), mirroring evaluateStructuredFilterKb: pure-scorer + DB-half + run-retrieval.ts print.

Metrics are invariants / self-consistency checks, NOT corpus baselines (D4):
- region/out-of-bounds: any coordinate outside [0,1] → zero tolerance;
- cross-chunk overlap: two chunks claiming the same pixels → should be ~0, flag any meaningful overlap (epsilon for rounding) — encodes the already-shipped column fix as an invariant;
- tightness/fill: box area vs the actual word-ink inside it (D6 stamps the true ratio at ingest) → flag loose boxes covering whitespace (the over-grouping symptom);
- fragmentation/region-count: one chunk with many scattered regions (the 72-region case) → the over-grouping smell;
- chunk token size (new — the canary for the 2026-06-17 ingestion failure): flag chunks whose token count approaches/exceeds the embedder's input limit (~8192). Gold-free, needs only the chunk content (no bbox), so it lands in S1 — and it's the cheapest early-warning that an over-grouped chunk will hard-fail embedding before W2 fixes the root cause.
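Covers both locator kinds (D5): text-range analogues are range-in-bounds, range-length vs chunk-content-length (catches over-covering, incl. the non-contiguous-paragraph_ids case), and no cross-chunk range overlap. S4 refines the length check to "range covers the chunk's paragraphs" rather than an exact length match, because chunk content differs from the source substring (separators + stripHeadingPrefix).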
Report-first, not a hard gate (D7): print flagged chunks + per-issue tallies; promote to a CI gate once thresholds are stable.
Numeric thresholds (fill floor, overlap epsilon, token-size ceiling) from geometric/model first principles, sanity-checked by eyeballing known cases — corpus = examples, not ground truth. IoU-vs-gold on a curated set stays deferred unless the cheap metrics stop discriminating.
Fits the closed-loop eval pattern: change the chunker (W2) → re-run the sanity scorecard → confirm the fix / no regression.
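The pixel-side invariants and the token-size canary can be sketched as a pure scorer plus a tally for the report. Everything named here is hypothetical (it is not evaluateStructuralSanityKb); the thresholds are caller-supplied placeholders, and approxTokens uses a rough 4-characters-per-token estimate as a stand-in for the embedder's tokenizer.

```typescript
interface Region { page: number; x: number; y: number; w: number; h: number }
interface ChunkForSanity {
  chunkId: string;
  regions: Region[];
  content: string;
  fillRatio?: number; // stamped at ingest per D6, when available
}
interface Thresholds {
  maxRegions: number;
  fillFloor: number;
  overlapEpsilon: number;
  tokenCeiling: number;
}
type IssueKind =
  | "out-of-bounds" | "cross-chunk-overlap" | "loose-fill" | "fragmented" | "token-size";
interface SanityIssue { chunkId: string; kind: IssueKind }

const EPS = 1e-9; // slack for float rounding only

function isOutOfBounds(r: Region): boolean {
  return r.x < -EPS || r.y < -EPS || r.w < 0 || r.h < 0 ||
    r.x + r.w > 1 + EPS || r.y + r.h > 1 + EPS;
}

// Intersection area of two regions; zero when on different pages or disjoint.
function overlapArea(a: Region, b: Region): number {
  if (a.page !== b.page) return 0;
  const w = Math.min(a.x + a.w, b.x + b.w) - Math.max(a.x, b.x);
  const h = Math.min(a.y + a.h, b.y + b.h) - Math.max(a.y, b.y);
  return w > 0 && h > 0 ? w * h : 0;
}

// Word-ink area over union-box area; assumes word boxes do not overlap.
function fillRatio(wordBoxes: Region[], unionRegions: Region[]): number {
  const area = (rs: Region[]) => rs.reduce((sum, r) => sum + r.w * r.h, 0);
  const boxArea = area(unionRegions);
  return boxArea > 0 ? area(wordBoxes) / boxArea : 0;
}

// Rough size estimate only; an assumption, not the embedder's token count.
function approxTokens(content: string): number {
  return Math.ceil(content.length / 4);
}

function structuralSanityIssues(chunks: ChunkForSanity[], t: Thresholds): SanityIssue[] {
  const issues: SanityIssue[] = [];
  chunks.forEach((c, i) => {
    const flag = (kind: IssueKind) => issues.push({ chunkId: c.chunkId, kind });
    if (c.regions.some(isOutOfBounds)) flag("out-of-bounds");
    if (c.regions.length > t.maxRegions) flag("fragmented");
    if (c.fillRatio !== undefined && c.fillRatio < t.fillFloor) flag("loose-fill");
    if (approxTokens(c.content) > t.tokenCeiling) flag("token-size");
    for (const other of chunks.slice(i + 1)) {
      const shared = c.regions.reduce(
        (sum, a) => sum + other.regions.reduce((s, b) => s + overlapArea(a, b), 0),
        0,
      );
      if (shared > t.overlapEpsilon) flag("cross-chunk-overlap");
    }
  });
  return issues;
}

// Per-issue tallies for the warning report (D7).
function tallyIssues(issues: SanityIssue[]): Record<string, number> {
  const tally: Record<string, number> = {};
  for (const i of issues) tally[i.kind] = (tally[i.kind] ?? 0) + 1;
  return tally;
}
```

Keeping the scorer pure (chunks in, issues out) leaves the DB read and the printed scorecard as thin wrappers, and lets known cases such as a 72-region chunk be pinned as unit tests.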

Roadmap Fit

This is a quality / trust-layer epic, NOT on the Brehob go-live critical path (brehob-launch: item 1 ingestion-foundation → 5a vend brehob-prod → curated corpus → item 4 QuoteAI vertical → 5b SSO+hardening → UAT → go-live). Brehob go-live does not strictly require W1–W3.

Next up is NOT this epic — next is the deploy + cloud E2E, then the Brehob critical-path items.
Caveat (2026-06-17): W2's over-grouping has an ingestion-blocking dimension — a large-table chunk exceeded the embedder's token limit on a live FIA upload, so the doc failed to ingest. If the Brehob corpus (8,618 .doc files, equipment/technical docs) is table-heavy, this can intersect the critical-path corpus load (item 1 → curated corpus). So W2 isn't purely polish — flag it for the corpus-load go/no-go, and pull S1+S2 earlier if a curated-corpus dry run trips the token-size canary.
Sequence as a parallel quality investment otherwise, prioritized when inspector polish rises (before broader beta, or when a customer's corpus is docx/xlsx-heavy and source legibility for those types becomes load-bearing). The two acute fixes already shipped buy the runway to schedule the rest deliberately.
Scoped + story-split 2026-06-16 (red/blue-team) — see Decisions and Story split. Build the stories in order (S1→S4) when the epic is picked up.

Open questions — resolved by the 2026-06-16 red-team

The original open questions are resolved (see Decisions):

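W1 offset stamping (where, and how prose groupings map to spans) → D3: at ingest, into the full normalized markdown; both routes via paragraph_ids → the existing spanForParagraphKeys logic. (Later refined by D9–D11: the canonical string is built from the paragraph stream, and the chunk's covered spans are stored rather than a single min..max range.)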
W2 detection heuristic (without re-introducing over-chunking) → D8: spike first; specifics settled in the spike.
W3 thresholds (what's an "outlier" / "oversized") → D4: invariant-based, geometric first principles, not corpus-calibrated; IoU-vs-gold still deferred.
Cross-cutting source model (unify vs two modes) → D1: two optional locators keyed by the existing sourceType.

Build-time items (small, non-blocking):

L1: confirm an xlsx line-item row chunk's char-range lands on the correct markdown table row (S2 per-row chunks carry meta.cells).
L2: non-contiguous LLM paragraph_ids would make a min..max range over-cover. Low risk (groupings are contiguous runs); D5's range-length check catches it, and D11 removes it at the source by storing the covered spans.
S1/D6 sequencing: S1 ships region-only invariants + an area-vs-content proxy; the precise fill-ratio stamp (D6) rides the first story that touches ingest (S2 or S3).

Related

Acute fixes + diagnosis: decisions.md (autri-platform), 2026-06-16 "local-app shakedown" entry.
Concrete prod failure: GitHub issue #68 — oversized chunk (>8192 tokens) fails its whole embed batch → doc stuck, never finalizes (found in the 2026-06-17 post-deploy cloud E2E on the FIA regs). This epic owns the chunking root cause (W2, layer 2); the defensive embed-robustness fix (layer 1, ingestion/embed.ts) is tracked in #68, outside this epic.
Separate (doc-management, not legibility): silent doc-delete failure + re-upload dedup — tracked as its own task, out of scope here.
