Unified Chunking & the Markdown Representation Layer
Sub-system design doc. How Autri turns ANY document — structured, prose, or verse — into retrievable chunks through ONE code path over a canonical, bbox-carrying intermediate representation (IR), with markdown as an input adapter and a derived display/substrate view — never the thing chunking reads.
Status: DRAFT for red-team. Created 2026-06-03 from the prose/markdown spike (see status/next.md and sub-systems/pipeline-eval-harness.md). Three doc classes validated end-to-end this session; the open questions below are the red-team targets.
Architecture
Overview
Risks & Constraints
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Semantic grouping doesn't beat naive sentence-windowing on retrieval, yet costs embedding-at-structure-time complexity | Medium | High — could be wasted architecture | A/B grouping vs windowing on prose gold BEFORE wiring; gate the pipeline change on the result (Bucket 2) |
| Embedding-at-structure-time reorders the pipeline + ~2× embed cost on prose docs | High | Medium | Embed once at the retrieval unit where possible; measure on real docs; prose-only path |
| Paragraph segmentation collapses indent-separated PDF prose into one page-blob | Confirmed (NASA PDF: 24 "paras" for 20 pages) | High — starves the chunker | Indent/sentence-aware paragraph splitting before prose chunking is trusted (Bucket 3) |
| Markdown-as-canonical silently drops bbox provenance → inspector overlay breaks | Confirmed (0/41 chunks kept bbox via literal-markdown path) | High | Architecture commits IR canonical; markdown derived, never the chunker's input |
| Prose retrieval noisier than authored-structure (character-name flooding, diffuse facts) | Confirmed (0.80 vs 1.0 recall@10) | Medium | Hybrid lexical+vector retrieval + prose-gold passage-set anchoring (deferred follow-ons) |
Current Status
| Capability | Status |
|---|---|
| Canonical IR (PageParagraphs: text + bbox + structure) | Shipped |
| Structured docs → deterministic section chunk on the IR | Shipped (default on; Constitution, FIA/STEM regs) |
| Markdown as an input adapter (parse-markdown.ts) | Shipped (Constitution) |
| Prose → deterministic semantic chunk (semantic-chunk.ts) | Prototyped, not wired; PROSE falls to the LLM route in prod |
| Verse → deterministic authored-boundary chunk | Prototyped, not wired; FLAT-VERSE falls to the LLM route in prod |
| bbox-preserving tag-in-place chunking (Path 2) | Proven at IR level, not wired |
| Markdown as a derived output / display / substrate view + provenance | Planned (this doc, Bucket 4) |
| Hybrid lexical+vector retrieval | Planned (deferred follow-on) |
The Story
Autri's chunking started as an LLM grouping pass (an LLM decides chunk boundaries). The pipeline-eval harness then let us measure alternatives, and a sequence of results reframed the architecture:
- Structured docs don't need the LLM. Their boundaries are printed on the page (numbering geometry / authored headings). segment.ts reads them deterministically → recall@10 1.0, MRR up, chunk count down, $0. (Shipped.)
- "Markdown-first" reframe, resolved by measurement. Markdown is the right normalized form for authored-structure docs, but produced by code, not an LLM (an LLM PDF→markdown pass is slower/costlier/worse for text-layer docs). Chunking on authored boundaries was the win; markdown was the encoding of those boundaries, not the cause.
- Prose (this session). Prose has no authored boundaries, so structure must be inferred: embed paragraphs → group at adjacency-similarity valleys → cap size. Validated on a novel (recall@10 0.80; misses diagnosed as real prose characteristics, not chunker bugs).
- The bbox finding (this session). Serializing prose to literal markdown text and re-parsing it drops bbox (0/41 chunks kept geometry on a real PDF). Tagging the existing IR in place keeps it (41/41). This is the load-bearing architectural fork: markdown belongs at the edges, not the middle.
- The IR was the unification all along. The "one shape for every doc class" is the tagged IR (PageParagraphs with a section_id per paragraph feeding one chunkDeterministically), not markdown. Markdown is a derived projection of the IR.
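The prose step in the list above (embed paragraphs → group at adjacency-similarity valleys → cap size) can be sketched roughly as follows. This is a minimal illustration, not the contents of semantic-chunk.ts: the function names, the valleyThreshold cutoff, the maxChars cap, and the toy 2-d embeddings are all assumptions made for the example.

```typescript
// Cosine similarity between two vectors (missing components count as 0).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((v, i) => {
    const w = b[i] ?? 0;
    dot += v * w;
    normA += v * v;
    normB += w * w;
  });
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns groups of paragraph indices. A boundary is placed after paragraph i
// when the similarity between i and i+1 is a local minimum below the threshold
// (a "valley"), or before a paragraph that would push the group past maxChars.
function groupAtValleys(
  texts: string[],
  embeddings: number[][],
  maxChars: number, // hypothetical size cap
  valleyThreshold: number, // hypothetical similarity cutoff
): number[][] {
  const sims: number[] = [];
  for (let i = 0; i + 1 < embeddings.length; i++) {
    sims.push(cosine(embeddings[i] ?? [], embeddings[i + 1] ?? []));
  }
  // Out-of-range neighbours never block a valley.
  const simAt = (i: number): number => sims[i] ?? Infinity;

  const groups: number[][] = [];
  let current: number[] = [];
  let size = 0;
  const flush = (): void => {
    if (current.length > 0) groups.push(current);
    current = [];
    size = 0;
  };

  texts.forEach((t, i) => {
    if (current.length > 0 && size + t.length > maxChars) flush(); // size cap
    current.push(i);
    size += t.length;
    const s = simAt(i);
    if (s < valleyThreshold && s <= simAt(i - 1) && s <= simAt(i + 1)) flush();
  });
  flush();
  return groups;
}

const demoGroups = groupAtValleys(
  ["Rain fell.", "It rained on.", "Taxes rose.", "Rates rose too."],
  [[1, 0], [1, 0], [0, 1], [0, 1]], // toy embeddings: topic shift after index 1
  1000,
  0.5,
);
console.log(demoGroups); // [[0, 1], [2, 3]]
```

The groups are index lists rather than joined strings, so a caller can write one section_id per paragraph back onto the IR instead of serializing text.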
What Is This Sub-system?
This sub-system owns the path from a parsed document to retrievable chunks, plus the canonical representation that path operates on. Its thesis: a single intermediate representation (the IR) carries text + page geometry (bbox) + structure for every document; per-class boundary detection tags semantic units onto that IR; and one deterministic chunker turns the tagged IR into chunks. The LLM is the fallback (geometry-hostile / scanned docs only), not the engine. Markdown is two things at the edges of this sub-system — an input adapter (when a doc arrives as markdown) and a derived display/substrate view (the legible artifact the inspector, MCP, and QuoteAI consume) — but it is never the canonical store and never what the chunker reads, because markdown text cannot carry bbox.
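To make the thesis concrete, here is a minimal sketch of what "tagging the IR in place" could look like. Only the names PageParagraphs, Paragraph, bbox, and meta.section_id come from this doc; the field layout, the tagInPlace helper, and the toy heading detector are assumptions for illustration.

```typescript
// Hypothetical field layout for the IR.
type BBox = { x: number; y: number; width: number; height: number };

interface Paragraph {
  text: string;
  bbox: BBox; // page geometry, never dropped
  meta: { section_id?: string }; // written by boundary detection
}

interface PageParagraphs {
  page: number;
  paragraphs: Paragraph[];
}

// Boundary detection mutates meta.section_id and nothing else, so text and
// bbox stay attached to the same object for every downstream derivation.
function tagInPlace(pages: PageParagraphs[], detect: (p: Paragraph) => string): void {
  for (const pg of pages) {
    for (const p of pg.paragraphs) {
      p.meta.section_id = detect(p);
    }
  }
}

const demoPages: PageParagraphs[] = [
  {
    page: 1,
    paragraphs: [
      { text: "Section 1", bbox: { x: 72, y: 72, width: 200, height: 14 }, meta: {} },
      { text: "All legislative Powers...", bbox: { x: 72, y: 92, width: 468, height: 28 }, meta: {} },
      { text: "Section 2", bbox: { x: 72, y: 130, width: 200, height: 14 }, meta: {} },
    ],
  },
];

// Toy detector (an assumption): a "Section N" line opens a new section.
let currentSection = "front-matter";
tagInPlace(demoPages, (p) => {
  const m = /^Section (\d+)$/.exec(p.text);
  if (m) currentSection = `sec-${m[1]}`;
  return currentSection;
});
```

A per-class detector (numbering geometry, ATX headings, verse numbers, similarity valleys) would slot in as the detect argument; the IR objects themselves never change shape.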
Validation Evidence — three classes, one path
All three classes validated this session through the unified IR-tagging path (chunkDeterministically), $0 LLM:
| Class | Boundary source | Embeddings? | Result |
|---|---|---|---|
| Structured (US Constitution) | authored numbering / headings | No | recall@10 1.0, $0, green |
| Verse (Genesis, WEB) | authored verse numbers | No | 1524 verses → 129 chunks, 129/129 keep bbox, tight sizes (med 1453 char) |
| Prose (novel, ~88k words) | inferred (similarity valleys) | Yes | recall@10 0.80; runaway chunk killed; misses = real prose traits |
Key cross-class finding: authored boundaries (structured + verse) need zero embeddings; only prose needs semantic grouping. This directly shapes Bucket 1/2.
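For readers unfamiliar with the metrics in the table, the sketch below computes recall@k and MRR under their standard definitions. It is an assumption that the eval harness uses exactly these definitions; the function names and the chunk ids are hypothetical.

```typescript
// recall@k: fraction of the gold (relevant) chunk ids found in the top k.
// Assumes ranked ids are unique.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// Reciprocal rank: 1 / (1-based position of the first relevant id), else 0.
function reciprocalRank(ranked: string[], relevant: Set<string>): number {
  const idx = ranked.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Two toy queries with made-up chunk ids.
const demoQueries = [
  { ranked: ["c1", "c7", "c3"], relevant: new Set(["c3"]) },
  { ranked: ["c2", "c9"], relevant: new Set(["c2", "c5"]) },
];

const demoRecall = mean(demoQueries.map((q) => recallAtK(q.ranked, q.relevant, 10)));
const demoMrr = mean(demoQueries.map((q) => reciprocalRank(q.ranked, q.relevant)));
console.log(demoRecall, demoMrr); // 0.75 and roughly 0.667
```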
Architecture Diagram
source PDF / docx / md ──parse──▶ IR (PageParagraphs: text + bbox + structure)
(kept: "view original") │ ◀── CANONICAL (single source of truth)
│
boundary detection ─┤ tags meta.section_id onto each paragraph, by class:
│ • structured-PDF → numbering geometry (segment.ts) [shipped]
│ • structured-md → ATX headings (parse-markdown.ts) [shipped]
│ • verse → authored verse numbers (no embeds) [proto]
│ • prose → semantic valleys (needs embeds) [proto]
│ • fallback → LLM grouping (scanned/geometry-hostile)
│
├──▶ chunks (text + bbox + section) ──▶ embed ──▶ retrieval
├──▶ markdown VIEW (tables, mermaid) ──▶ inspector / MCP / QuoteAI [planned]
└──▶ bbox overlay map ──▶ inspector highlight-on-source
One chunker (chunkDeterministically) consumes the tagged IR for ALL classes. Markdown and the overlay are derivations of the IR, not inputs to chunking.
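The "one chunker over the tagged IR" idea can be illustrated with a small stand-in. This is not the real chunkDeterministically; chunkBySection, the TaggedParagraph and Chunk shapes, and the sample data are assumptions. It only shows why a class-agnostic chunker can keep bbox: it groups already-tagged paragraphs and carries their regions along.

```typescript
type BBox = { x: number; y: number; width: number; height: number };

// Hypothetical flattened form of a tagged IR paragraph.
interface TaggedParagraph {
  page: number;
  text: string;
  bbox: BBox;
  sectionId: string;
}

interface Chunk {
  sectionId: string;
  text: string;
  regions: { page: number; bbox: BBox }[]; // source geometry for the overlay
}

// Consecutive paragraphs with the same section id become one chunk.
// The chunker never asks which doc class produced the tags.
function chunkBySection(tagged: TaggedParagraph[]): Chunk[] {
  const chunks: Chunk[] = [];
  for (const p of tagged) {
    const last = chunks[chunks.length - 1];
    if (last && last.sectionId === p.sectionId) {
      last.text += "\n\n" + p.text;
      last.regions.push({ page: p.page, bbox: p.bbox });
    } else {
      chunks.push({
        sectionId: p.sectionId,
        text: p.text,
        regions: [{ page: p.page, bbox: p.bbox }],
      });
    }
  }
  return chunks;
}

const demoTagged: TaggedParagraph[] = [
  { page: 1, text: "Section 1", bbox: { x: 72, y: 72, width: 200, height: 14 }, sectionId: "sec-1" },
  { page: 2, text: "Body of section 1.", bbox: { x: 72, y: 72, width: 468, height: 28 }, sectionId: "sec-1" },
  { page: 2, text: "Section 2", bbox: { x: 72, y: 120, width: 200, height: 14 }, sectionId: "sec-2" },
];

const demoChunks = chunkBySection(demoTagged);
console.log(demoChunks.map((c) => `${c.sectionId}: ${c.regions.length} region(s)`));
```

Because a chunk holds a list of regions rather than one box, a section that spans a page break still highlights correctly on both pages.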
System Boundary
Owns: the IR schema (PageParagraphs/Paragraph + bbox + meta.section_id); per-class boundary detection (segment.ts, parse-markdown.ts, semantic-chunk.ts, verse tagging); the deterministic chunker (chunkDeterministically); the route decision (route.ts); the derived markdown view + provenance map (planned).
Outside the boundary: raw parsing/rendering (parse.ts, render.ts); the embedder (embed.ts); retrieval/ranking (@autri/retrieval); the inspector UI; the MCP servers. These are consumers/producers that interface with the IR but aren't part of the chunking decision.
Key Interfaces
| Interface | Type | Consumers |
|---|---|---|
| PageParagraphs (the IR, with bbox + meta.section_id) | Cached JSON artifact | chunker, markdown view, bbox overlay, inspector |
| toTaggedParagraphs(pages) → TaggedParagraph[] | Function | the single chunker entry point |
| chunkDeterministically(tagged) → PageGrouping | Function | writeExtraction (chunk rows + bbox) |
| routeUnit({sectionAware, docClass, coverage}) | Function | extractor (code vs LLM fallback) |
| markdown view (derived) + provenance map | Artifact (planned) | inspector display, MCP, QuoteAI |
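As an illustration of the route decision, the sketch below shows one plausible shape for a code-vs-LLM router. Only the parameter names sectionAware, docClass, and coverage and the class labels PROSE and FLAT-VERSE appear in this doc. The other class labels, the meaning of coverage (taken here as the fraction of paragraphs that received a section tag), the 0.9 threshold, and the rules themselves are assumptions, not the behavior of route.ts.

```typescript
// "STRUCTURED" and "SCANNED" are hypothetical labels for this sketch.
type DocClass = "STRUCTURED" | "FLAT-VERSE" | "PROSE" | "SCANNED";
type Route = "code" | "llm";

interface RouteInput {
  sectionAware: boolean; // did boundary detection find usable structure?
  docClass: DocClass;
  coverage: number; // assumed: fraction of paragraphs tagged, 0..1
}

const MIN_COVERAGE = 0.9; // hypothetical threshold

// LLM is the fallback: code handles any class whose boundaries were detected
// with enough coverage; scanned or geometry-hostile input goes to the LLM.
function routeUnitSketch(input: RouteInput): Route {
  if (input.docClass === "SCANNED") return "llm";
  if (!input.sectionAware) return "llm";
  return input.coverage >= MIN_COVERAGE ? "code" : "llm";
}

console.log(routeUnitSketch({ sectionAware: true, docClass: "PROSE", coverage: 0.95 }));
console.log(routeUnitSketch({ sectionAware: true, docClass: "SCANNED", coverage: 1 }));
```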
Design Buckets (Scope)
The work splits into four buckets, ordered by how much design (vs wiring) each needs.
Bucket 1 — Wire deterministic prose + verse onto the IR (mostly wiring; one real snag).
Route PROSE and FLAT-VERSE to code (tag the IR in place → chunkDeterministically), making the LLM a true fallback. Verse is free (authored boundaries, no embeddings). The snag: prose semantic grouping needs paragraph embeddings, but the pipeline doesn't embed until after chunking (parse → structure → chunk → embed). Wiring prose requires either reordering (embed paragraphs at structure-time) or a dedicated grouping phase — an architectural change with a cost (prose docs embedded ~twice). Use Path 2 (tag IR in place), never the literal-markdown round-trip.
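To show why verse needs no embeddings, here is a minimal sketch that reads authored verse numbers and packs consecutive verses up to a size cap. The doc reports the outcome (1524 verses → 129 chunks) but not the packing rule, so the cap-based grouping, the function names, and the line format are assumptions. A prose line that happens to start with a number would be misread as a verse by this naive parser.

```typescript
interface Verse {
  ref: string; // the printed verse number
  text: string;
}

// A line that starts with digits opens a new verse; other lines continue
// the previous verse.
function parseVerses(lines: string[]): Verse[] {
  const verses: Verse[] = [];
  for (const raw of lines) {
    const line = raw.trim();
    const m = /^(\d+)\s+(.*)$/.exec(line);
    if (m) {
      verses.push({ ref: m[1] ?? "", text: m[2] ?? "" });
    } else {
      const last = verses[verses.length - 1];
      if (last) last.text += " " + line;
    }
  }
  return verses;
}

// Pack whole verses into chunks without exceeding maxChars (hypothetical cap).
// Boundaries always fall between verses, never inside one.
function packVerses(verses: Verse[], maxChars: number): Verse[][] {
  const chunks: Verse[][] = [];
  let current: Verse[] = [];
  let size = 0;
  for (const v of verses) {
    if (current.length > 0 && size + v.text.length > maxChars) {
      chunks.push(current);
      current = [];
      size = 0;
    }
    current.push(v);
    size += v.text.length;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

const demoVerses = parseVerses([
  "1 In the beginning",
  "2 And the earth",
  "was waste",
  "3 God said",
]);
const demoVerseChunks = packVerses(demoVerses, 40);
console.log(demoVerseChunks.map((c) => c.map((v) => v.ref).join("+"))); // ["1+2", "3"]
```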
Bucket 2 — Prove semantic grouping earns its complexity (the decisive A/B). We proved valley-grouping makes plausible boundaries; we have not proven it beats a naive sentence-window-with-overlap baseline on retrieval. Windowing needs no embedding-at-structure-time (kills Bucket 1's snag). Commit to A/B grouping vs windowing on the prose gold before paying for the embedding-phase change. If windowing ties, prose collapses to the verse-simple case.
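The windowing baseline for this A/B could be as small as the sketch below. The sentence splitter is deliberately naive (it breaks on abbreviations such as "Mr."), and the function names and the window and overlap sizes are assumptions; the point is that nothing here needs an embedding.

```typescript
// Naive splitter: a sentence ends at a run of . ! or ?; trailing text without
// a terminator becomes a final sentence.
function splitSentences(text: string): string[] {
  const matches: string[] = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Fixed-size windows of `size` sentences, each sharing `overlap` sentences
// with the previous window.
function sentenceWindows(sentences: string[], size: number, overlap: number): string[] {
  if (size < 1 || overlap < 0 || overlap >= size) {
    throw new Error("need size >= 1 and 0 <= overlap < size");
  }
  const step = size - overlap;
  const out: string[] = [];
  for (let start = 0; start < sentences.length; start += step) {
    out.push(sentences.slice(start, start + size).join(" "));
    if (start + size >= sentences.length) break; // last window reached the end
  }
  return out;
}

const demoSentences = splitSentences("One. Two. Three. Four. Five.");
const demoWindows = sentenceWindows(demoSentences, 3, 1);
console.log(demoWindows); // ["One. Two. Three.", "Three. Four. Five."]
```

Running both this and valley grouping over the same prose gold, with the same embedder and retrieval settings, is what would make the comparison fair.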
Bucket 3 — Paragraph granularity (net-new).
structurePage's gap-based splitter collapses indent-separated PDF prose into one blob per page (confirmed on the NASA PDF). Prose chunking needs finer units: indent-aware paragraph splitting and/or sentence segmentation. Gates prose-PDF quality regardless of Bucket 2.
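One way indent-aware splitting might work is sketched below: treat the smallest left edge on the page as the body margin and start a new paragraph at any line indented past it. This is an illustration, not a patch to structurePage. The PdfLine shape, the minIndent threshold, and the assumption of first-line indents over a flush-left body are all hypothetical, and hanging indents, centered text, and block quotes would need extra handling.

```typescript
// Hypothetical line record from the PDF text layer.
interface PdfLine {
  text: string;
  x: number; // left edge
  y: number; // baseline position (unused here; kept so geometry travels along)
}

// Lines in reading order in, paragraphs (arrays of lines) out.
function splitByIndent(lines: PdfLine[], minIndent: number): PdfLine[][] {
  if (lines.length === 0) return [];
  const margin = Math.min(...lines.map((l) => l.x)); // body left margin
  const paragraphs: PdfLine[][] = [];
  let current: PdfLine[] = [];
  for (const line of lines) {
    const indented = line.x - margin >= minIndent;
    if (indented && current.length > 0) {
      paragraphs.push(current); // an indented line opens a new paragraph
      current = [];
    }
    current.push(line);
  }
  paragraphs.push(current);
  return paragraphs;
}

const demoLines: PdfLine[] = [
  { text: "The first paragraph opens", x: 90, y: 700 },
  { text: "and continues flush left", x: 72, y: 686 },
  { text: "for one more line.", x: 72, y: 672 },
  { text: "A second paragraph opens", x: 90, y: 658 },
  { text: "with no vertical gap.", x: 72, y: 644 },
];

const demoParagraphs = splitByIndent(demoLines, 10);
console.log(demoParagraphs.map((p) => p.length)); // [3, 2]
```

The example lines are evenly spaced vertically, which is the case where a gap-only rule sees a single paragraph.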
Bucket 4 — The markdown derived-view + provenance (net-new; the substrate play).
Serialize the IR → legible markdown (native tables, mermaid for diagrams) as the display/MCP/QuoteAI artifact, with a provenance map so each markdown span highlights its bbox region on the source. The strategic vision.md / platform-thesis piece. Touches no chunking; this is the "what do we do with markdown" half.
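As a sketch of one provenance option (a parallel offset→bbox map; the representation is still an open question in this doc), the code below renders IR paragraphs to markdown while recording which character range came from which source region. The IrParagraph shape, the heading flag, and the function names are assumptions, and tables and figures are not handled.

```typescript
type BBox = { x: number; y: number; width: number; height: number };

// Hypothetical minimal IR paragraph for rendering.
interface IrParagraph {
  page: number;
  text: string;
  bbox: BBox;
  heading?: boolean;
}

// Half-open character range [start, end) in the markdown and its source region.
interface ProvSpan {
  start: number;
  end: number;
  page: number;
  bbox: BBox;
}

function renderWithProvenance(paras: IrParagraph[]): { markdown: string; spans: ProvSpan[] } {
  let markdown = "";
  const spans: ProvSpan[] = [];
  for (const p of paras) {
    if (markdown.length > 0) markdown += "\n\n"; // separators belong to no span
    const start = markdown.length;
    markdown += p.heading ? `## ${p.text}` : p.text;
    spans.push({ start, end: markdown.length, page: p.page, bbox: p.bbox });
  }
  return { markdown, spans };
}

// Look up the source region for a character offset in the markdown view.
function regionAt(spans: ProvSpan[], offset: number): ProvSpan | undefined {
  return spans.find((s) => offset >= s.start && offset < s.end);
}

const demoView = renderWithProvenance([
  { page: 1, text: "Title", bbox: { x: 72, y: 72, width: 200, height: 18 }, heading: true },
  { page: 2, text: "Body text", bbox: { x: 72, y: 100, width: 468, height: 28 } },
]);

console.log(JSON.stringify(demoView.markdown)); // "## Title\n\nBody text"
console.log(regionAt(demoView.spans, 12)?.page); // 2
```

The map is derived from the IR at render time, so the markdown stays a projection: regenerating the view regenerates the spans, and no bbox ever has to be parsed back out of markdown text.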
Open Questions / Red-Team Targets
- Bucket 2 is the keystone: does semantic grouping beat naive sentence-windowing on retrieval? If not, most of Bucket 1's complexity evaporates. Red-team should demand this A/B before any pipeline reorder.
- Where does embedding-at-structure-time live, and is the ~2× prose embed cost acceptable? Can we embed once at the retrieval unit (feedback_retrieval_vs_display_granularity)?
- Prose retrieval is noisier than authored-structure (character-name flooding, facts diffuse across passages). Is hybrid lexical+vector retrieval in-scope here or a separate retrieval sub-system concern?
- Prose gold is chunker-specific (synthetic passage ids; re-chunk → regenerate). Should gold anchor to passage sets / should the harness give neighbor credit for prose? (pipeline-eval-harness.md.)
- Provenance representation (Bucket 4): how does a markdown span map back to bbox regions: inline anchors, a parallel offset→bbox map, or IR-paragraph ids? Tables/figures complicate it (markdown table cell ↔ source region).
- Markdown view scope: is it persisted, regenerated on demand, or both? Who owns it (this sub-system vs inspector)?
- Negative/precision queries are unscoreable under the current harness (no similarity threshold; logged in pipeline-eval-harness.md). Out of scope here but blocks precision claims.
Related Epics
To be created from this doc after red/blue-team. Candidates: "Wire deterministic prose+verse routing" (Bucket 1+3), "Markdown view + provenance" (Bucket 4).
| Epic | Doc | Status | Summary |
|---|---|---|---|
| (TBD — prose/verse routing) | — | Planned | Bucket 1 + 3: route non-structured to code, indent-aware paragraphs |
| (TBD — markdown view) | — | Planned | Bucket 4: derived markdown + provenance substrate |
Cross-Cutting Concerns
| Concern | How This Sub-system Is Affected |
|---|---|
| bbox provenance | The reason the IR is canonical; every derivation (chunks, overlay, markdown view) must preserve or re-link to it |
| Cost / Max-billed LLM | Authored-boundary classes are $0; prose adds an embedding pass — measure before committing (Bucket 2) |
| Eval harness | All design changes are gated by retrieval recall on per-class gold (pipeline-eval-harness.md, eval-corpus-and-doe.md) |
| Inspector (product wedge) | Markdown view + bbox overlay are how trust is shown; a markdown-canonical design would break the overlay |
| MCP / QuoteAI substrate | The derived markdown view is the external representation these consume (vision.md) |
Decisions Log
| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-06-03 | IR (PageParagraphs + bbox) is canonical; markdown is a derived view, never the chunker's input | Literal-markdown round-trip drops bbox (0/41 chunks kept geometry on a real PDF); the IR already holds text+bbox+structure together | Markdown-canonical + bbox sidecar — rejected: reconstructs the IR with extra steps |
| 2026-06-03 | One chunker (chunkDeterministically) over the tagged IR for all classes | Unification is the tagged IR, not markdown; proven on structured/verse/prose | Per-class chunkers; LLM-for-all (frozen baseline) |
| 2026-06-03 | Authored-boundary classes (structured, verse) chunk with zero embeddings; only prose needs semantic grouping | Verse numbers / numbering are printed boundaries; measured $0 + bbox-preserving on Genesis & Constitution | Treat verse as prose (semantic grouping) — unnecessary cost |
| 2026-06-03 | Markdown stays at the edges: input adapter + derived display/substrate view | The chunking win is boundary-awareness, not markdown; markdown's value is legibility + external representation | Markdown as the pipeline intermediary — rejected (bbox loss, redundant re-parse) |
| 2026-06-03 (proposed) | A/B semantic grouping vs sentence-windowing before wiring prose | Avoid paying embedding-at-structure-time cost if windowing ties on retrieval | Wire grouping directly (assumes it wins — unproven) |
Known Issues / Tech Debt
| Issue | Severity | Notes |
|---|---|---|
| structurePage collapses indent-separated PDF prose into page-blobs | High | Bucket 3; confirmed on NASA PDF (24 paras / 20 pages) |
| Prose gold is chunker-specific (synthetic ids) | Medium | Re-chunk → regenerate; consider passage-set anchoring / neighbor credit |
| Pure-vector retrieval floods on salient entity tokens in prose | Medium | "who does X marry" pulls every passage naming X; needs hybrid lexical+vector |
| Negative/precision queries unscoreable | Medium | Harness has no similarity threshold (pipeline-eval-harness.md:retrieval-db.ts:44) |
| Chapter/parent-heading text chunks (e.g. "Chapter 1") embedded as low-value chunks | Low | Pre-existing behavior shared with the Constitution path |
Sub-system docs define architectural boundaries. This one defines how every doc class becomes retrievable chunks over the canonical IR, and where markdown lives (edges, not middle). Next: /hl:red-team this doc, then /hl:blue-team to scope, then epics.