Unified Chunking & the Markdown Representation Layer

Sub-system design doc. How Autri turns ANY document — structured, prose, or verse — into retrievable chunks through ONE code path over a canonical, bbox-carrying intermediate representation (IR), with markdown as an input adapter and a derived display/substrate view — never the thing chunking reads.

Status: DRAFT for red-team. Created 2026-06-03 from the prose/markdown spike (see status/next.md and sub-systems/pipeline-eval-harness.md). Three doc classes validated end-to-end this session; the open questions below are the red-team targets.

Architecture

Overview

Risks & Constraints

Risk	Likelihood	Impact	Mitigation
Semantic grouping doesn't beat naive sentence-windowing on retrieval, yet costs embedding-at-structure-time complexity	Medium	High — could be wasted architecture	A/B grouping vs windowing on prose gold BEFORE wiring; gate the pipeline change on the result (Bucket 2)
Embedding-at-structure-time reorders the pipeline + ~2× embed cost on prose docs	High	Medium	Embed once at the retrieval unit where possible; measure on real docs; prose-only path
Paragraph segmentation collapses indent-separated PDF prose into one page-blob	Confirmed (NASA PDF: 24 "paras" for 20 pages)	High — starves the chunker	Indent/sentence-aware paragraph splitting before prose chunking is trusted (Bucket 3)
Markdown-as-canonical silently drops bbox provenance → inspector overlay breaks	Confirmed (0/41 chunks kept bbox via literal-markdown path)	High	Architecture commits IR canonical; markdown derived, never the chunker's input
Prose retrieval noisier than authored-structure (character-name flooding, diffuse facts)	Confirmed (0.80 vs 1.0 recall@10)	Medium	Hybrid lexical+vector retrieval + prose-gold passage-set anchoring (deferred follow-ons)

Current Status

Capability	Status
Canonical IR (`PageParagraphs`: text + bbox + structure)	Shipped
Structured docs → deterministic section chunk on the IR	Shipped (default on; Constitution, FIA/STEM regs)
Markdown as an input adapter (`parse-markdown.ts`)	Shipped (Constitution)
Prose → deterministic semantic chunk (`semantic-chunk.ts`)	Prototyped — not wired; PROSE falls to the LLM route in prod
Verse → deterministic authored-boundary chunk	Prototyped — not wired; FLAT-VERSE falls to the LLM route in prod
bbox-preserving tag-in-place chunking (Path 2)	Proven at IR level — not wired
Markdown as a derived output / display / substrate view + provenance	Planned (this doc, Bucket 4)
Hybrid lexical+vector retrieval	Planned (deferred follow-on)

The Story

Autri's chunking started as an LLM grouping pass (an LLM decides chunk boundaries). The pipeline-eval harness then let us measure alternatives, and a sequence of results reframed the architecture:

Structured docs don't need the LLM. Their boundaries are printed on the page (numbering geometry / authored headings). segment.ts reads them deterministically → recall@10 1.0, MRR up, chunk count down, $0. (Shipped.)
"Markdown-first" reframe — resolved by measurement. Markdown is the right normalized form for authored-structure docs, but produced by code, not an LLM (an LLM PDF→markdown pass is slower/costlier/worse for text-layer docs). Chunking on authored boundaries was the win — markdown was the encoding of those boundaries, not the cause.
Prose (this session). Prose has no authored boundaries, so structure must be inferred: embed paragraphs → group at adjacency-similarity valleys → cap size. Validated on a novel (recall@10 0.80; misses diagnosed as real prose characteristics, not chunker bugs).
The bbox finding (this session). Serializing prose to literal markdown text and re-parsing it drops bbox (0/41 chunks kept geometry on a real PDF). Tagging the existing IR in place keeps it (41/41). This is the load-bearing architectural fork: markdown belongs at the edges, not the middle.
The IR was the unification all along. The "one shape for every doc class" is the tagged IR (PageParagraphs with a section_id per paragraph feeding one chunkDeterministically) — not markdown. Markdown is a derived projection of the IR.
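
The prose step described above (embed paragraphs, group at adjacency-similarity valleys, cap size) can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of semantic-chunk.ts: the name `groupAtValleys`, the "local minimum below a threshold" valley rule, the threshold of 0.5, and the 2000-character cap are all hypothetical.

```typescript
// Hypothetical sketch of valley grouping; NOT the real semantic-chunk.ts.
type Vec = number[];

function cosine(a: Vec, b: Vec): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Returns, for each paragraph, the index of the group it falls into.
function groupAtValleys(
  embeddings: Vec[],
  lengths: number[], // paragraph lengths in characters
  valleyThreshold = 0.5, // assumed value
  maxChars = 2000, // assumed size cap
): number[] {
  // sims[i] is the similarity between paragraph i and paragraph i + 1.
  const sims: number[] = [];
  for (let i = 0; i + 1 < embeddings.length; i++) {
    sims.push(cosine(embeddings[i], embeddings[i + 1]));
  }

  const groups: number[] = [];
  let group = 0;
  let size = 0;
  for (let i = 0; i < embeddings.length; i++) {
    if (i > 0) {
      const s = sims[i - 1];
      const left = i >= 2 ? sims[i - 2] : Infinity;
      const right = i < sims.length ? sims[i] : Infinity;
      // A valley: a low similarity that is also a local minimum.
      const isValley = s < valleyThreshold && s <= left && s <= right;
      // The cap forces a boundary even where no valley exists.
      const overCap = size + lengths[i] > maxChars;
      if (isValley || overCap) {
        group++;
        size = 0;
      }
    }
    groups.push(group);
    size += lengths[i];
  }
  return groups;
}
```

The returned group index plays the role a `section_id` plays for authored-boundary documents: it is a tag on existing paragraphs, so text and bbox are never re-serialized.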

What Is This Sub-system?

This sub-system owns the path from a parsed document to retrievable chunks, plus the canonical representation that path operates on. Its thesis: a single intermediate representation (the IR) carries text + page geometry (bbox) + structure for every document; per-class boundary detection tags semantic units onto that IR; and one deterministic chunker turns the tagged IR into chunks. The LLM is the fallback (geometry-hostile / scanned docs only), not the engine. Markdown is two things at the edges of this sub-system — an input adapter (when a doc arrives as markdown) and a derived display/substrate view (the legible artifact the inspector, MCP, and QuoteAI consume) — but it is never the canonical store and never what the chunker reads, because markdown text cannot carry bbox.
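
To make the "markdown text cannot carry bbox" point concrete, here is a minimal sketch contrasting the two paths. The interface shapes are assumptions for illustration (the real `PageParagraphs` schema may differ), and `toLiteralMarkdown` and `tagInPlace` are hypothetical helper names.

```typescript
// Hypothetical IR shapes; the real PageParagraphs schema may differ.
interface BBox { x0: number; y0: number; x1: number; y1: number }
interface Paragraph { text: string; bbox: BBox; meta: { section_id?: string } }
interface PageParagraphs { page: number; paragraphs: Paragraph[] }

// Literal-markdown path: the output is a plain string, so geometry has nowhere to live.
function toLiteralMarkdown(pages: PageParagraphs[]): string {
  const parts: string[] = [];
  for (const pg of pages) {
    for (const p of pg.paragraphs) parts.push(p.text);
  }
  return parts.join("\n\n");
}

// Tag-in-place path: only meta.section_id is added; text and bbox ride along untouched.
function tagInPlace(
  pages: PageParagraphs[],
  sectionOf: (p: Paragraph, page: number, index: number) => string,
): PageParagraphs[] {
  return pages.map((pg) => ({
    page: pg.page,
    paragraphs: pg.paragraphs.map((p, i) => ({
      ...p,
      meta: { ...p.meta, section_id: sectionOf(p, pg.page, i) },
    })),
  }));
}
```

Anything re-parsed from the string returned by `toLiteralMarkdown` has to rediscover geometry from scratch, whereas the output of `tagInPlace` still holds every original bbox.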

Validation Evidence — three classes, one path

All three classes validated this session through the unified IR-tagging path (chunkDeterministically), $0 LLM:

Class	Boundary source	Embeddings?	Result
Structured (US Constitution)	authored numbering / headings	No	recall@10 1.0, $0, green
Verse (Genesis, WEB)	authored verse numbers	No	1524 verses → 129 chunks, 129/129 keep bbox, tight sizes (med 1453 char)
Prose (novel, ~88k words)	inferred (similarity valleys)	Yes	recall@10 0.80; runaway chunk killed; misses = real prose traits

Key cross-class finding: authored boundaries (structured + verse) need zero embeddings; only prose needs semantic grouping. This directly shapes Bucket 1/2.
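
As an illustration of why authored boundaries need no embeddings, a verse chunker can be nothing more than a packer over the printed verse numbers. The sketch below is an assumption about how such packing might work (greedy fill up to a character budget, never crossing a chapter); the doc does not specify the actual rule, and `packVerses`, the `Verse` shape, and the 1500-character target are hypothetical.

```typescript
// Hypothetical sketch; the actual verse tagging rule is not specified in this doc.
interface Verse { chapter: number; verse: number; text: string }

// Greedily packs consecutive verses into chunks without crossing a chapter boundary.
function packVerses(verses: Verse[], targetChars = 1500): Verse[][] {
  const chunks: Verse[][] = [];
  let current: Verse[] = [];
  let size = 0;
  for (const v of verses) {
    const newChapter = current.length > 0 && current[0].chapter !== v.chapter;
    const overBudget = current.length > 0 && size + v.text.length > targetChars;
    if (newChapter || overBudget) {
      chunks.push(current);
      current = [];
      size = 0;
    }
    current.push(v);
    size += v.text.length;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

No similarity computation appears anywhere: every boundary decision reads only the authored numbering and a length counter.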

Architecture Diagram

  source PDF / docx / md ──parse──▶  IR  (PageParagraphs: text + bbox + structure)
   (kept: "view original")            │  ◀── CANONICAL (single source of truth)
                                      │
                  boundary detection ─┤  tags meta.section_id onto each paragraph, by class:
                                      │    • structured-PDF  → numbering geometry (segment.ts)      [shipped]
                                      │    • structured-md    → ATX headings (parse-markdown.ts)      [shipped]
                                      │    • verse            → authored verse numbers (no embeds)    [proto]
                                      │    • prose            → semantic valleys (needs embeds)       [proto]
                                      │    • fallback         → LLM grouping (scanned/geometry-hostile)
                                      │
                                      ├──▶ chunks (text + bbox + section)  ──▶ embed ──▶ retrieval
                                      ├──▶ markdown VIEW (tables, mermaid)  ──▶ inspector / MCP / QuoteAI  [planned]
                                      └──▶ bbox overlay map                 ──▶ inspector highlight-on-source

One chunker (chunkDeterministically) consumes the tagged IR for ALL classes. Markdown and the overlay are derivations of the IR, not inputs to chunking.
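
A minimal sketch of what "one chunker over the tagged IR" could look like: consecutive paragraphs sharing a section id merge into one chunk, and their boxes are unioned per page. This is not the real `chunkDeterministically`; the name `chunkBySectionSketch`, the `TaggedParagraph` fields, and the per-page bbox union are assumptions for illustration.

```typescript
// Hypothetical sketch; NOT the real chunkDeterministically.
interface BBox { x0: number; y0: number; x1: number; y1: number }
interface TaggedParagraph { text: string; page: number; bbox: BBox; sectionId: string }
interface Region { page: number; bbox: BBox }
interface Chunk { sectionId: string; text: string; regions: Region[] }

function unionBBox(a: BBox, b: BBox): BBox {
  return {
    x0: Math.min(a.x0, b.x0),
    y0: Math.min(a.y0, b.y0),
    x1: Math.max(a.x1, b.x1),
    y1: Math.max(a.y1, b.y1),
  };
}

// Class-agnostic: it only reads sectionId, however that id was produced.
function chunkBySectionSketch(tagged: TaggedParagraph[]): Chunk[] {
  const chunks: Chunk[] = [];
  for (const p of tagged) {
    const last = chunks.length > 0 ? chunks[chunks.length - 1] : undefined;
    if (last !== undefined && last.sectionId === p.sectionId) {
      last.text += "\n\n" + p.text;
      const region = last.regions[last.regions.length - 1];
      if (region.page === p.page) {
        region.bbox = unionBBox(region.bbox, p.bbox);
      } else {
        // Section continues on a new page: start a new region there.
        last.regions.push({ page: p.page, bbox: { ...p.bbox } });
      }
    } else {
      chunks.push({
        sectionId: p.sectionId,
        text: p.text,
        regions: [{ page: p.page, bbox: { ...p.bbox } }],
      });
    }
  }
  return chunks;
}
```

Because the function never inspects how `sectionId` was assigned, numbering geometry, ATX headings, verse numbers, and similarity valleys can all feed the same code path.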

System Boundary

Owns: the IR schema (PageParagraphs/Paragraph + bbox + meta.section_id); per-class boundary detection (segment.ts, parse-markdown.ts, semantic-chunk.ts, verse tagging); the deterministic chunker (chunkDeterministically); the route decision (route.ts); the derived markdown view + provenance map (planned).

Outside the boundary: raw parsing/rendering (parse.ts, render.ts); the embedder (embed.ts); retrieval/ranking (@autri/retrieval); the inspector UI; the MCP servers. These are consumers/producers that interface with the IR but aren't part of the chunking decision.

Key Interfaces

Interface	Type	Consumers
`PageParagraphs` (the IR, with bbox + `meta.section_id`)	Cached JSON artifact	chunker, markdown view, bbox overlay, inspector
`toTaggedParagraphs(pages)` → `TaggedParagraph[]`	Function	the single chunker entry point
`chunkDeterministically(tagged)` → `PageGrouping`	Function	`writeExtraction` (chunk rows + bbox)
`routeUnit({sectionAware, docClass, coverage})`	Function	extractor — code vs LLM fallback
markdown view (derived) + provenance map	Artifact (planned)	inspector display, MCP, QuoteAI
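
The route decision might be sketched as below. Only the parameter names `sectionAware`, `docClass`, and `coverage` come from the table above; their meanings, the `STRUCTURED` and `UNKNOWN` class labels, the return values, and the 0.9 coverage threshold are all assumptions, and the sketch reflects the target state after Bucket 1 (prose and verse routed to code), not current production behavior.

```typescript
// Hypothetical sketch of a route decision; NOT the real route.ts.
type DocClass = "STRUCTURED" | "PROSE" | "FLAT-VERSE" | "UNKNOWN";
type Route = "code" | "llm";

interface RouteInput {
  sectionAware: boolean; // assumed meaning: deterministic boundary detection is enabled
  docClass: DocClass;
  coverage: number; // assumed meaning: fraction (0..1) of paragraphs that received a section_id
}

function routeUnitSketch(
  { sectionAware, docClass, coverage }: RouteInput,
  minCoverage = 0.9, // assumed threshold
): Route {
  if (!sectionAware) return "llm";
  if (docClass === "UNKNOWN") return "llm";
  // Fall back to the LLM only when boundary detection tagged too little of the document.
  return coverage >= minCoverage ? "code" : "llm";
}
```

The design intent this illustrates is that the LLM route is reached by failing a check, never by default.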

Design Buckets (Scope)

The work splits into four buckets, ordered by how much design (vs wiring) each needs.

Bucket 1 — Wire deterministic prose + verse onto the IR (mostly wiring; one real snag). Route PROSE and FLAT-VERSE to code (tag the IR in place → chunkDeterministically), making the LLM a true fallback. Verse is free (authored boundaries, no embeddings). The snag: prose semantic grouping needs paragraph embeddings, but the pipeline doesn't embed until after chunking (parse → structure → chunk → embed). Wiring prose requires either reordering (embed paragraphs at structure-time) or a dedicated grouping phase — an architectural change with a cost (prose docs embedded ~twice). Use Path 2 (tag IR in place), never the literal-markdown round-trip.

Bucket 2 — Prove semantic grouping earns its complexity (the decisive A/B). We proved valley-grouping makes plausible boundaries; we have not proven it beats a naive sentence-window-with-overlap baseline on retrieval. Windowing needs no embedding-at-structure-time (kills Bucket 1's snag). Commit to A/B grouping vs windowing on the prose gold before paying for the embedding-phase change. If windowing ties, prose collapses to the verse-simple case.
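
For reference, the naive baseline in this A/B is small enough to sketch in full. The sentence splitter here is deliberately crude (it splits on terminal punctuation and will mis-handle abbreviations), and the window size and overlap defaults are assumptions, not tuned values.

```typescript
// Hypothetical baseline sketch: sentence windows with overlap, no embeddings needed.
function splitSentences(text: string): string[] {
  // Runs of non-terminators followed by terminators, plus any unterminated tail.
  const matches = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g);
  if (matches === null) return [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

function sentenceWindows(sentences: string[], size = 5, overlap = 2): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap;
  const windows: string[] = [];
  for (let start = 0; start < sentences.length; start += step) {
    windows.push(sentences.slice(start, start + size).join(" "));
    // Stop once a window has reached the end, to avoid a redundant tail window.
    if (start + size >= sentences.length) break;
  }
  return windows;
}
```

Each window is a chunk; nothing in this path requires paragraph embeddings before chunking, which is why a tie on retrieval would remove the pipeline reorder from Bucket 1.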

Bucket 3 — Paragraph granularity (net-new). structurePage's gap-based splitter collapses indent-separated PDF prose into one blob per page (confirmed on the NASA PDF). Prose chunking needs finer units: indent-aware paragraph splitting and/or sentence segmentation. Gates prose-PDF quality regardless of Bucket 2.
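
One possible shape for indent-aware splitting, as a sketch: treat the median left edge of a page's lines as the body margin, and start a new paragraph whenever a line is indented past that margin. The `Line` shape, the function name, and the 8-unit tolerance are assumptions; this is not how `structurePage` works today, and it ignores block quotes, lists, and hanging indents.

```typescript
// Hypothetical sketch of indent-aware paragraph splitting for one page.
interface Line { text: string; x0: number } // x0 = left edge in page units

function splitByIndent(lines: Line[], minIndent = 8): string[] {
  if (lines.length === 0) return [];
  // Median left edge approximates the body margin when most lines are not indented.
  const xs = lines.map((l) => l.x0).sort((a, b) => a - b);
  const margin = xs[Math.floor(xs.length / 2)];

  const paras: string[][] = [];
  for (const line of lines) {
    const indented = line.x0 - margin >= minIndent;
    if (paras.length === 0 || indented) {
      paras.push([line.text]); // first-line indent opens a new paragraph
    } else {
      paras[paras.length - 1].push(line.text);
    }
  }
  return paras.map((p) => p.join(" "));
}
```

A gap-based splitter sees no vertical gap between such paragraphs and returns one blob; keying on the horizontal offset recovers the boundaries the typesetter actually printed.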

Bucket 4 — The markdown derived-view + provenance (net-new; the substrate play). Serialize the IR → legible markdown (native tables, mermaid for diagrams) as the display/MCP/QuoteAI artifact, with a provenance map so each markdown span highlights its bbox region on the source. The strategic vision.md / platform-thesis piece. Touches no chunking; this is the "what do we do with markdown" half.
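
A sketch of one candidate provenance representation, a parallel offset-to-bbox map built while the markdown is rendered. The representation is an open question in this doc, so this shows only one option; all names and shapes here are hypothetical, and tables, figures, and mermaid blocks are not handled.

```typescript
// Hypothetical sketch: render IR paragraphs to markdown and record where each one landed.
interface BBox { x0: number; y0: number; x1: number; y1: number }
interface IRParagraph { id: string; page: number; text: string; bbox: BBox; heading?: boolean }
interface Span { start: number; end: number; paragraphId: string; page: number; bbox: BBox }

function renderWithProvenance(paras: IRParagraph[]): { markdown: string; spans: Span[] } {
  let markdown = "";
  const spans: Span[] = [];
  for (const p of paras) {
    if (markdown.length > 0) markdown += "\n\n";
    if (p.heading === true) markdown += "## "; // markup is outside the span
    const start = markdown.length;
    markdown += p.text;
    spans.push({ start, end: markdown.length, paragraphId: p.id, page: p.page, bbox: p.bbox });
  }
  return { markdown, spans };
}

// Maps a character offset in the markdown back to its source region, if any.
function spanAt(spans: Span[], offset: number): Span | undefined {
  return spans.find((s) => offset >= s.start && offset < s.end);
}
```

The markdown stays a pure derivation: the spans point back into the IR, so regenerating the view never changes what the chunker or the overlay reads.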

Open Questions / Red-Team Targets

Bucket 2 is the keystone: does semantic grouping beat naive sentence-windowing on retrieval? If not, most of Bucket 1's complexity evaporates. Red-team should demand this A/B before any pipeline reorder.
Where does embedding-at-structure-time live, and is the ~2× prose embed cost acceptable? Can we embed once at the retrieval unit (feedback_retrieval_vs_display_granularity)?
Prose retrieval is noisier than authored-structure (character-name flooding, facts diffuse across passages). Is hybrid lexical+vector retrieval in-scope here or a separate retrieval sub-system concern?
Prose gold is chunker-specific (synthetic passage ids; re-chunk → regenerate). Should gold anchor to passage sets / should the harness give neighbor credit for prose? (pipeline-eval-harness.md.)
Provenance representation (Bucket 4): how does a markdown span map back to bbox regions — inline anchors, a parallel offset→bbox map, or IR-paragraph ids? Tables/figures complicate it (markdown table cell ↔ source region).
Markdown view scope: is it persisted, regenerated on demand, or both? Who owns it (this sub-system vs inspector)?
Negative/precision queries are unscoreable under the current harness (no similarity threshold; logged in pipeline-eval-harness.md). Out of scope here but blocks precision claims.

Related Epics

Epics are to be created from this doc after red/blue-team. Candidates: "Wire deterministic prose+verse routing" (Bucket 1+3), "Markdown view + provenance" (Bucket 4).

Epic	Doc	Status	Summary
(TBD — prose/verse routing)	—	Planned	Bucket 1 + 3: route non-structured to code, indent-aware paragraphs
(TBD — markdown view)	—	Planned	Bucket 4: derived markdown + provenance substrate

Cross-Cutting Concerns

Concern	How This Sub-system Is Affected
bbox provenance	The reason the IR is canonical; every derivation (chunks, overlay, markdown view) must preserve or re-link to it
Cost / Max-billed LLM	Authored-boundary classes are $0; prose adds an embedding pass — measure before committing (Bucket 2)
Eval harness	All design changes are gated by retrieval recall on per-class gold (`pipeline-eval-harness.md`, `eval-corpus-and-doe.md`)
Inspector (product wedge)	Markdown view + bbox overlay are how trust is shown; a markdown-canonical design would break the overlay
MCP / QuoteAI substrate	The derived markdown view is the external representation these consume (`vision.md`)

Decisions Log

Date	Decision	Rationale	Alternatives Considered
2026-06-03	IR (`PageParagraphs` + bbox) is canonical; markdown is a derived view, never the chunker's input	Literal-markdown round-trip drops bbox (0/41 chunks kept geometry on a real PDF); the IR already holds text+bbox+structure together	Markdown-canonical + bbox sidecar — rejected: reconstructs the IR with extra steps
2026-06-03	One chunker (`chunkDeterministically`) over the tagged IR for all classes	Unification is the tagged IR, not markdown; proven on structured/verse/prose	Per-class chunkers; LLM-for-all (frozen baseline)
2026-06-03	Authored-boundary classes (structured, verse) chunk with zero embeddings; only prose needs semantic grouping	Verse numbers / numbering are printed boundaries; measured $0 + bbox-preserving on Genesis & Constitution	Treat verse as prose (semantic grouping) — unnecessary cost
2026-06-03	Markdown stays at the edges: input adapter + derived display/substrate view	The chunking win is boundary-awareness, not markdown; markdown's value is legibility + external representation	Markdown as the pipeline intermediary — rejected (bbox loss, redundant re-parse)
2026-06-03 (proposed)	A/B semantic grouping vs sentence-windowing before wiring prose	Avoid paying embedding-at-structure-time cost if windowing ties on retrieval	Wire grouping directly (assumes it wins — unproven)

Known Issues / Tech Debt

Issue	Severity	Notes
`structurePage` collapses indent-separated PDF prose into page-blobs	High	Bucket 3; confirmed on NASA PDF (24 paras / 20 pages)
Prose gold is chunker-specific (synthetic ids)	Medium	Re-chunk → regenerate; consider passage-set anchoring / neighbor credit
Pure-vector retrieval floods on salient entity tokens in prose	Medium	"who does X marry" pulls every passage naming X; needs hybrid lexical+vector
Negative/precision queries unscoreable	Medium	Harness has no similarity threshold (`pipeline-eval-harness.md:retrieval-db.ts:44`)
Chapter/parent-heading text chunks (e.g. "Chapter 1") embedded as low-value chunks	Low	Pre-existing behavior shared with the Constitution path

Sub-system docs define architectural boundaries. This one defines how every doc class becomes retrievable chunks over the canonical IR, and where markdown lives (edges, not middle). Next: /hl:red-team this doc, then /hl:blue-team to scope, then epics.
