
Unified Chunking & the Markdown Representation Layer

Sub-system design doc. How Autri turns ANY document — structured, prose, or verse — into retrievable chunks through ONE code path over a canonical, bbox-carrying intermediate representation (IR), with markdown as an input adapter and a derived display/substrate view — never the thing chunking reads.

Status: DRAFT for red-team. Created 2026-06-03 from the prose/markdown spike (see status/next.md and sub-systems/pipeline-eval-harness.md). Three doc classes validated end-to-end this session; the open questions below are the red-team targets.


Risks & Constraints

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Semantic grouping doesn't beat naive sentence-windowing on retrieval, yet costs embedding-at-structure-time complexity | Medium | High (could be wasted architecture) | A/B grouping vs windowing on prose gold BEFORE wiring; gate the pipeline change on the result (Bucket 2) |
| Embedding-at-structure-time reorders the pipeline + ~2× embed cost on prose docs | High | Medium | Embed once at the retrieval unit where possible; measure on real docs; prose-only path |
| Paragraph segmentation collapses indent-separated PDF prose into one page-blob | Confirmed (NASA PDF: 24 "paras" for 20 pages) | High (starves the chunker) | Indent/sentence-aware paragraph splitting before prose chunking is trusted (Bucket 3) |
| Markdown-as-canonical silently drops bbox provenance → inspector overlay breaks | Confirmed (0/41 chunks kept bbox via literal-markdown path) | High | Architecture commits IR canonical; markdown derived, never the chunker's input |
| Prose retrieval noisier than authored-structure (character-name flooding, diffuse facts) | Confirmed (0.80 vs 1.0 recall@10) | Medium | Hybrid lexical+vector retrieval + prose-gold passage-set anchoring (deferred follow-ons) |

Current Status

| Capability | Status |
|---|---|
| Canonical IR (PageParagraphs: text + bbox + structure) | Shipped |
| Structured docs → deterministic section chunk on the IR | Shipped (default on; Constitution, FIA/STEM regs) |
| Markdown as an input adapter (parse-markdown.ts) | Shipped (Constitution) |
| Prose → deterministic semantic chunk (semantic-chunk.ts) | Prototyped, not wired; PROSE falls to the LLM route in prod |
| Verse → deterministic authored-boundary chunk | Prototyped, not wired; FLAT-VERSE falls to the LLM route in prod |
| bbox-preserving tag-in-place chunking (Path 2) | Proven at IR level, not wired |
| Markdown as a derived output / display / substrate view + provenance | Planned (this doc, Bucket 4) |
| Hybrid lexical+vector retrieval | Planned (deferred follow-on) |

The Story

Autri's chunking started as an LLM grouping pass (an LLM decides chunk boundaries). The pipeline-eval harness then let us measure alternatives, and a sequence of results reframed the architecture:

  1. Structured docs don't need the LLM. Their boundaries are printed on the page (numbering geometry / authored headings). segment.ts reads them deterministically → recall@10 1.0, MRR up, chunk count down, $0. (Shipped.)
  2. "Markdown-first" reframe — resolved by measurement. Markdown is the right normalized form for authored-structure docs, but produced by code, not an LLM (an LLM PDF→markdown pass is slower/costlier/worse for text-layer docs). Chunking on authored boundaries was the win — markdown was the encoding of those boundaries, not the cause.
  3. Prose (this session). Prose has no authored boundaries, so structure must be inferred: embed paragraphs → group at adjacency-similarity valleys → cap size. Validated on a novel (recall@10 0.80; misses diagnosed as real prose characteristics, not chunker bugs).
  4. The bbox finding (this session). Serializing prose to literal markdown text and re-parsing it drops bbox (0/41 chunks kept geometry on a real PDF). Tagging the existing IR in place keeps it (41/41). This is the load-bearing architectural fork: markdown belongs at the edges, not the middle.
  5. The IR was the unification all along. The "one shape for every doc class" is the tagged IR (PageParagraphs with a section_id per paragraph feeding one chunkDeterministically) — not markdown. Markdown is a derived projection of the IR.
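
The prose recipe in step 3 (embed paragraphs → group at adjacency-similarity valleys → cap size) can be sketched roughly as below. This is an illustrative sketch, not the code in semantic-chunk.ts: the function names, the `valley` threshold, and the `maxChars` cap are assumptions, and the embeddings are taken as already computed.

```typescript
// Hypothetical sketch: names and default thresholds are illustrative only.
type Vec = number[];

function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

/**
 * Returns one section index per paragraph. A new section starts where the
 * similarity between neighbours is a local minimum below `valley`, or where
 * adding the paragraph would push the running size past `maxChars`.
 */
function groupAtValleys(
  texts: string[],
  embeddings: Vec[],
  valley = 0.5,
  maxChars = 2000,
): number[] {
  // sims[i] = similarity between paragraph i and paragraph i + 1
  const sims: number[] = [];
  for (let i = 0; i + 1 < texts.length; i++) {
    sims.push(cosine(embeddings[i], embeddings[i + 1]));
  }

  const sections: number[] = [];
  let current = 0;
  let size = 0;
  for (let i = 0; i < texts.length; i++) {
    if (i > 0) {
      const s = sims[i - 1];
      const left = i >= 2 ? sims[i - 2] : Infinity;
      const right = i < sims.length ? sims[i] : Infinity;
      const isValley = s < valley && s <= left && s <= right;
      const overCap = size + texts[i].length > maxChars;
      if (isValley || overCap) {
        current++;
        size = 0;
      }
    }
    sections.push(current);
    size += texts[i].length;
  }
  return sections;
}

// Two topics: the similarity dips between paragraphs 1 and 2.
const exampleSections = groupAtValleys(
  ["a", "b", "c", "d"],
  [[1, 0], [1, 0], [0, 1], [0, 1]],
); // [0, 0, 1, 1]
```

The returned indices play the role of a per-paragraph section tag, so the paragraphs (and their bbox) are never re-serialized.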

What Is This Sub-system?

This sub-system owns the path from a parsed document to retrievable chunks, plus the canonical representation that path operates on. Its thesis: a single intermediate representation (the IR) carries text + page geometry (bbox) + structure for every document; per-class boundary detection tags semantic units onto that IR; and one deterministic chunker turns the tagged IR into chunks. The LLM is the fallback (geometry-hostile / scanned docs only), not the engine. Markdown is two things at the edges of this sub-system — an input adapter (when a doc arrives as markdown) and a derived display/substrate view (the legible artifact the inspector, MCP, and QuoteAI consume) — but it is never the canonical store and never what the chunker reads, because markdown text cannot carry bbox.

Validation Evidence — three classes, one path

All three classes were validated this session through the unified IR-tagging path (chunkDeterministically), at $0 LLM cost:

| Class | Boundary source | Embeddings? | Result |
|---|---|---|---|
| Structured (US Constitution) | authored numbering / headings | No | recall@10 1.0, $0, green |
| Verse (Genesis, WEB) | authored verse numbers | No | 1524 verses → 129 chunks, 129/129 keep bbox, tight sizes (med 1453 char) |
| Prose (novel, ~88k words) | inferred (similarity valleys) | Yes | recall@10 0.80; runaway chunk killed; misses = real prose traits |

Key cross-class finding: authored boundaries (structured + verse) need zero embeddings; only prose needs semantic grouping. This directly shapes Bucket 1/2.
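
For readers outside the harness, the metrics quoted above can be computed along these lines. This doc does not give the harness's exact definitions, so the sketch assumes the common ones: recall@k as the fraction of gold passage ids found in the top k, and MRR as the mean of 1/rank of the first gold hit. All function names are hypothetical.

```typescript
// Hypothetical sketch: assumes gold ids are unique and the common metric definitions.

/** Fraction of gold passage ids that appear in the top k results. */
function recallAtK(ranked: string[], gold: string[], k = 10): number {
  if (gold.length === 0) return 0;
  const top = ranked.slice(0, k);
  const found = gold.filter((id) => top.indexOf(id) !== -1).length;
  return found / gold.length;
}

/** 1 / rank of the first gold hit, or 0 when no gold id is retrieved. */
function reciprocalRank(ranked: string[], gold: string[]): number {
  for (let i = 0; i < ranked.length; i++) {
    if (gold.indexOf(ranked[i]) !== -1) return 1 / (i + 1);
  }
  return 0;
}

/** Averages a per-query metric over the query set (MRR = mean of reciprocalRank). */
function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}
```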


Architecture Diagram

  source PDF / docx / md ──parse──▶  IR  (PageParagraphs: text + bbox + structure)
   (kept: "view original")            │  ◀── CANONICAL (single source of truth)
                                      │
                  boundary detection ─┤  tags meta.section_id onto each paragraph, by class:
                                      │    • structured-PDF  → numbering geometry (segment.ts)      [shipped]
                                      │    • structured-md    → ATX headings (parse-markdown.ts)      [shipped]
                                      │    • verse            → authored verse numbers (no embeds)    [proto]
                                      │    • prose            → semantic valleys (needs embeds)       [proto]
                                      │    • fallback         → LLM grouping (scanned/geometry-hostile)
                                      │
                                      ├──▶ chunks (text + bbox + section)  ──▶ embed ──▶ retrieval
                                      ├──▶ markdown VIEW (tables, mermaid)  ──▶ inspector / MCP / QuoteAI  [planned]
                                      └──▶ bbox overlay map                 ──▶ inspector highlight-on-source

One chunker (chunkDeterministically) consumes the tagged IR for ALL classes. Markdown and the overlay are derivations of the IR, not inputs to chunking.
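
A minimal sketch of why tagging in place keeps geometry: the chunker only groups paragraphs that already carry a bbox, so nothing has to be re-derived. Only bbox and meta.section_id are named in this doc; the BBox fields, the Chunk shape, and chunkBySection are hypothetical, and the real chunkDeterministically may do more (size caps, for instance).

```typescript
// Hypothetical shapes: field names beyond bbox and meta.section_id are assumptions.
interface BBox { page: number; x0: number; y0: number; x1: number; y1: number }
interface TaggedParagraph { text: string; bbox: BBox; meta: { section_id: string } }
interface Chunk { section_id: string; text: string; bboxes: BBox[] }

/** Groups consecutive paragraphs sharing a section_id into one chunk, keeping every bbox. */
function chunkBySection(tagged: TaggedParagraph[]): Chunk[] {
  const chunks: Chunk[] = [];
  for (const p of tagged) {
    const last = chunks[chunks.length - 1];
    if (last && last.section_id === p.meta.section_id) {
      last.text += "\n\n" + p.text;
      last.bboxes.push(p.bbox);
    } else {
      chunks.push({ section_id: p.meta.section_id, text: p.text, bboxes: [p.bbox] });
    }
  }
  return chunks;
}
```

Because the boundary detectors differ only in how they assign section_id, every class can feed the same grouping step.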

System Boundary

Owns: the IR schema (PageParagraphs/Paragraph + bbox + meta.section_id); per-class boundary detection (segment.ts, parse-markdown.ts, semantic-chunk.ts, verse tagging); the deterministic chunker (chunkDeterministically); the route decision (route.ts); the derived markdown view + provenance map (planned).

Outside the boundary: raw parsing/rendering (parse.ts, render.ts); the embedder (embed.ts); retrieval/ranking (@autri/retrieval); the inspector UI; the MCP servers. These are consumers/producers that interface with the IR but aren't part of the chunking decision.

Key Interfaces

| Interface | Type | Consumers |
|---|---|---|
| PageParagraphs (the IR, with bbox + meta.section_id) | Cached JSON artifact | chunker, markdown view, bbox overlay, inspector |
| toTaggedParagraphs(pages) → TaggedParagraph[] | Function | the single chunker entry point |
| chunkDeterministically(tagged) → PageGrouping | Function | writeExtraction (chunk rows + bbox) |
| routeUnit({sectionAware, docClass, coverage}) | Function | extractor (code vs LLM fallback) |
| markdown view (derived) + provenance map | Artifact (planned) | inspector display, MCP, QuoteAI |

Design Buckets (Scope)

The work splits into four buckets, ordered by how much design (vs wiring) each needs.

Bucket 1 — Wire deterministic prose + verse onto the IR (mostly wiring; one real snag). Route PROSE and FLAT-VERSE to code (tag the IR in place → chunkDeterministically), making the LLM a true fallback. Verse is free (authored boundaries, no embeddings). The snag: prose semantic grouping needs paragraph embeddings, but the pipeline doesn't embed until after chunking (parse → structure → chunk → embed). Wiring prose requires either reordering (embed paragraphs at structure-time) or a dedicated grouping phase — an architectural change with a cost (prose docs embedded ~twice). Use Path 2 (tag IR in place), never the literal-markdown round-trip.
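
The routing change Bucket 1 proposes could look roughly like this. It is a sketch of the proposed behavior, not the current route.ts: the "STRUCTURED" and "UNKNOWN" labels, the reading of `coverage` as the fraction of text the deterministic path could tag, and the 0.9 threshold are all assumptions.

```typescript
// Hypothetical sketch of the proposed route decision; not the shipped routeUnit.
type DocClass = "STRUCTURED" | "PROSE" | "FLAT-VERSE" | "UNKNOWN";
type Route = "code" | "llm";

interface RouteInput {
  sectionAware: boolean; // authored boundaries were detected
  docClass: DocClass;
  coverage: number; // assumed: share of text tagged deterministically, 0..1
}

function routeUnitSketch(
  { sectionAware, docClass, coverage }: RouteInput,
  minCoverage = 0.9, // assumed threshold
): Route {
  // Scanned or geometry-hostile input: too little text could be tagged by code.
  if (coverage < minCoverage) return "llm";
  // Bucket 1 proposal: prose and verse go to the deterministic path.
  if (docClass === "PROSE" || docClass === "FLAT-VERSE") return "code";
  // Already shipped for structured docs with authored boundaries.
  if (docClass === "STRUCTURED" && sectionAware) return "code";
  return "llm";
}
```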

Bucket 2 — Prove semantic grouping earns its complexity (the decisive A/B). We proved valley-grouping makes plausible boundaries; we have not proven it beats a naive sentence-window-with-overlap baseline on retrieval. Windowing needs no embedding-at-structure-time (kills Bucket 1's snag). Commit to A/B grouping vs windowing on the prose gold before paying for the embedding-phase change. If windowing ties, prose collapses to the verse-simple case.
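
For reference, the windowing baseline in this A/B needs nothing beyond sentence splitting. A minimal sketch, assuming a naive punctuation-based splitter and illustrative `size` / `overlap` defaults (neither is specified in this doc):

```typescript
// Hypothetical baseline: function names and defaults are illustrative only.

/** Naive splitter on ., ! and ?; abbreviations and quoted speech will mis-split. */
function splitSentences(text: string): string[] {
  const parts = text.match(/[^.!?]+[.!?]*/g) ?? [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

/** Fixed-size windows of `size` sentences, each sharing `overlap` sentences with the previous one. */
function windowSentences(sentences: string[], size = 5, overlap = 1): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("need 0 <= overlap < size");
  }
  const step = size - overlap;
  const windows: string[] = [];
  for (let start = 0; start < sentences.length; start += step) {
    windows.push(sentences.slice(start, start + size).join(" "));
    if (start + size >= sentences.length) break; // this window reached the end
  }
  return windows;
}
```

No embeddings are involved before chunking, which is why a tie on retrieval would remove the need for embedding at structure time.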

Bucket 3 — Paragraph granularity (net-new). structurePage's gap-based splitter collapses indent-separated PDF prose into one blob per page (confirmed on the NASA PDF). Prose chunking needs finer units: indent-aware paragraph splitting and/or sentence segmentation. Gates prose-PDF quality regardless of Bucket 2.
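
One possible shape for indent-aware splitting, as a sketch only: treat the leftmost line start on the page as the body margin and open a new paragraph at each line indented past it. The Line shape and the `minIndent` tolerance are assumptions, and this naive version would mishandle hanging indents, block quotes, and multi-column pages.

```typescript
// Hypothetical sketch: Line fields and the minIndent default are assumptions.
interface Line { text: string; x0: number; y0: number }

/** Starts a new paragraph at every line whose left edge is indented past the body margin. */
function splitOnIndent(lines: Line[], minIndent = 8): Line[][] {
  if (lines.length === 0) return [];
  // Body margin = the leftmost line start on the page.
  const margin = Math.min(...lines.map((l) => l.x0));
  const paragraphs: Line[][] = [];
  for (const line of lines) {
    const indented = line.x0 - margin >= minIndent;
    if (indented || paragraphs.length === 0) {
      paragraphs.push([line]);
    } else {
      paragraphs[paragraphs.length - 1].push(line);
    }
  }
  return paragraphs;
}
```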

Bucket 4 — The markdown derived-view + provenance (net-new; the substrate play). Serialize the IR → legible markdown (native tables, mermaid for diagrams) as the display/MCP/QuoteAI artifact, with a provenance map so each markdown span highlights its bbox region on the source. The strategic vision.md / platform-thesis piece. Touches no chunking; this is the "what do we do with markdown" half.
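
To make the provenance idea concrete, here is a sketch of one candidate representation, a parallel offset → bbox map built while serializing. It is not a decision: the shapes and function names are hypothetical, and tables and figures are not handled.

```typescript
// Hypothetical sketch of a parallel offset → bbox provenance map.
interface BBox { page: number; x0: number; y0: number; x1: number; y1: number }
interface Paragraph { text: string; bbox: BBox }
// [start, end) character offsets into the rendered markdown string
interface Span { start: number; end: number; bbox: BBox }

/** Serializes paragraphs to markdown while recording which offsets came from which bbox. */
function renderWithProvenance(paragraphs: Paragraph[]): { markdown: string; spans: Span[] } {
  let markdown = "";
  const spans: Span[] = [];
  for (const p of paragraphs) {
    if (markdown.length > 0) markdown += "\n\n";
    const start = markdown.length;
    markdown += p.text;
    spans.push({ start, end: markdown.length, bbox: p.bbox });
  }
  return { markdown, spans };
}

/** Looks up the source region for a character offset in the markdown view. */
function bboxAt(spans: Span[], offset: number): BBox | undefined {
  for (const s of spans) {
    if (offset >= s.start && offset < s.end) return s.bbox;
  }
  return undefined;
}
```

The map is derived from the IR at render time, so the markdown string itself never has to carry geometry.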


Open Questions / Red-Team Targets

  1. Bucket 2 is the keystone: does semantic grouping beat naive sentence-windowing on retrieval? If not, most of Bucket 1's complexity evaporates. Red-team should demand this A/B before any pipeline reorder.
  2. Where does embedding-at-structure-time live, and is the ~2× prose embed cost acceptable? Can we embed once at the retrieval unit (feedback_retrieval_vs_display_granularity)?
  3. Prose retrieval is noisier than authored-structure (character-name flooding, facts diffuse across passages). Is hybrid lexical+vector retrieval in-scope here or a separate retrieval sub-system concern?
  4. Prose gold is chunker-specific (synthetic passage ids; re-chunk → regenerate). Should gold anchor to passage sets / should the harness give neighbor credit for prose? (pipeline-eval-harness.md.)
  5. Provenance representation (Bucket 4): how does a markdown span map back to bbox regions — inline anchors, a parallel offset→bbox map, or IR-paragraph ids? Tables/figures complicate it (markdown table cell ↔ source region).
  6. Markdown view scope: is it persisted, regenerated on demand, or both? Who owns it (this sub-system vs inspector)?
  7. Negative/precision queries are unscoreable under the current harness (no similarity threshold; logged in pipeline-eval-harness.md). Out of scope here but blocks precision claims.

Epics are to be created from this doc after red/blue-team. Candidates: "Wire deterministic prose+verse routing" (Bucket 1 + 3), "Markdown view + provenance" (Bucket 4).

| Epic | Doc | Status | Summary |
|---|---|---|---|
| (TBD: prose/verse routing) | | Planned | Bucket 1 + 3: route non-structured to code, indent-aware paragraphs |
| (TBD: markdown view) | | Planned | Bucket 4: derived markdown + provenance substrate |

Cross-Cutting Concerns

| Concern | How This Sub-system Is Affected |
|---|---|
| bbox provenance | The reason the IR is canonical; every derivation (chunks, overlay, markdown view) must preserve or re-link to it |
| Cost / Max-billed LLM | Authored-boundary classes are $0; prose adds an embedding pass, so measure before committing (Bucket 2) |
| Eval harness | All design changes are gated by retrieval recall on per-class gold (pipeline-eval-harness.md, eval-corpus-and-doe.md) |
| Inspector (product wedge) | Markdown view + bbox overlay are how trust is shown; a markdown-canonical design would break the overlay |
| MCP / QuoteAI substrate | The derived markdown view is the external representation these consume (vision.md) |

Decisions Log

| Date | Decision | Rationale | Alternatives Considered |
|---|---|---|---|
| 2026-06-03 | IR (PageParagraphs + bbox) is canonical; markdown is a derived view, never the chunker's input | Literal-markdown round-trip drops bbox (0/41 chunks kept geometry on a real PDF); the IR already holds text+bbox+structure together | Markdown-canonical + bbox sidecar (rejected: reconstructs the IR with extra steps) |
| 2026-06-03 | One chunker (chunkDeterministically) over the tagged IR for all classes | Unification is the tagged IR, not markdown; proven on structured/verse/prose | Per-class chunkers; LLM-for-all (frozen baseline) |
| 2026-06-03 | Authored-boundary classes (structured, verse) chunk with zero embeddings; only prose needs semantic grouping | Verse numbers / numbering are printed boundaries; measured $0 + bbox-preserving on Genesis & Constitution | Treat verse as prose (semantic grouping): unnecessary cost |
| 2026-06-03 | Markdown stays at the edges: input adapter + derived display/substrate view | The chunking win is boundary-awareness, not markdown; markdown's value is legibility + external representation | Markdown as the pipeline intermediary: rejected (bbox loss, redundant re-parse) |
| 2026-06-03 (proposed) | A/B semantic grouping vs sentence-windowing before wiring prose | Avoid paying embedding-at-structure-time cost if windowing ties on retrieval | Wire grouping directly (assumes it wins; unproven) |

Known Issues / Tech Debt

| Issue | Severity | Notes |
|---|---|---|
| structurePage collapses indent-separated PDF prose into page-blobs | High | Bucket 3; confirmed on NASA PDF (24 paras / 20 pages) |
| Prose gold is chunker-specific (synthetic ids) | Medium | Re-chunk → regenerate; consider passage-set anchoring / neighbor credit |
| Pure-vector retrieval floods on salient entity tokens in prose | Medium | "who does X marry" pulls every passage naming X; needs hybrid lexical+vector |
| Negative/precision queries unscoreable | Medium | Harness has no similarity threshold (pipeline-eval-harness.md:retrieval-db.ts:44) |
| Chapter/parent-heading text chunks (e.g. "Chapter 1") embedded as low-value chunks | Low | Pre-existing behavior shared with the Constitution path |

Sub-system docs define architectural boundaries. This one defines how every doc class becomes retrievable chunks over the canonical IR, and where markdown lives (edges, not middle). Next: /hl:red-team this doc, then /hl:blue-team to scope, then epics.
