
Autri — Functional MVP Spec

Created 2026-05-15. Scopes the Functional MVP — the smallest version of Autri that validates "is this product worth shipping?" through dogfooding by Dan and a hand-onboarded test cohort (STEM Racing kids + Dan's dad). Local-only, no commerce. The Deployed MVP (auth, onboarding for strangers, AWS deploy, Stripe, public landing) is a separate doc, deferred until Functional MVP validates.

Refined iteratively; once each H2 is tight, epic docs spin out per item.

The 5 MVP items

Each H2 below gets refined (design questions, dependencies, implementation direction) before spinning out to its own epic doc.

Goal

Validate the product locally — to Dan first, then to a hand-onboarded test cohort — before any commerce wrap. The Brehob deal outcome (post-Andy-meeting 2026-05-15) funds the commerce wrap if it lands; either way, Functional MVP comes first. A chargeable shell wrapped around a mediocre product is worse than a great product without billing.

Two MVPs

Autri ships in two distinct MVPs stacked on each other:

| | Functional MVP (this doc) | Deployed MVP (separate doc, deferred) |
| --- | --- | --- |
| Question it answers | Is this product worth shipping? | Can we charge for it? |
| Audience | Dan + hand-onboarded test cohort | First paying customers |
| Validation surface | Local dev environment | Production (autri.ai) |
| In scope | The 5 items below + visual QA via inspector | Auth, RLS enforcement, sample KB / onboarding for strangers, AWS-native deploy, Bedrock prod path, Stripe + tier enforcement, public landing |
| Test corpora | STEM Racing PDFs + Dad's QuoteAI corpus | Add: real customer corpora as they sign on |

Current state — what's already built

Ingestion (D10/D11):

  • PDF text-layer first, vision fallback for figure-heavy/sparse pages
  • Operations-based extractor (LLM semantics, code mechanics) — verbatim content + bbox by construction
  • Haiku default, Sonnet retry button per-doc
  • Continue-on-error per-file

Retrieval (D4/D14):

  • Three KB-scoped primitives: lookup_section, fts_search, vector_search (in @autri/retrieval)
  • Agentic router via MCP (stdio for dev): @autri/mcp-doc-search
  • retrieval_log table powers source-of-result attribution
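
A minimal TypeScript sketch of how per-primitive log rows could roll up into source-of-result attribution. The row and attribution shapes (`RetrievalLogRow`, `ChunkAttribution`, `attributeChunks`) are hypothetical; the real `retrieval_log` schema is not shown in this doc. Only the three primitive names come from the list above.

```typescript
// Hypothetical shapes: the real retrieval_log schema is not shown in this doc.
type RetrievalPrimitive = "lookup_section" | "fts_search" | "vector_search";

interface RetrievalLogRow {
  chunkId: string;
  primitive: RetrievalPrimitive; // which primitive returned the chunk
  rank: number; // 1-based position within that primitive's results
}

interface ChunkAttribution {
  chunkId: string;
  primitives: RetrievalPrimitive[]; // every primitive that surfaced this chunk
  bestRank: number;
}

// Collapse per-primitive log rows into one attribution record per chunk,
// preserving first-seen chunk order. Linear scan is fine at sketch scale.
function attributeChunks(rows: RetrievalLogRow[]): ChunkAttribution[] {
  const result: ChunkAttribution[] = [];
  for (const row of rows) {
    let entry = result.filter((e) => e.chunkId === row.chunkId)[0];
    if (!entry) {
      entry = { chunkId: row.chunkId, primitives: [], bestRank: row.rank };
      result.push(entry);
    }
    if (entry.primitives.indexOf(row.primitive) === -1) {
      entry.primitives.push(row.primitive);
    }
    entry.bestRank = Math.min(entry.bestRank, row.rank);
  }
  return result;
}
```

A chunk returned by both `fts_search` and `vector_search` ends up with both names in `primitives`, which is the information a color-coded trace needs.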

UI:

  • /docs — flat doc list with confidence tier chips
  • /docs/[id] — inspector with bbox overlays + parsed text
  • /docs/[id]/query — query playground with color-coded source-of-result traces
  • /kb — KB list
  • /kb/[id] — doc list within a KB
  • /kb/[id]/chat — chat surface with markdown answers, source chips, bbox preview

Schema (D13/D14/D15):

  • Multi-tenancy FK chain
  • Logical-documents + supersession + default-latest filter

Pilot state: 1272 chunks across 2 STEM Racing PDF docs. Default org Hannah Labs, default KB STEM Racing Charlotte.

Test cohort and corpora

Two cohorts, two corpus shapes — chosen for maximum surface coverage with zero new-acquisition cost. Compounds across QuoteAI (Dad's corpus is already ingested there) and Autri.

STEM Racing kids (Charlotte team):

  • Corpus: World Finals Technical + Competition Regulations (already ingested — 1272 chunks)
  • Content shape: figure-heavy PDFs, technical regulations
  • Query pattern: rule lookup + creative-interpretation ("can we use X mod given C7.6.2?")
  • Tests: PDF path + figure handling + rule-lookup retrieval

Dad's QuoteAI corpus (Brehob quote history):

  • Subset (a): structured quote spreadsheets (XLSX) — past-quote line-items in QuoteAI's format
  • Subset (b): raw past-quote PDFs — the originals before being spreadsheet-ified
  • Content shape: business documents, table-heavy, figure-light
  • Query pattern: "what did we quote for similar scenarios?" semantic search across past work
  • Tests: XLSX path + PDF path on totally different content type than STEM (cross-product compound)

Together: two real users with two distinct content shapes, four source-type exercises (PDF technical, PDF business, XLSX structured, eventually DOCX as item 3 expands).

1. UI/UX flow

Problem. A lot of functionality exists (inspector, chat, KB nav, query playground), but the UX flow doesn't yet tell a real user "the system is working and I trust it." Today, ingestion is CLI-only, navigation is a flat doc list, and retrieval traces surface only on the playground page (not in chat).

MVP needs:

Chat as the homepage.

  • Familiar webchat pattern (Claude-web-chat feel). Center column for chat; right rail for sources (initial lean — iterate until it feels right).
  • Color-coded retrieval traces surfaced inline with each source — the differentiator can't live only on the playground page.
  • Multi-turn chat history threaded into the router.
  • Inline [N] citations rendered in the assistant answer body.
  • Empty state: if user has zero KBs, redirect to the KB upload page (or block chat input with a "create a KB first" prompt).
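
One way the inline [N] citation rendering could work, sketched in TypeScript: split the answer into text and citation segments so the UI can render each citation as a source chip. `AnswerSegment` and `splitCitations` are hypothetical names, and treating out-of-range markers as plain text is an assumption, not a decided behavior.

```typescript
interface AnswerSegment {
  kind: "text" | "citation";
  text: string;
  sourceIndex?: number; // 0-based index into the sources array (citations only)
}

// Split an assistant answer on [N] markers. Markers that do not point at a
// real source (N < 1 or N > sourceCount) are left inside the plain text.
function splitCitations(answer: string, sourceCount: number): AnswerSegment[] {
  const segments: AnswerSegment[] = [];
  const marker = /\[(\d+)\]/g;
  let cursor = 0;
  let match: RegExpExecArray | null;
  while ((match = marker.exec(answer)) !== null) {
    const n = parseInt(match[1], 10);
    if (n < 1 || n > sourceCount) continue;
    if (match.index > cursor) {
      segments.push({ kind: "text", text: answer.slice(cursor, match.index) });
    }
    segments.push({ kind: "citation", text: match[0], sourceIndex: n - 1 });
    cursor = match.index + match[0].length;
  }
  if (cursor < answer.length) {
    segments.push({ kind: "text", text: answer.slice(cursor) });
  }
  return segments;
}
```

Keeping unresolvable markers as text avoids rendering a chip that links to nothing when the model cites a source it was never given.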

Knowledge base management view.

  • List existing KBs, create new KB, delete KB (no per-KB ACL UI in Functional MVP — that's Deployed).
  • Click into a KB → high-level source-doc view (today's /kb/[id], polished).

Ingestion pipeline view (the cool one).

  • Pipeline-style flow visualization with stages: File upload → Ingestion → Agent validation → Human review → Ready
  • Pattern reference: QuoteAI's streaming-checklist for quote generation — gives the user confidence the system is doing real work.
  • Status doesn't need to be real-time-streaming (polling is fine); a status bar that updates per stage is enough.
  • Per-file failure visibility — when one doc fails (vision timeout, parse error, schema validation failure), show why without blowing up the batch.
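
A rough TypeScript sketch of the per-file status model a polling client could consume. The stage names come from the list above; `FileStatus`, `BatchSummary`, and `summarizeBatch` are hypothetical, and the rule "keep polling while any file is still in progress" is an assumption.

```typescript
// Stage names are from the pipeline list; everything else is hypothetical.
const PIPELINE_STAGES: string[] = [
  "File upload",
  "Ingestion",
  "Agent validation",
  "Human review",
  "Ready",
];

interface FileStatus {
  fileName: string;
  stageIndex: number; // index into PIPELINE_STAGES
  error?: string; // set when this file failed at stageIndex
}

interface BatchSummary {
  ready: number;
  failed: number;
  inProgress: number;
  keepPolling: boolean;
}

// Summarize a batch without letting one failed file hide the others:
// failures are counted per file, and polling stops only when nothing is
// still moving through the pipeline.
function summarizeBatch(files: FileStatus[]): BatchSummary {
  const lastStage = PIPELINE_STAGES.length - 1;
  let ready = 0;
  let failed = 0;
  let inProgress = 0;
  for (const file of files) {
    if (file.error !== undefined) failed += 1;
    else if (file.stageIndex >= lastStage) ready += 1;
    else inProgress += 1;
  }
  return { ready, failed, inProgress, keepPolling: inProgress > 0 };
}
```

A file parked in "Human review" counts as in progress here, so a real client would likely back off its polling interval rather than poll at a fixed rate.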

Source-doc drill-in.

  • Today's /docs/[id] inspector is the core surface — polished.
  • Need: navigate within a doc (page-by-page, section-by-section).

Retrieval trace surface inside chat.

  • The color-coded "which index returned this chunk" visualization is a first-class part of the chat UX, not a separate playground page. Trust comes from legibility.

Empty states + failure UX.

  • No KBs yet → routed to upload page
  • KB has no docs → empty state with "Upload your first document" CTA
  • Ingestion failures → per-doc error display, retry, partial-success
  • LLM returns invalid JSON → automatic Sonnet retry, fall through to "needs human review" tier

Branding/visual design: Claude Design handles the visual layer (typography, color, layout polish) in parallel.

Design questions to refine:

  • (a) Right-rail sources or different layout? Initial lean is right-rail (matches QuoteAI's rail pattern, keeps chat as focal column). Iterate until it feels right.
  • (b) Async ingestion: poll or SSE? Probably polling for v1 (one less moving part); upgrade to SSE if it feels slow.
  • (c) Bulk upload of related docs. If user drops 50 files, treat as 50 separate docs or one logical doc with 50 sections? Probably 50 separate docs.

Dependencies: none — parallelizable with the extraction spike. Out of scope (Deployed MVP or later): onboarding for strangers, sample KB / try-without-uploading, per-KB ACLs, mobile responsiveness.

2. MCP over SSE + OAuth

Problem. Current MCP is stdio (@autri/mcp-doc-search). Stdio is local-only — won't survive deploy. The strategic positioning ("be the MCP server they consume") demands hosted SSE. Per D5 pruned-note: stdio MCP is planned for retirement in favor of SSE+OAuth.

MVP needs:

  • SSE transport for the doc-search server (retire stdio)
  • OAuth 2.0 flow for token issuance — user clicks "Connect Claude / Cursor / Whatever" → consent → token issued
  • Token scope model: per-token list of allowed KB IDs + allowed tools
  • Token management UI in-app: list active, revoke, re-scope
  • Lift E12 wholesale from Foundry — already designed there
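
The token scope model can be sketched as a single authorization check run before each tool call. This is a sketch under assumptions: `McpTokenScope`, `McpToolCall`, and `authorizeToolCall` are hypothetical names, and the `readOnly` flag anticipates design question (c) rather than reflecting a decision.

```typescript
interface McpTokenScope {
  allowedKbIds: string[];
  allowedTools: string[];
  readOnly: boolean; // assumption: read-only vs. write is still an open question
}

interface McpToolCall {
  tool: string;
  kbId?: string; // absent for KB-agnostic tools such as list_knowledge_bases
  writes: boolean;
}

interface ScopeDecision {
  allowed: boolean;
  reason: string;
}

// Deny by default: the tool, the KB (when one is targeted), and the
// read/write mode must all be covered by the token's scope.
function authorizeToolCall(scope: McpTokenScope, call: McpToolCall): ScopeDecision {
  if (scope.allowedTools.indexOf(call.tool) === -1) {
    return { allowed: false, reason: `tool ${call.tool} is not in the token scope` };
  }
  if (call.kbId !== undefined && scope.allowedKbIds.indexOf(call.kbId) === -1) {
    return { allowed: false, reason: `KB ${call.kbId} is not in the token scope` };
  }
  if (call.writes && scope.readOnly) {
    return { allowed: false, reason: "token is read-only" };
  }
  return { allowed: true, reason: "ok" };
}
```

Returning a reason string alongside the decision gives the token management UI something to display when a connected client is refused.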

Design questions to refine:

  • (a) Local dev OAuth: real Cognito dev pool, or stub? Lean: real Cognito dev pool — normalizes the auth pathway from day one (~1-2 day setup), no two-pathway drift. Reuses primitive between Functional and Deployed MVP.
  • (b) Tool surface on the MCP server. Add list_knowledge_bases and list_documents to support D17 (hybrid agent + KB selection). Figure access is v1.5.
  • (c) Per-tool authorization. Read-only vs. write tokens? Probably yes — defaults to read-only.

Dependencies: local OAuth setup (real Cognito or stub) for issuance. Out of scope: rate-limiting per-token (Deployed MVP), audit log surface, OAuth client registration UI.

3. Multi-doc-type extraction (spike-and-iterate)

Problem. Today's extractor is PDF-only (text-layer first, with a vision fallback). DOCX/XLSX/MD need different parsing approaches. Figures/diagrams in PDFs are bbox-overlaid but lack semantic content.

Approach: spike first, design after. Per process.md: "Make design decisions during implementation as they surface — that's when the real constraints are visible." Trying to design the perfect agent.md schema before we've shipped DOCX is exactly the kind of design-before-real-constraints the methodology warns against.

Spike plan:

  1. Refactor current PDF extractor into extractors/pdf/ — defines what a doc-type extractor IS structurally.
  2. Build DOCX as the second type — validates that the abstraction holds.
  3. Iterate against the test corpora (STEM Racing + Dad's quotes) — what works, what doesn't, what cross-type abstraction emerges.
  4. Once two types ship, write up the abstraction that actually emerged (not the one we guessed). That writeup becomes the proper design doc.
  5. Then expand to Markdown/plain text and XLSX.

Source types in priority order:

  1. PDF (refactor existing → first source-type-specific extractor)
  2. DOCX (validates abstraction; key for author segment + Dad's templates)
  3. Plain text + Markdown (trivial off DOCX; covers dev/prosumer)
  4. XLSX (Dad's quote spreadsheets — exercises a totally different shape than text-flow docs)

Agent-validation pipeline stage (new).

Today's pipeline: extract → finalize (compute confidence tier) → human review.

Proposed new stage between extract and human review: agent validation.

  • LLM reads extracted chunks against the source pages
  • Flags suspicious chunks for human review (hallucinated content, structurally wrong, semantically off)
  • Auto-approves high-confidence chunks
  • Tightens the human-review loop — humans only see what needs human judgment

This complements D11 (operations-based extractor → verbatim text by construction) by catching errors in semantic chunking even when the text is verbatim.
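
As a rough illustration of the routing half of this stage (the validator prompt itself is left to the spike), a TypeScript sketch of triaging validator verdicts. The verdict shape, the flag names, and the 0.9 auto-approve threshold are all hypothetical placeholders.

```typescript
// Flag names mirror the three failure modes listed above; the shape is hypothetical.
type ValidatorFlag = "hallucinated" | "structurally_wrong" | "semantically_off";

interface ValidatorVerdict {
  chunkId: string;
  confidence: number; // validator's 0..1 confidence that the chunk is correct
  flags: ValidatorFlag[];
}

interface TriageResult {
  autoApproved: string[];
  needsHumanReview: string[];
}

const AUTO_APPROVE_MIN = 0.9; // hypothetical threshold, to be tuned during the spike

// Auto-approve only chunks that are both high-confidence and unflagged;
// everything else goes to the human-review queue.
function triageVerdicts(verdicts: ValidatorVerdict[]): TriageResult {
  const result: TriageResult = { autoApproved: [], needsHumanReview: [] };
  for (const verdict of verdicts) {
    const clean = verdict.flags.length === 0 && verdict.confidence >= AUTO_APPROVE_MIN;
    if (clean) result.autoApproved.push(verdict.chunkId);
    else result.needsHumanReview.push(verdict.chunkId);
  }
  return result;
}
```

Any flag forces review regardless of confidence, so a confident but flagged chunk is never silently approved.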

Figures/pictures in PDFs (sub-problem):

  • Today: figures are bbox-overlaid; chunks referencing them have no semantic content of their own
  • Proposed: describe_figure operation per figure region → Haiku vision generates text description → stored as chunk_type = 'figure_description', embedded with surrounding text context
  • Cost: ~$0.0003 per figure (tractable)
  • Spike-test this alongside the doc-type work

Open during the spike (not pre-decided):

  • agent.md schema (structured vs. freeform)
  • DOCX chunking strategy (paragraph-bound, heading-bound, hybrid)
  • XLSX semantics (named ranges, detected tables, cell-level)
  • Figure description embedding (text-only with context vs. multi-modal CLIP-style — text-only likely right)

Dependencies: test corpora (already have). Out of scope: OCR for image-only PDFs (no text layer), HTML/URL ingestion, audio/video.

4. KB update flow

Problem. Today: manual one-doc-at-a-time CLI ingest. Authors and legal teams accumulate docs continuously; they need a recurring update path. (Manual upload UI lands as part of item 1; the more nuanced version-detection lives here.)

MVP needs:

  • Manual upload covers Functional MVP (drag-drop in item 1's surface)
  • D15 version-detection heuristic on upload: filename + title + structural overlap flags a likely new version of an existing doc (the supersede itself is confirmed by the user, per the next bullet)
  • Update vs. supersede UX — when uploading a new doc that matches an existing logical-name, surface "looks like a new version of X — supersede?" prompt; don't auto-supersede silently
  • Re-extract trigger — button on doc inspector to re-run extraction with the current extractor version (useful as the extractor improves during the spike)
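
A minimal TypeScript sketch of a filename + title + structural-overlap score for the supersede prompt. D15's actual heuristic is designed in Foundry and not reproduced here; the weights (0.3 / 0.3 / 0.4), the 0.6 prompt threshold, and all names (`DocFingerprint`, `versionMatchScore`, `shouldPromptSupersede`) are illustrative assumptions.

```typescript
interface DocFingerprint {
  fileName: string;
  title: string;
  sectionIds: string[]; // e.g. headings or rule numbers such as "C7.6.2"
}

// Strip the extension, a trailing version suffix, and punctuation so that
// "Tech_Regs_v2.pdf" and "Tech_Regs.pdf" normalize to the same string.
function normalizeName(name: string): string {
  return name
    .toLowerCase()
    .replace(/\.[a-z0-9]+$/, "")
    .replace(/[\s_-]*(v|ver|version|rev)?[\s_.-]*\d+$/, "")
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

// Jaccard overlap of two ID lists, ignoring duplicates.
function jaccard(a: string[], b: string[]): number {
  const uniqueA = a.filter((id, i) => a.indexOf(id) === i);
  const uniqueB = b.filter((id, i) => b.indexOf(id) === i);
  const shared = uniqueA.filter((id) => uniqueB.indexOf(id) !== -1).length;
  const unionSize = uniqueA.length + uniqueB.length - shared;
  return unionSize === 0 ? 0 : shared / unionSize;
}

// Hypothetical weighting: 0.3 filename, 0.3 title, 0.4 structural overlap.
function versionMatchScore(incoming: DocFingerprint, existing: DocFingerprint): number {
  const sameName = normalizeName(incoming.fileName) === normalizeName(existing.fileName) ? 1 : 0;
  const sameTitle =
    incoming.title.trim().toLowerCase() === existing.title.trim().toLowerCase() ? 1 : 0;
  const overlap = jaccard(incoming.sectionIds, existing.sectionIds);
  return 0.3 * sameName + 0.3 * sameTitle + 0.4 * overlap;
}

const SUPERSEDE_PROMPT_MIN = 0.6; // hypothetical threshold

// The score only decides whether to show the "supersede?" prompt;
// it never supersedes on its own.
function shouldPromptSupersede(score: number): boolean {
  return score >= SUPERSEDE_PROMPT_MIN;
}
```

With these placeholder weights, a filename and title match alone reaches the threshold, so a heavily rewritten new version still triggers the prompt.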

Design questions:

  • (a) Manual confirm vs. auto-supersede. Lean: confirm. An unnecessary prompt costs less than a silent, wrong overwrite.
  • (b) Bulk re-extract. If extractor improves significantly mid-spike, offer "re-extract all docs in this KB"? Probably yes as a doc-level action; surface in the inspector.

Dependencies: D15 version-detection heuristic needs implementation (designed in Foundry, not built). Out of scope (post-MVP / Deployed MVP): Google Drive folder sync, Dropbox sync, SharePoint, S3 bucket, Git repo sync, scheduled refreshes, webhook-driven ingest.

5. File diff mechanism

Problem. When v2 of a doc is uploaded, show "what changed since v1." Schema is in (D15); algorithm + UI are not.

Approach: chunk-level structural diff, not git diff.

Git-diff is the wrong primitive here. Git diffs are line-based on text; our content is chunk-based with embeddings and bboxes; PDF binaries don't diff. Instead:

  • For each new version, run version-detection heuristic (D15) → match to existing logical_document
  • For each matched logical doc, run chunk-level diff:
    • For each new chunk, find nearest neighbor in prior version (cosine similarity + section ID match)
    • High similarity (>0.95, same section ID) → unchanged
    • Medium → changed (render side-by-side text diff)
    • No match → added
    • Old chunks unmatched → removed
  • Surface in /kb/[id]/doc/[id]/diff?compare=v1,v2, with sections color-coded green (added), yellow (changed), and red (removed)
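
The classification steps above can be sketched in TypeScript as follows. The 0.95 threshold is from the list; the 0.6 floor for "medium", the greedy one-to-one matching, and all type and function names are assumptions made for illustration.

```typescript
type ChunkChange = "unchanged" | "changed" | "added" | "removed";

interface VersionedChunk {
  id: string;
  sectionId: string;
  embedding: number[];
}

interface ChunkDiffEntry {
  change: ChunkChange;
  newChunkId?: string;
  oldChunkId?: string;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const UNCHANGED_MIN = 0.95; // from the spec
const CHANGED_MIN = 0.6; // hypothetical floor for a "medium" match

// Greedy matching: each new chunk claims its most similar unclaimed old chunk.
function diffChunks(oldChunks: VersionedChunk[], newChunks: VersionedChunk[]): ChunkDiffEntry[] {
  const entries: ChunkDiffEntry[] = [];
  const matchedOld: string[] = [];
  for (const next of newChunks) {
    let best: VersionedChunk | null = null;
    let bestSim = -1;
    for (const prev of oldChunks) {
      if (matchedOld.indexOf(prev.id) !== -1) continue;
      const sim = cosineSimilarity(prev.embedding, next.embedding);
      if (sim > bestSim) {
        best = prev;
        bestSim = sim;
      }
    }
    if (best === null || bestSim < CHANGED_MIN) {
      entries.push({ change: "added", newChunkId: next.id });
      continue;
    }
    matchedOld.push(best.id);
    const unchanged = bestSim > UNCHANGED_MIN && best.sectionId === next.sectionId;
    entries.push({
      change: unchanged ? "unchanged" : "changed",
      newChunkId: next.id,
      oldChunkId: best.id,
    });
  }
  // Old chunks that no new chunk claimed were removed in the new version.
  for (const prev of oldChunks) {
    if (matchedOld.indexOf(prev.id) === -1) {
      entries.push({ change: "removed", oldChunkId: prev.id });
    }
  }
  return entries;
}
```

Greedy matching is order-dependent and quadratic in chunk count; it is enough to exercise the UI, but an optimal assignment or a section-scoped search may be worth a look if mismatches show up on real regs.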

Git-diff IS useful for the content of a single matched chunk — line-by-line git-style diff for inner-chunk rendering. Git-diff for inner-chunk view, structural diff for inter-chunk view.
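
For the inner-chunk view, a small word-level diff is enough to start. The TypeScript sketch below uses a textbook longest-common-subsequence table, which is a swapped-in technique and not git's own diff algorithm; `DiffToken` and `wordDiff` are hypothetical names. A line-level variant would split on newlines instead of whitespace.

```typescript
interface DiffToken {
  op: "same" | "del" | "ins";
  text: string;
}

// Word-level diff via a longest-common-subsequence table.
function wordDiff(oldText: string, newText: string): DiffToken[] {
  const a = oldText.split(/\s+/).filter((w) => w.length > 0);
  const b = newText.split(/\s+/).filter((w) => w.length > 0);

  // lcs[i][j] = length of the LCS of a[i..] and b[j..]
  const lcs: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    const row: number[] = [];
    for (let j = 0; j <= b.length; j++) row.push(0);
    lcs.push(row);
  }
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j] ? lcs[i + 1][j + 1] + 1 : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table from the start, emitting deletions before insertions on ties.
  const out: DiffToken[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push({ op: "same", text: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      out.push({ op: "del", text: a[i] });
      i++;
    } else {
      out.push({ op: "ins", text: b[j] });
      j++;
    }
  }
  while (i < a.length) out.push({ op: "del", text: a[i++] });
  while (j < b.length) out.push({ op: "ins", text: b[j++] });
  return out;
}
```

The table is O(n·m) in words, which is fine for a single chunk and is one more reason to keep this diff inside matched chunks rather than across whole documents.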

Build sequence:

  1. Version-detection heuristic on upload (part of item 4)
  2. Chunk-level diff algorithm
  3. /diff UI

Design questions:

  • (a) Diff scope. Just v_prev vs v_current, or arbitrary version-pair (v1 vs v5)? Lean: arbitrary-pair.
  • (b) Inner-chunk rendering. Word-level vs. line-level diff? Probably both — toggle in UI.

Dependencies: item 4 (need multiple versions to diff); item 1 (UI home for the diff view). Out of scope: three-way merge, branch-based collaboration, semantic-change classification.

Sequencing — order of attack

Most items parallelize cleanly.

Parallel track A — Visual design (Claude Design): UI/UX visual design for item 1 — chat pattern, KB management, ingestion pipeline view, doc drill-in.

Parallel track B — Extraction spike (Dan): Item 3 spike-and-iterate against test corpora. PDF refactor first, then DOCX. Iterate. Write up emergent abstraction.

Parallel track C — MCP foundation (whenever): Item 2 — independent of the others. Probably wait until Cognito dev pool setup is convenient anyway.

Serial after tracks A + B converge:

  • Item 1 implementation (UX flow) — uses both the visual design and the now-shipped multi-doc-type extractor
  • Item 4 (KB update flow) — needs item 1's upload UI shipped
  • Item 5 (file diff) — needs item 4's multiple-version state

What's NOT in Functional MVP

Explicitly cut from Functional MVP:

Moved to Deployed MVP:

  • Auth + multi-tenancy enforcement (RLS policies, scope-passing contract)
  • Onboarding for strangers (sample KB, "try without uploading," first-run tour)
  • AWS-native deploy (ECS, RDS, S3, CloudFront)
  • Bedrock prod path swap (D16)
  • Stripe + tier enforcement
  • Public landing page
  • Per-KB ACLs / sharing

Post-MVP (after Deployed MVP):

  • Quality baseline / regression-test loop (visual QA via inspector + chat is enough for Functional MVP)
  • Auto-sync integrations (Google Drive, Dropbox, SharePoint, S3, Git)
  • Data export / KB dump
  • HTML/URL ingestion
  • OCR for image-only PDFs
  • Multi-modal embeddings beyond text
  • Pro tier, Studio tier, Enterprise features
  • Hover popovers on source chips, reciprocal rank fusion, pgvector tuning beyond lists=20

Success criteria — how do we know Functional MVP is done?

Concrete behaviors that count as success:

  1. Inspector trust. Dan drops a fresh 50-page PDF, watches the ingestion pipeline view, and trusts the extraction without manual chunk review. Color-coded retrieval traces explain every chunk surfaced in chat.
  2. Multi-doc-type breadth. PDFs, DOCX, MD/plain text work cleanly. XLSX is a stretch goal but probably done given Dad's spreadsheets.
  3. Hosted MCP works. Dan connects Claude Desktop / Cursor / Claude Code to a local Autri MCP server via OAuth, scopes a token to one KB, and runs successful agent queries.
  4. STEM Racing kids cohort. At least one kid imports the regs (or uses the existing KB) and runs 5 creative-interpretation queries with correctly cited sources, without Dan's intervention.
  5. Dad cohort. Dad's spreadsheets and raw PDFs both ingest cleanly; he runs at least 3 "what did we quote for similar scenarios?" queries that return useful results (validates the cross-product compound thesis).
  6. Versioning + diff works. Dan re-uploads a STEM Racing reg with edits and the diff view shows the changes correctly across both inter-chunk and inner-chunk views.

If 5 of 6 land, Functional MVP is done. Move to Deployed MVP.

Open questions / refinement areas

Highest-leverage refinement areas across items:

  1. The right-rail-vs-other-layout decision (item 1) — drives the rest of the navigation model.
  2. Agent-validation prompt design (item 3) — what does the validator look for? Spike will tell us.
  3. Local OAuth approach (item 2) — drives how much Deployed-MVP foundation we accidentally land in Functional MVP.

These three set the constraints for everything else.

Next steps

  1. Refine this doc together (this session + future sessions). Surface design questions as we hit them; iterate.
  2. Once H2s are tight, spin out epic docs per item: projects/autri/epics/e1-ui-ux.md, e2-mcp-sse-oauth.md, e3-multi-doc-type.md, e4-update-flow.md, e5-file-diff.md.
  3. Each epic gets implementation-grade detail before any feature branch opens.
  4. Claude Design takes UI/UX visual design as a parallel track.
